
Mining Twitter for Training Language Models [DRAFT]

In the last couple of days, a friend of mine from UCI and I have started a new project, one very different from my usual work! I have pretty much zero knowledge of natural language processing (NLP) and I wanted to change that once and for all. So I joined my friend in an empirical study of social networks.

There are lots of courses and books about modern NLP powered by neural networks, but almost all of them target the English language (or rather, none of them target my native language). So I decided to collect my own database of Persian text and start experimenting on it. Twitter is an amazing source of such text, and more importantly, with the official API, mining it is easy and convenient.

Capturing Stream of Tweets with Tweepy

There are multiple Python libraries for interacting with the Twitter API; my choice is Tweepy (disclaimer: it was a random first choice). Setting it up is pretty easy: you only need to get the consumer_key, consumer_secret, key, and secret parameters from Twitter's developer website.

After getting those values, you can set up Tweepy like this:

import json
import tweepy as tw

auth = tw.OAuthHandler("consumer_key", "consumer_secret")
auth.set_access_token("key", "secret")

api = tw.API(auth, wait_on_rate_limit=True)

There is a Streaming API that delivers a detailed JSON object for each tweet matching our query. After receiving the JSON object, you can push it into a database or, like me, write a simple dump-to-disk function. We've collected more than 6 million tweets using the following code. It is simple and it works.

class MyStreamListener(tw.StreamListener):
  def __init__(self, api):
    self.api = api
    self.counter = 0
    self.listOfTweets = []
    super(MyStreamListener, self).__init__()

  def dump_on_disk(self):
    with open('tweets_stopwords_db.txt', 'a') as f:
      for t in self.listOfTweets:
        f.write(t + "\n\n")
    self.listOfTweets = []

  def on_status(self, status):
    self.counter += 1

    # keep the raw JSON of the tweet
    tj = json.dumps(status._json)
    self.listOfTweets.append(tj)

    # flush the buffer to disk every 2000 tweets
    if len(self.listOfTweets) % 2000 == 0:
      print("Dumping on Disk {}".format(self.counter))
      self.dump_on_disk()
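For later analysis we'll need to read the dump back. A minimal sketch, assuming the file layout written by dump_on_disk above (one JSON-encoded tweet per record, records separated by a blank line); the function name is mine:

```python
import json

def load_tweets(path):
  """Read a dump file written by MyStreamListener.dump_on_disk:
  one JSON-encoded tweet per record, records separated by a blank line."""
  tweets = []
  with open(path, encoding='utf-8') as f:
    for chunk in f.read().split('\n\n'):
      chunk = chunk.strip()
      if chunk:
        tweets.append(json.loads(chunk))
  return tweets
```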

Unfortunately, Twitter's API has one annoying limitation: you can't query all tweets in a particular language. To work around it, in our first run we collected over 2 million tweets using a query containing 15 words which we believed were important. From these 2 million tweets, we extracted the most commonly used words, which gave us this:

words = ['به', 'از', 'که', 'در', 'این', 'را', 'با', 'است', 'رو', 'هم', 'برای', 'تو', 'ما']
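The exact filtering we used isn't shown here, but the idea, counting word frequencies over the first batch of tweet texts and keeping the top hits, can be sketched like this (the helper name and cut-off are mine):

```python
from collections import Counter

def top_words(texts, n=15):
  """Return the n most frequent whitespace-separated tokens across
  a collection of tweet texts -- candidate tracking keywords."""
  counter = Counter()
  for text in texts:
    counter.update(text.split())
  return [word for word, _ in counter.most_common(n)]
```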

Now in our second run, we queried for these most common words. Once in a while, we received a network-related error. After digging through Tweepy's GitHub issues, we ran Tweepy inside a guarded loop, which automatically resumes a crashed stream.

from ssl import SSLError
from requests.exceptions import Timeout, ConnectionError
from urllib3.exceptions import ReadTimeoutError

myStreamListener = MyStreamListener(api)
myStream = tw.Stream(auth=api.auth, listener=myStreamListener)

# create a zombie: restart the stream whenever it crashes
while not myStream.running:
  try:
    # start stream synchronously
    print("Started listening to twitter stream...")
    myStream.filter(languages=['fa'], track=words)
  except (Timeout, SSLError, ReadTimeoutError, ConnectionError) as e:
    print("Network error occurred. Keep calm and carry on.", str(e))
  except Exception as e:
    print("Unexpected error!", e)
    print("Stream has crashed. System will restart twitter stream!")
print("Somehow zombie has escaped...!")

Do some plotting

Before jumping into the NLP part, and after collecting some data, it is always nice to plot some aspects of the collected data. The first thing that comes to mind is a heatmap of Twitter activity in each hour. Doing this for all of your tweets, regardless of whether a tweet is an original tweet or a retweet, gives an interesting chart of what hour people go to sleep.
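The chart itself is rendered with D3 (see the links at the end), but the per-hour counts behind it can be computed in Python. A minimal sketch, assuming each tweet's created_at field uses Twitter's standard timestamp format; the function name is mine:

```python
from collections import Counter
from datetime import datetime

def hourly_activity(created_at_strings):
  """Bucket tweets into (weekday, hour) cells -- the raw counts
  behind a day/hour activity heatmap."""
  counts = Counter()
  for s in created_at_strings:
    # Twitter timestamps look like "Wed Oct 10 20:19:24 +0000 2018"
    dt = datetime.strptime(s, "%a %b %d %H:%M:%S %z %Y")
    counts[(dt.weekday(), dt.hour)] += 1
  return counts
```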

TBC ..

  1. Tweepy’s Github Page
  2. D3.js
  3. D3 Heatmap Tutorial
  4. D3 - Day / Hour Heatmap

Image Credits

  1. Background vector created by starline -