Data mining Twitter for sentiments about the new consoles

Howdy. I’m writing a series of front-page articles for Tom about mining Twitter. The idea is to look at generalized sentiments in the Twittersphere related to the PS4 and Xbox One, and how those sentiments develop over the holiday season. The first post goes live on the homepage today (11/15/2013). In it, I say that I’ll post code here so that others can follow along or create their own Twitter analysis projects. So, without further ado, here we go…

I’m writing this as a guide for Windows, but you can do all of this on a Mac. Just be careful, if you develop on a Mac, not to break your Homebrew setup by installing another Python on top of it – though if you develop on a Mac, you probably already know enough to avoid that kind of problem.

[B]Capturing a stream of Tweets

Overview
[/B]I wrote a quick script to capture a stream of tweets, through the official Twitter API, related to the Xbox One, PS4, and WiiU (trying to be fair, here…). To run this type of script on your machine, you need:

[ul]
[li]A Twitter account. Any account will do – even a throwaway one. Just go to Twitter.com to register one, or use your existing one.[/li][li]To register your Twitter account as a developer account. This is free.[/li][li]Python installed on your computer.[/li][/ul]

Registering as a Twitter Developer

[ul]
[li]Go to http://dev.twitter.com and log in with your Twitter account.[/li][li]Click on your profile image, in the upper-right, and select “My Applications.”[/li][li]Click “Create a new application.”[/li][li]Fill out the information. You can use any name or website that you want, but the name has to be unique. I use something like clay_streaming_python_test. Nobody will see this name but you, so it doesn’t really matter what it is. Leave the Callback URL blank. Agree to the terms and then click “Create a twitter application.”[/li][li]You’ll be taken to the Details tab of your application’s page. Scroll to the bottom and click “Create my access token.” Wait about 30 seconds and then refresh the page. Your access token should appear at the bottom.[/li][li]Create a .txt file on your computer and copy the Consumer key, Consumer secret, Access token, and Access token secret for your application into that file, for future reference. Don’t share them with anybody.[/li][/ul]

Note that the Twitter developer documents are good. You can access them through the same site where you registered as a dev.

Get a code editor
You’ll need a good text/code editor – one that provides Python syntax highlighting. I use and suggest Sublime Text 2, which has an unlimited trial (that will nag you from time to time). Other people like Notepad++. If you’re on a Mac, try TextMate 2, which is free and good.

Get Python
You need to get Python on your machine. If you already have it and know how to use it, then skip this section. If you don’t know whether you have it, click your Start button and type PowerShell to launch the enhanced shell app on Windows. Type into it:

python --version

If you see a bunch of red text, then you either don’t have Python or don’t have it in your PATH. If you think you have it, then add it to your PATH (Google will tell you how).

If you need to install Python, realize that there are a lot of different distributions of it and that version 3 isn’t entirely backwards compatible with version 2. There are tons of libraries for version 2 that are stable and good, so I recommend installing version 2. More specifically, I recommend downloading and installing the open source Anaconda distribution of Python from the Anaconda website. Make sure you get the version that matches your operating system (32 bit, 64 bit, etc.). The Anaconda package comes with a lot of libraries that are useful for data analysis.

Run your Python installer. If you have PowerShell open, quit and relaunch it.

Install the Twython package for Python
This guide uses the Twython package to connect to the Twitter API from Python. There are some useful examples in the Twython GitHub repository, and the official Twython documentation covers everything else. There are many ways to access the Twitter API through Twython – this guide only uses the TwythonStreamer.

  1. Launch PowerShell.
  2. Make sure that your Python installation worked by typing python --version at the prompt.
  3. Type the following command to install the Twython package:

easy_install twython

  4. After the install, check that it worked by typing python at the PowerShell prompt. The interactive Python shell should launch.
  5. Type import twython at the >>> prompt. If nothing happens and you see another >>>, then you are good to go.
  6. You can type quit() at the >>> prompt to get out of interactive Python.

If you have errors, you can post them in this thread and I’ll try to help.

Save the script
Launch Sublime Text (or whatever code editor you are using) and create a new document. Save it as twit_stream.py somewhere you can find it easily. I created a folder on my desktop called “streaming” and saved it in there.

Now… there are 1000 ways to approach this. I wrote this script quickly and it is a little disorganized, but it works. If you’re a python coder and/or developer and my code gives you hives, then feel free to modify it or post a better version.

Copy this code and put it in your twit_stream.py file and save.


# Import the libraries that we need
from twython import TwythonStreamer
import json
import csv
import re

# These are the keys for your twitter application, as discussed earlier in the tutorial
APP_KEY            = 'INSERT YOUR TWITTER CONSUMER KEY HERE'
APP_SECRET         = 'INSERT YOUR TWITTER CONSUMER SECRET HERE'
OAUTH_TOKEN        = 'INSERT YOUR TWITTER ACCESS TOKEN HERE'
OAUTH_TOKEN_SECRET = 'INSERT YOUR TWITTER ACCESS TOKEN SECRET HERE'

# Prompt the user for the terms to track
track_terms = raw_input("What terms do you want to track? (separate multiple with a comma): ")

# Prompt the user for how many tweets they want to keep
keep_tweets = int(raw_input("How many tweets do you want to keep? (-1 for unlimited): "))

# Adjust the input number
if keep_tweets < 0:
    keep_tweets = 999999999
elif keep_tweets == 0:
    keep_tweets = 10

# This will cause the script to keep only english language tweets
# Change to 'all' to keep all tweets regardless of language
keep_lang = 'en'

# Counter for keeping track how many tweets we've saved
counter = 0

# Variable to track whether we've written the header to the CSV file
header_done = False

# Variable to use to name sequential files full of tweets
file_name_suffix = 0

# Prompt the user for how many tweets they want per sequential file
tweets_per_file = int(raw_input("How many tweets do you want to save per file? "))

if tweets_per_file <= 0:
    tweets_per_file = 50000

# This class will process incoming tweets and is called from MyStreamer
# in the on_success() method
class TweetMonkey:
    # Remove some nasty characters that can break the CSV
    def clean(self,text):
        text = text.replace("
","; ")
        text = text.replace('"', "'")
        text = text.replace(','," ")
        return text

    # Method to create the CSV header in each file
    def create_header(self):
        global file_name_suffix

        header = []
        header.append("id")
        header.append("lang")
        header.append("user_name")
        header.append("tweet")
        header.append("retweeted")
        header.append("favorite_count")
        header.append("source")
        header.append("in_reply_to_status_id")
        header.append("in_reply_to_screen_name")
        header.append("in_reply_to_user_id")
        header.append("possibly_sensitive")
        header.append("geo")
        header.append("created_at")

        # Write the header to the file
        tweets = open("tweets_" + str(file_name_suffix) + ".csv", 'ab+')
        wr     = csv.writer(tweets, dialect='excel')
        wr.writerow(header)
        tweets.close()

    # This is the method that does the heavy lifting for processing a tweet
    # and putting it into the CSV file
    def process(self, tweet):
        global header_done
        global file_name_suffix
        global counter
        global tweets_per_file

        if counter % 1000 == 0:
            print counter, "tweets processed..."

        # Increment the file name 
        if counter % tweets_per_file == 0:
            file_name_suffix += 1
            header_done = False # reset so that each new file gets its own header row

        if not header_done:
            self.create_header()
            header_done = True

        # Create the file or append to the existing
        theOutput = []

        theOutput.append(           tweet['id'])
        theOutput.append(           tweet['lang'].encode('utf-8'))

        # There is redundant scrubbing of the username because I was
        # having trouble removing all of the \n characters
        uname = tweet['user']['name'].encode('utf-8', 'replace')
        newuname = re.sub('\n', '', uname)
        theOutput.append(           newuname)

        # There is redundant scrubbing of the tweet because I was
        # having trouble removing all of the \n characters
        twt = self.clean(tweet['text']).encode('utf-8', 'replace')
        newtwt = re.sub('\n', '', twt)
        theOutput.append(newtwt)
        
        theOutput.append(           tweet['retweeted'])
        theOutput.append(           tweet['favorite_count'])
        theOutput.append(self.clean(tweet['source']).encode('utf-8', 'replace'))
        theOutput.append(           tweet['in_reply_to_status_id'])
        theOutput.append(           tweet['in_reply_to_screen_name'])
        theOutput.append(           tweet['in_reply_to_user_id'])

        if tweet.get('possibly_sensitive'):
            theOutput.append(       tweet['possibly_sensitive'])
        else:
            theOutput.append("False")

        if tweet['geo'] is not None:
            if tweet['geo']['type'] == 'Point':
                lat = str(tweet['geo']['coordinates'][0]) + " "
                lon = str(tweet['geo']['coordinates'][1])
                theOutput.append(lat + lon)
            else:
                theOutput.append(tweet['geo'])
        else:
            theOutput.append(tweet['geo'])
        theOutput.append(tweet['created_at'])

        # Write the tweet to the CSV File
        tweets = open("tweets_" + str(file_name_suffix) + ".csv", 'ab+')
        wr     = csv.writer(tweets, dialect='excel')
        wr.writerow(theOutput)
        tweets.close()

# This is the subclass of TwythonStreamer that handles incoming tweets
class MyStreamer(TwythonStreamer):
    # Do this if the tweet is successfully captured
    def on_success(self, data):
        global counter
        global keep_lang
        global keep_tweets
        if 'text' in data:
            if keep_lang == 'all' or data['lang'] == keep_lang:
                # Uncomment this if you want to keep the JSON in a single file
                # g = open("games.json", "ab+")
                # json.dump(data,g)
                # g.write("\n")
                # g.close()

                # Keep the CSV
                counter += 1
                writer   = TweetMonkey()
                writer.process(data)

        # Disconnect when we have the number of requested tweets
        if counter >= keep_tweets:
            self.disconnect()
            print "All done."

    # Do this if there's an error with the tweet
    def on_error(self, status_code, data):
        print "There was an error:
"
        print status_code, data

# Create an instance of the MyStreamer class 
stream = MyStreamer(APP_KEY,APP_SECRET,OAUTH_TOKEN,OAUTH_TOKEN_SECRET)

# Tell the instance of the MyStreamer class what you want to track
stream.statuses.filter(track=track_terms)

Make sure you put your credentials from your Twitter dev account/app into the script. After saving it, hop into PowerShell and navigate to the directory where you saved it. To run the script, you just have to type python and the name of the script. For instance, I would type:

python twit_stream.py

You’ll be prompted for how many tweets you want to save, how many per file, and what search terms you want to use for the stream.

For search terms, there are a few rules.

A comma acts like an inclusive “OR” statement. If you have “ps4,xbox” then you will get a stream with tweets that either contain ps4 or xbox or both terms.

If you use a phrase, then only tweets containing all of the words in that phrase – in any order – will match that term. For instance, if you track “ps4,xbox one” then your stream will contain tweets that mention ps4, plus tweets that mention both xbox and one in any order (“… xbox one …”, “… one … xbox …”, and so on). The Twitter developer documents contain more information about how the streaming search terms work.
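To make those rules concrete, here’s a quick sketch (the terms are just examples; the filter call is the same one from the bottom of the script above):

# Hypothetical track strings and what they match:
#   "ps4"       -> tweets containing ps4
#   "ps4,xbox"  -> tweets containing ps4 OR xbox (or both)
#   "xbox one"  -> tweets containing BOTH xbox AND one, in any order
stream.statuses.filter(track="ps4,xbox one,wiiu")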

Watch and Wait
After launching the script, look in the folder where it lives. You’ll see a series of .csv files appear, depending on your input parameters.

Examining your spoils
Excel does not open .csv files as UTF-8 encoded by default. You are therefore likely to end up with jumbled characters if you just double-click to open these .csv files. Instead, launch Excel and find the option to “Import Text.” Set the delimiter to comma, mark the first row as a header, and set the encoding to UTF-8.

Personally, I prefer to open the .csv files in Sublime Text when I need to view their contents. In later parts of this series, I’ll import these .csv files into other applications for analysis.
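If you’d rather poke at the files programmatically, here’s a minimal sketch that prints the first few rows of a capture file (it assumes a file named tweets_1.csv sits in the current directory):

import csv

# Minimal sketch: print the first few rows of one capture file.
# Assumes tweets_1.csv exists in the current directory.
with open("tweets_1.csv", "rb") as f:
    reader = csv.reader(f, dialect='excel')
    for i, row in enumerate(reader):
        if i >= 5:
            break
        print row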

** I had a bit of trouble pasting in all of the text/code here. If you have trouble running the script, let me know and I’ll help you troubleshoot it.**

Hold for Part 3

And… after analyzing the first day of tweets following the PS4 release, the single most influential and retweeted tweet is from a member of boy band One Direction, Louis Tomlinson. Here’s the tweet in question:

I guess it’s a good argument for celebrity endorsements.

Clay, thanks for all of this. It’s been really interesting following this from the front page to here.

Also, One Direction are Twitter gods. Tweens love Twitter.

Thanks! It’s a fun little project. I haven’t dug into most of the data yet – I need to clean it some and set up a few variables for analysis before I can begin to report on sentiments, etc.

Nifty. Looking forward to seeing more of your reports!

Neat project! Looking forward to seeing what you can pull out of this. Are you going to be trying to auto-code any of the content, or will you be looking strictly at the metadata? (Apologies: I duped this on the FP article, too; do you have a preference?)

I’m fine with comments either place, but I guess it makes more sense for them to be here, since they’ll develop with the project.

As for the project, the focus of my analysis is going to be to isolate key phrases and sentiments, along with influential tweets, probably per hour of collected tweets. Basically, I’ll stem the words in the tweets and process them against a customized sentiment dictionary. With luck, we’ll see some changes in sentiment trends over time. I also plan to isolate the names of the games available at release (for the Xbone and PS4) and see if I can derive any generalized opinion about them based on the contents of the tweets. For the most part, this ignores the metadata. However, I haven’t yet really thought about what to do with the metadata. If you have any ideas, let me know!
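To give a flavor of the key-phrase end of that plan, here’s a rough sketch that counts the most common words in one capture file (the file name is an assumption, and the tweet text lives in the fourth column per the capture script above):

import csv
import re
from collections import Counter

# Rough sketch: tally the most common words in one capture file.
# The tweet text is the fourth column (index 3) per the capture script.
counts = Counter()
with open("tweets_1.csv", "rb") as f:
    reader = csv.reader(f, dialect='excel')
    next(reader)  # skip the header row
    for row in reader:
        counts.update(re.findall(r"[a-z0-9#@']+", row[3].lower()))

print counts.most_common(25)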

Thanks! This is awesome.

The first several million tweets are up on Google Drive. You can find them here:
https://drive.google.com/folderview?id=0Bzvk4in6r1jjWmJaemk4OTJnQ2c&usp=sharing

I’m going to write a script to clean/parse them a bit. My plan was to process them in R (primarily because it is open source), but since R loads everything into memory, I’m having trouble creating matrices for text analysis. They would take ~15 GB of RAM… so, I think I’m going to write a script to separate them into hourly files. SAS and Enterprise Miner will process them without a problem, but since those are very expensive (thanks, free academic license!), I’m going to try to avoid using them here.
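For what it’s worth, the hourly split itself needs almost no memory if you stream the rows. A minimal sketch, assuming the input file name and that created_at is the last column (as in the capture script):

import csv
from datetime import datetime

# Sketch: split one big capture file into one CSV per hour.
# created_at is the last column; Twitter formats it like
# "Sat Nov 16 02:32:01 +0000 2013".
with open("tweets_1.csv", "rb") as f:
    reader = csv.reader(f, dialect='excel')
    header = next(reader)
    for row in reader:
        ts = datetime.strptime(row[-1], "%a %b %d %H:%M:%S +0000 %Y")
        out_name = ts.strftime("tweets_%Y%m%d_%H.csv")
        with open(out_name, "ab") as out:  # reopening per row is slow but simple
            csv.writer(out, dialect='excel').writerow(row)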

Man, that is a lot of data. I worked mainly with SPSS, which was reasonably good at handling large data sets, but even so we’d do our cleaning and most of our coding in a SQL database and then export to SPSS to analyze and draw pictures.

I’m not even sure how I’d go about dealing with the spam! Exclude if tweet includes hashtag(s) not in a defined set? Or leave as is?

Cool stuff, Clay. It’s been a while since I’ve actually missed having familiar tools at hand to crunch this kind of stuff.

The nice thing about the Twitter streaming API is that it only delivers to you the terms that you include in the filter statement. While I’m sure there is some spam in there, it’s not enough to impact the analysis of the rest of the tweets. The larger issue, perhaps, is that A LOT of the tweets are retweets from other people. The way that I’m going to handle this is with a python script that looks for the substring of “RT” in a tweet and flags a variable as True if it is a retweet. Honestly, some of this stuff is in the original JSON that comes from Twitter, but I didn’t think to keep that flag when I originally converted to CSV. Live and learn! Data cleaning is part of the fun.
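The flagging itself is trivial – something like this hypothetical helper (I’m matching “RT @” rather than bare “RT” in this sketch, so that words like START don’t trip it):

# Hypothetical retweet flag: treat the classic "RT @user" marker
# as the signature of a manual retweet.
def looks_like_retweet(text):
    return "RT @" in text

print looks_like_retweet("RT @somebody: PS4 day one!")  # True
print looks_like_retweet("Picked up my PS4 today")      # False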

If you haven’t seen it, you might want to look at the Python NLTK library. I’m probably going to use the AFINN sentiment dictionary to rank the sentiment of tweets, though I also plan to flag racist tweets and try to track whether a tweet discusses particular games.
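For anyone playing along, scoring against AFINN only takes a few lines. A minimal sketch, assuming you’ve downloaded the AFINN-111 word list (one tab-separated word/score pair per line; the real list also contains a few multi-word phrases that this naive split will miss):

# Minimal AFINN scoring sketch. AFINN-111.txt holds one
# tab-separated "word<TAB>score" pair per line.
afinn = {}
with open("AFINN-111.txt") as f:
    for line in f:
        word, score = line.strip().split('\t')
        afinn[word] = int(score)

def sentiment(text):
    # Sum the scores of known words; unknown words count as zero.
    return sum(afinn.get(word, 0) for word in text.lower().split())

print sentiment("the ps4 launch lineup is pretty damn good")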

I wonder how the “new style” of RT will affect you here – the kind where users click to retweet and the entire tweet, original username and all, appears in their timeline and shows up for their followers (assuming a given follower hasn’t “turned off retweets from this user,” which is an option). As far as I know, the text “RT” doesn’t appear in these items, unless that was the JSON entry from Twitter you were talking about having dropped.

Anyway, just wanted to post mostly to say that I’m reading along and loving this. I’ve had a passing interest in programming since childhood, since it’s what my dad does for way more money than I make mostly ;), and it looked pretty cool! Thanks for posting :D

That’s an interesting question. Certainly a lot of the tweets contain the text “RT.” People seem to like to append a bit of text to a retweet. It also may depend on the client that you’re using for tweeting. I’ll look at the JSON again and modify the streaming script to catch the “official” record of a retweet. But since that won’t help me on the tweets I’ve already recorded, I’ll continue with the script I was writing to manually detect retweets.

A few other things that I need to catch – I need the number of followers for the user who posts each tweet, so that I can create a baseline measure of the tweet’s influence. Since these are coming in a stream, most of them don’t have a retweet count stored by the time they are pushed my way. It would be interesting to write a secondary processing script that went back and checked the number of retweets that each tweet received within its first 24 hours of life… Hmm… more to think about.
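Both of those fields are already in the streaming JSON, so the catch is small. A hypothetical helper along these lines (the function name is my own) pulls them out of a decoded tweet:

# Hypothetical sketch: pull the extra fields from a decoded tweet dict.
def influence_fields(tweet):
    # Button-press retweets carry a full 'retweeted_status' object
    # in the streaming JSON, so its presence is the official flag.
    is_retweet = 'retweeted_status' in tweet
    # The author's follower count, for a baseline influence measure.
    followers = tweet['user']['followers_count']
    return is_retweet, followers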

Yeah, I noticed the suspicious-looking “retweeted” variable that was uniformly populated with “FALSE”. :) From the first and last files, it looks like the RTs are running at about 35% of the total, which seems high vs. tweets globally. I wonder if that stays constant with traffic. I’m sure having that variable pre-populated in the stream will make things easier, and you won’t have to worry about anything falling through possible cracks in a recode script.

Would #RTs/#followers say anything interesting?

Ah! Interesting—yeah, I was curious what you’d be using to categorize and rank sentiment. Neat. Presumably you can tweak the AFINN dictionary as needed to encompass slang, idioms, alternate spellings, textspeak, etc.

Well, stuff comes and goes with these scripts. I came up with the idea for this project not long before the relevant time window, so I didn’t have much time to consider all of the nuances. The Twitter API rate limits the stream, but it should be a representative sample of the overall firehose. Hence, I suspect that the retweet rate is representative for these topics, though I agree that it seems a bit high. The always-false “retweeted” variable is a legacy of a former script from which I grabbed code. With the stream, it’s pretty useless, so I’ll yank it. I’m also cleaning up the “source” field to get rid of the HTML, and splitting the date into four numeric variables (year, month, day, hour), which will make group processing simpler. I’ll post the updated script later. For now, I might just catch the raw JSON for a day or so while I work on the script. I have a big exam this week, so my time is a little limited.
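If anyone wants to replicate those two recodes before the updated script goes up, here are hedged sketches of both (the regex and function names are my own choices, not necessarily what I’ll ship):

import re
from datetime import datetime

# Sketch: strip the anchor tag from Twitter's 'source' field, e.g.
# '<a href="...">Twitter for iPhone</a>' -> 'Twitter for iPhone'
def clean_source(source):
    return re.sub(r'<[^>]+>', '', source)

# Sketch: split Twitter's created_at string into numeric pieces.
def split_created_at(created_at):
    ts = datetime.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y")
    return ts.year, ts.month, ts.day, ts.hour

print clean_source('<a href="http://twitter.com/download/iphone">Twitter for iPhone</a>')
print split_created_at("Sat Nov 16 02:32:01 +0000 2013")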

Yeah, I can see how this could easily suck up a lot of time. I’d probably be wandering around in edge cases and optimization in a heartbeat. Fun, though!

Alright… just a little update. I wrote a script to do a bit of processing and sentiment analysis. Proper sentiment analysis of tweets is very difficult without a custom sentiment dictionary. (Later, when I have some time, I might try to make one…). Take this tweet, for instance:

@friskyxo: “@melipinkkk: about to get the PS4 hell yeah bitches” ihy” HATERS GON HATE

It’s a reply to a reply that includes misspellings, acronyms, etc., and it takes terms that normally score as negative (“hell”, “bitches”) and uses them in a positive way. It also uses HATE (negative sentiment) in a teasing manner.

Unfortunately, that’s not an outlier in terms of grammar or message. What to do, then? For a first pass, I’m not entirely sure. The script splits retweets from originals and saves tweets in files by the hour instead of 50k each. That way, I can easily produce hourly statistics, which makes for more interesting time series analysis. I’ll have to see what bubbles up when I look for common terms, popular tags, etc., before trying to develop a custom sentiment dictionary. The recoding script that I threw together is here (there’s some NSFW language as part of a filter). I’ll put the recoded files up on Google Drive when they’re done processing.