Data mining Twitter for sentiments about the new consoles

Yeah, that’s really tricky. Parsing a conversational exchange is fundamentally different from coding a statement. And here you have nested statements, where everything could be expressed with irony/sarcasm, further complicated by idiom.

Would it wreck your methodology to simply discard all replies that aren’t verbatim RTs?

Yeah, but then you gotta deal with subtweets! (Things which are–to those familiar with both the subtweeter and the subtweetee–obviously replies from one person to another without actually mentioning the other person or using standard RT or @ reply functionality)

Honestly, at this point you’re dipping into linguistics territory pretty heavily, where my lady-pal is far more an expert than I, but it’s cool. . . and extraordinarily frustrating. . . to try to deal with human speech/writing in a systematic way. Probably why most linguists these days are descriptivists rather than prescriptivists, haha.

I’ve been plugging away at the analysis. Unicode and UTF-8 and tweets with little crappy unicode characters in them are causing me all manner of headaches. Just when I think I’ve got them taken care of, I try to change something and then spend another hour determining why I cannot manipulate a string with a little symbol in it. Ugh. Give me webdings any day!

Anyhow, here’s a look at tweet volume and mean sentiment for xbox one and ps4 tweets over the first 10 days or so of data. As soon as I work out the unicode hassles, I’ll do the bigram and trigram analysis and write another front page article. Should be this weekend. At this point, I might as well wait for the Xbox release so that I can discuss it and the PS4 release.
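For anyone wondering what the bigram and trigram step amounts to, here is a rough sketch in Python 3. It is not the code behind the charts: the tokenizer is deliberately crude, and `tokenize`, `ngrams`, `count_ngrams` and the sample tweets are all made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase, then keep runs of
    # letters, digits and apostrophes. Real tweets need more care.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(tweets, n):
    # Count per tweet so that no n-gram spans two different tweets.
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts


# Made-up sample tweets, not real data.
sample_tweets = [
    "I love the PS4",
    "I love the Xbox One",
]

bigram_counts = count_ngrams(sample_tweets, 2)
trigram_counts = count_ngrams(sample_tweets, 3)
print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```

Counting within each tweet, rather than over one long joined string, avoids phantom n-grams made from the last word of one tweet and the first word of the next.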

I’ll bet you can’t find the PS4 release on these charts… funny that people were dissing the Xbox at the moment of PS4 release.

Whew. Stack Overflow to the rescue again. Let it stand, for the record, that if you are ever importing tweets that have unicode characters (almost all of them do) in Python 2, then you can’t just use csv.reader. I have no idea how I didn’t run into this problem before, since I’ve churned through these tweets hundreds of times.

You have to take this approach:


#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import csv
from operator import itemgetter

# Fixing the unicode csv problem
# http://stackoverflow.com/questions/20137321/parsing-unicode-in-tweets-using-python

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

tweets_fname = "processed_tweets_2013_11_10_11.csv"

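with codecs.open(tweets_fname, encoding="utf-8") as current_file:
    data = unicode_csv_reader(current_file, delimiter=",")
    # The tweet text is in the third column (index 2); join it all into one string
    tweets = u' '.join(map(itemgetter(2), data))
    # Encode back to UTF-8 bytes, replacing anything that cannot be encoded
    encoded_tweets = tweets.encode('utf8', 'replace')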

print len(encoded_tweets)
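For comparison, a sketch of the same read in Python 3, where the csv module works on text directly and the encode/decode wrapper is not needed. The column index 2 comes from the code above; `read_tweet_texts`, `write_sample_csv` and the sample rows are made up so the example runs on its own.

```python
import csv
import os
import tempfile


def read_tweet_texts(path, text_column=2):
    # Open in text mode as UTF-8; newline="" is what the csv docs
    # recommend so that embedded newlines in fields are handled.
    with open(path, encoding="utf-8", newline="") as f:
        return [row[text_column] for row in csv.reader(f)
                if len(row) > text_column]


def write_sample_csv(path):
    # Made-up rows standing in for the real tweet file.
    rows = [
        ["1", "2013-11-10", "PS4 \u2665 launch"],
        ["2", "2013-11-11", "xbox one \u2639, meh"],
    ]
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_tweets.csv")
    write_sample_csv(sample_path)
    texts = read_tweet_texts(sample_path)

tweets = " ".join(texts)
encoded_tweets = tweets.encode("utf-8", "replace")
# The byte count exceeds the character count because each symbol
# takes three bytes in UTF-8.
print(len(tweets), len(encoded_tweets))
```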

So the initial descriptive statistics (lite) are up in another fp article. There were some interesting results that I didn’t include. Have a look here:

Basically, for the period of tweets that I’ve collected, tweets that discuss the PS4 tend to be more likely to include racist language. Disregarding the noise in the middle, tweets that discuss the XB1 tend to be more likely to include homophobic language. Finally, other than a brief period (when the PS4 was released), tweets that mention the Xbox One tend to be more likely to include profanities than those that mention the PS4.

You obviously can’t draw any larger conclusions from this, but it might be interesting to look back at the pattern across all of the tweets between 10 Nov and 1 Jan.

By the way, there are far more “profane” tweets than racist ones, and not very many homophobic ones. The raw numbers are standardized and mapped to a 0-1 scale. The result is that the homophobic chart seems much noisier because in many cases, hour to hour, the sum total value is vacillating between 3 and 20 (or something along those lines) while with the profane tweets, it’ll vacillate between, say, 250 and 275 – a smaller relative jump, which makes a smoother line.
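To put rough numbers on that relative-jump point, here is a tiny Python 3 sketch using the illustrative counts from the paragraph above (3 to 20 versus 250 to 275). The `relative_jump` helper is made up for this example and is not part of the actual analysis.

```python
def relative_jump(low, high):
    # Size of a swing from low to high, as a fraction of the low value.
    return (high - low) / low


# Illustrative hourly counts from the text, not measured data.
homophobic_jump = relative_jump(3, 20)
profane_jump = relative_jump(250, 275)

print("homophobic: {:.0%}".format(homophobic_jump))
print("profane:    {:.0%}".format(profane_jump))
```

A swing of several hundred percent against one of ten percent is why the sparse series looks so much jumpier once both are squeezed onto the same 0-1 axis.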

God I love this