Install the Twython package for Python
This guide uses the Twython package to connect to the Twitter API from Python. There are some useful examples in the Twython GitHub repository, and the official Twython documentation is also worth a look. Twython offers many ways to access the Twitter API – this guide only uses the TwythonStreamer.
- Launch PowerShell.
- Make sure that your Python installation works by typing python --version at the prompt.
- Type the following command to install the Twython package (pip install twython also works if you have pip):

easy_install twython
- After the install, check that it worked by typing python at the PowerShell prompt. The interactive Python shell should launch.
- Type import twython at the >>> prompt. If nothing happens and you see another >>>, then you are good to go.
- You can type quit() at the >>> prompt to get out of interactive Python.
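If you'd rather not open the interactive shell, a tiny script can do the same check. This is a generic sketch using only the standard library; has_module is my own helper name, not part of Twython:

```python
# Generic import check: returns True if a module can be imported.
import importlib

def has_module(name):
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

print(has_module("csv"))      # stdlib module, prints True
print(has_module("twython"))  # prints True only if the install worked
```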
If you have errors, you can post them in this thread and I’ll try to help.
Save the script
Launch Sublime Text (or whatever code editor you are using) and create a new document. Save it as twit_stream.py somewhere you can find it easily. I created a folder on my desktop called “streaming” and saved it in there.
Now… there are a thousand ways to approach this. I wrote this script quickly and it is a little disorganized, but it works. If you're a Python coder and/or developer and my code gives you hives, feel free to modify it or post a better version.
Copy this code and put it in your twit_stream.py file and save.
# Import the libraries that we need
from twython import TwythonStreamer
import json
import csv
import re
# These are the keys for your twitter application, as discussed earlier in the tutorial
APP_KEY = 'INSERT YOUR TWITTER CONSUMER KEY HERE'
APP_SECRET = 'INSERT YOUR TWITTER CONSUMER SECRET HERE'
OAUTH_TOKEN = 'INSERT YOUR TWITTER ACCESS TOKEN HERE'
OAUTH_TOKEN_SECRET = 'INSERT YOUR TWITTER ACCESS TOKEN SECRET HERE'
# Prompt the user for the terms to track
track_terms = raw_input("What terms do you want to track? (separate multiple with a comma): ")
# Prompt the user for how many tweets they want to keep
keep_tweets = int(raw_input("How many tweets do you want to keep? (-1 for unlimited): "))
# Adjust the input number
if keep_tweets < 0:
    keep_tweets = 999999999
elif keep_tweets == 0:
    keep_tweets = 10
# This will cause the script to keep only english language tweets
# Change to 'all' to keep all tweets regardless of language
keep_lang = 'en'
# Counter for keeping track how many tweets we've saved
counter = 0
# Variable to track whether we've written the header to the CSV file
header_done = False
# Variable to use to name sequential files full of tweets
file_name_suffix = 0
# Prompt the user for how many tweets they want per sequential file
tweets_per_file = int(raw_input("How many tweets do you want to save per file? "))
if tweets_per_file <= 0:
    tweets_per_file = 50000
# This class will process incoming tweets and is called from MyStreamer
# in the on_success() method
class TweetMonkey:
    # Remove some nasty characters that can break the CSV
    def clean(self, text):
        text = text.replace("\n", "; ")
        text = text.replace('"', "'")
        text = text.replace(',', " ")
        return text
    # Method to create the CSV header in each file
    def create_header(self):
        global file_name_suffix
        header = ["id", "lang", "user_name", "tweet", "retweeted",
                  "favorite_count", "source", "in_reply_to_status_id",
                  "in_reply_to_screen_name", "in_reply_to_user_id",
                  "possibly_sensitive", "geo", "created_at"]
        # Write the header to the file
        tweets = open("tweets_" + str(file_name_suffix) + ".csv", 'ab+')
        wr = csv.writer(tweets, dialect='excel')
        wr.writerow(header)
        tweets.close()
    # This is the method that does the heavy lifting for processing a tweet
    # and putting it into the CSV file
    def process(self, tweet):
        global header_done
        global file_name_suffix
        global counter
        global tweets_per_file
        if counter % 1000 == 0:
            print counter, "tweets processed..."
        # Increment the file name
        if counter % tweets_per_file == 0:
            file_name_suffix += 1
            header_done = False  # re-enable if you want every file to include the header
        if not header_done:
            self.create_header()
            header_done = True
        # Build the row for this tweet
        theOutput = []
        theOutput.append(tweet['id'])
        theOutput.append(tweet['lang'].encode('utf-8'))
        # There is redundant scrubbing of the username because I was
        # having trouble removing all of the newline characters
        uname = tweet['user']['name'].encode('utf-8', 'replace')
        newuname = re.sub('\n', '', uname)
        theOutput.append(newuname)
        # There is redundant scrubbing of the tweet because I was
        # having trouble removing all of the newline characters
        twt = self.clean(tweet['text']).encode('utf-8', 'replace')
        newtwt = re.sub('\n', '', twt)
        theOutput.append(newtwt)
        theOutput.append(tweet['retweeted'])
        theOutput.append(tweet['favorite_count'])
        theOutput.append(self.clean(tweet['source']).encode('utf-8', 'replace'))
        theOutput.append(tweet['in_reply_to_status_id'])
        theOutput.append(tweet['in_reply_to_screen_name'])
        theOutput.append(tweet['in_reply_to_user_id'])
        if tweet.get('possibly_sensitive'):
            theOutput.append(tweet['possibly_sensitive'])
        else:
            theOutput.append("False")
        if tweet['geo'] is not None:
            if tweet['geo']['type'] == 'Point':
                lat = str(tweet['geo']['coordinates'][0]) + " "
                lon = str(tweet['geo']['coordinates'][1])
                theOutput.append(lat + lon)
            else:
                theOutput.append(tweet['geo'])
        else:
            theOutput.append(tweet['geo'])
        theOutput.append(tweet['created_at'])
        # Write the tweet to the CSV file
        tweets = open("tweets_" + str(file_name_suffix) + ".csv", 'ab+')
        wr = csv.writer(tweets, dialect='excel')
        wr.writerow(theOutput)
        tweets.close()
# This is the subclass of TwythonStreamer that handles incoming tweets
class MyStreamer(TwythonStreamer):
    # Do this if the tweet is successfully captured
    def on_success(self, data):
        global counter
        global keep_lang
        global keep_tweets
        if 'text' in data:
            if keep_lang == 'all' or data['lang'] == keep_lang:
                # Uncomment this if you want to keep the raw JSON in a single file
                # g = open("games.json", "ab+")
                # json.dump(data, g)
                # g.write("\n")
                # g.close()
                # Keep the CSV
                counter += 1
                writer = TweetMonkey()
                writer.process(data)
                # Disconnect when we have the number of requested tweets
                if counter >= keep_tweets:
                    self.disconnect()
                    print "All done."

    # Do this if there's an error with the tweet
    def on_error(self, status_code, data):
        print "There was an error:"
        print status_code, data
# Create an instance of the MyStreamer class
stream = MyStreamer(APP_KEY,APP_SECRET,OAUTH_TOKEN,OAUTH_TOKEN_SECRET)
# Tell the instance of the MyStreamer class what you want to track
stream.statuses.filter(track=track_terms)
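If you want to sanity-check the scrubbing logic without connecting to Twitter, the clean() method from TweetMonkey can be exercised on its own. Below is a standalone copy (written with print() calls so it runs under Python 2 or 3):

```python
# Standalone copy of the CSV-scrubbing logic from TweetMonkey.clean(),
# so you can try it without streaming any tweets.
def clean(text):
    text = text.replace("\n", "; ")   # newlines would break CSV rows
    text = text.replace('"', "'")     # double quotes confuse some CSV readers
    text = text.replace(',', ' ')     # commas are the field delimiter
    return text

sample = 'He said, "great\ngame"'
print(clean(sample))  # -> He said  'great; game'
```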
Make sure you put the credentials from your Twitter dev account/app into the script. After saving it, hop into PowerShell and navigate to the directory where you saved it. To run the script, type python followed by the name of the script. For instance, I would type:
python twit_stream.py
You’ll be prompted for how many tweets you want to save, how many per file, and what search terms you want to use for the stream.
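Pulled out as a function, the adjustment the script applies to the "how many tweets" answer looks like this (normalize_keep is my own name for illustration; the script does this inline):

```python
# The script's rule for the "how many tweets" prompt: a negative
# answer means unlimited (capped at a huge number), zero falls back
# to a small default of 10, anything else is used as-is.
def normalize_keep(n):
    if n < 0:
        return 999999999
    elif n == 0:
        return 10
    return n

print(normalize_keep(-1))   # -> 999999999
print(normalize_keep(0))    # -> 10
print(normalize_keep(500))  # -> 500
```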
For search terms, there are a few rules.
A comma acts like an inclusive "OR" statement. If you track "ps4,xbox", your stream will contain tweets with ps4, xbox, or both.
A space inside a term acts like an "AND", regardless of word order. For instance, if you track "ps4,xbox one", you will get tweets that contain ps4, plus tweets that contain both xbox and one anywhere in the text – "xbox one", "one day my xbox broke", and so on. The Twitter developer documents contain more information about how the streaming track terms work.
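The comma/space rules can be illustrated with a toy matcher. This sketch mimics the documented semantics locally – it is not Twitter's actual matching code, which also handles punctuation, URLs, and more:

```python
# Toy illustration of the streaming 'track' semantics:
# comma = OR across terms, space = AND within a term.
def matches(track, tweet_text):
    words = tweet_text.lower().split()
    for term in track.split(','):
        # every word in the term must appear somewhere in the tweet
        if all(w in words for w in term.strip().lower().split()):
            return True
    return False

print(matches("ps4,xbox one", "I love my PS4"))          # -> True (ps4 term)
print(matches("ps4,xbox one", "one day the xbox wins"))  # -> True (xbox AND one)
print(matches("ps4,xbox one", "my xbox is great"))       # -> False (missing 'one')
```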
Watch and Wait
After launching the script, look in the folder where it lives. You’ll see a series of .csv files appear, depending on your input parameters.
Examining your spoils
Excel does not open .csv files as UTF-8 encoded by default, so if you simply double-click these files you are likely to end up with jumbled characters. Instead, launch Excel and use its "Import Text" option: set the delimiter to comma, mark the first row as a header, and set the encoding to UTF-8.
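If you'd rather inspect the output programmatically, Python's csv module reads the files back cleanly as long as you're explicit about UTF-8. The sketch below writes and re-reads a small demo file in the same Excel dialect the script uses (Python 3 syntax; tweets_demo.csv is just a throwaway name):

```python
import csv

# Write a tiny CSV the same way the script does, then read it back
# with explicit UTF-8 handling.
header = ["id", "lang", "tweet"]
rows = [["1", "en", "hello"], ["2", "fr", "bonjour"]]

with open("tweets_demo.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f, dialect="excel")
    w.writerow(header)
    w.writerows(rows)

with open("tweets_demo.csv", newline="", encoding="utf-8") as f:
    langs = [r["lang"] for r in csv.DictReader(f)]
print(langs)  # -> ['en', 'fr']
```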
Personally, I prefer to open the .csv files in Sublime Text when I need to view their contents. Later in this series, I'll import those .csv files into other applications for analysis.
** I had a bit of trouble pasting in all of the text/code here. If you have trouble running the script, let me know and I’ll help you troubleshoot it.**