Analysis of Liverpool FC Tweets in the 15/16 Barclays Premier League Season

The goal of this project is to collect a real-time stream of tweets, then analyse and visualise that data. I do not cover every Liverpool game of the 15/16 season, only a few notable ones.

Libraries used

  • Pymongo
  • Tweepy
  • Pandas
  • Vincent
  • NLTK
  • Folium

Python (.py) files

streamer.py - Creates a listener that collects tweets in real-time and stores them in a MongoDB collection.

to_csv.py - Creates and writes a csv file named tweets.csv from data stored in MongoDB.

graph.py - Converts data in tweets.csv to a pandas time series and creates a graph with Vincent.

word_freq.py - Filters the tweet texts in tweets.csv for relevant words and plots a frequency distribution of those words using NLTK.

map.py - Creates a map of geo-tagged tweets with Folium.

In the stream listener, as found in streamer.py, I'm specifically listening for tweets that contain the words 'liverpool', 'lfc', or 'liverpoolfc'.

Arsenal-Liverpool Tweet Analysis (08/24/2015)

The first game I analysed was the Arsenal-Liverpool game on August 24, played at the Emirates Stadium, which ended in a goalless draw. I started the live Twitter stream right at kick-off, around 3:00pm EST, kept it running for the entire game, and turned it off at 4:53pm EST, a few minutes after the final whistle. Overall, around 152,000 tweets were collected during the match. Using graph.py and word_freq.py, I produced a time-series plot of the volume of tweets per minute containing the relevant keywords, and a frequency distribution of the most common words in the data. First, let us examine the plot of tweet volume over time.

Twitter volume per minute as a function of time

Examining the graph of relevant tweets as a function of time, each peak appears to correspond to a memorable event in the game, such as a goal-scoring opportunity or a booking. For example, the first spike in the plot, around 03:04, corresponds to the moment when Coutinho hits the bar.

We also see a large peak at around 3:45, at almost 3,000 tweets per minute. Many of the tweets in that window are about another Coutinho attempt on goal that hit the post.

However, this peak declines gradually rather than sharply. This is most likely because the attempt came right before half-time, giving users about 15 minutes to share their thoughts on the first half before play resumed.

Frequency distribution of 25 most common words

Given the filter words, we expect liverpool and #lfc to be the most common words, so this is not surprising. However, rt, which marks a retweet, is the third most frequent token during the game, indicating that many users were simply retweeting others' comments. The most tweeted-about players are Cech and Coutinho, which makes sense given each player's contribution to his team and his highlight-worthy moments.

Mapping twitter data

Not every tweet includes location data, since sharing it is up to the user. Out of 152,898 tweets collected, only 612 contained geo data, which is only 0.4% of all tweets. With the help of Folium, we can visualize these tweets in 3 maps:

  1. World: In the world map, the British Isles have a concentration of data points, which is expected for a Premier League match. Surprisingly, we also see heavy concentration in Southeast Asia, namely Malaysia, Singapore and Indonesia. It is also interesting that former British colonies in Africa (Nigeria, Kenya and South Africa) had relatively high tweet counts.

  2. England: In England, London and Liverpool had the highest concentration of tweets, unsurprising given that it was an Arsenal-Liverpool game. The remaining tweets are spread more sparsely among other major cities such as Leicester, Birmingham and Leeds.

  3. Liverpool: In the city of Liverpool, many of the tweets were posted from the city center, with a few around Anfield. Since this was an away match for Liverpool FC on a Monday, it is likely that many spectators tweeted about the game from their offices.
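The geo-tagged share quoted before the list of maps can be checked with a couple of lines. This is only a small arithmetic sketch (written for Python 3) using the two counts given in the text; the variable names are my own.

```python
# Counts quoted in the text above.
total_tweets = 152898
geo_tagged = 612

# Share of collected tweets that carried geo data, in percent.
share = geo_tagged / total_tweets * 100
print(f"{share:.2f}% of collected tweets carried geo data")
```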

Code Overview

Let us now examine in detail the five Python files mentioned.

streamer.py

This file uses Pymongo and Tweepy to collect real-time Twitter data and store it in a MongoDB collection. The key method in the CustomListener class is on_status():

  def on_status(self, tweet):
    data = {}
    data['text'] = tweet.text
    data['user'] = tweet.user.screen_name
    data['created_at'] = tweet.created_at
    data['geo'] = tweet.geo
    
    print data, '\n'
    self.db.Tweets.insert(data)

For every new tweet that comes through the filtered stream, we create a dictionary called data with 4 keys (text, user, created_at and geo). text is the content of the tweet itself; user is the Twitter handle of the user who posted it; created_at is the time when the tweet was posted; and geo is the location at which the tweet was created. geo is only available for users who enable geo-tagging, so the majority of tweets do not have this information. Once we have created the dictionary, we store it in a MongoDB collection called Tweets with self.db.Tweets.insert(data). Finally, we create an instance of the stream and filter for the search words using a list:

  listen = Stream(auth, CustomListener(api))
  listen.filter(track=['liverpool','lfc','liverpoolfc'])
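The dictionary built in on_status() can be tried out without Twitter credentials or a running MongoDB. The sketch below is written for Python 3 and is not part of streamer.py: tweet_to_document and fake_tweet are hypothetical names, and the tweet values are made up purely for illustration.

```python
from datetime import datetime
from types import SimpleNamespace


def tweet_to_document(tweet):
    """Pick out the four fields that on_status() stores for each tweet."""
    return {
        'text': tweet.text,
        'user': tweet.user.screen_name,
        'created_at': tweet.created_at,
        'geo': tweet.geo,  # None unless the user enabled geo-tagging
    }


# Hypothetical stand-in for a Tweepy status object; all values are made up.
fake_tweet = SimpleNamespace(
    text='Coutinho hits the bar! #lfc',
    user=SimpleNamespace(screen_name='example_fan'),
    created_at=datetime(2015, 8, 24, 19, 4),
    geo=None,
)

document = tweet_to_document(fake_tweet)
print(document)
```

Keeping the field selection in a small function like this would make it easy to test separately from the stream and the database.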

to_csv.py

to_csv.py writes the data stored in MongoDB to a csv file. Using Python's csv module, we create a writer that writes to a file called tweets.csv:

with open('tweets.csv', 'w') as outfile:
  fieldnames = ['text', 'user', 'created_at', 'geo']
  writer = csv.DictWriter(outfile, delimiter=',', fieldnames=fieldnames)
  writer.writeheader()

Next, we iterate over the Tweets collection and write each document's values under the matching fieldnames:

  # still inside the with block, so outfile is open while we write
  for data in db.Tweets.find():
    writer.writerow({
      'text': data['text'].encode('utf-8'),
      'user': data['user'].encode('utf-8'),
      'created_at': data['created_at'],
      'geo': data['geo']
    })

For the text and user fields, we encode Unicode strings to UTF-8, because the csv module in Python 2 cannot write non-ASCII Unicode text directly.
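For comparison, here is a minimal sketch of the same export step in Python 3, where strings are already Unicode and the explicit .encode('utf-8') calls are not needed. It is not the code in to_csv.py: the rows are made up, and io.StringIO stands in for a real file so that the example runs without MongoDB.

```python
import csv
import io

# Made-up rows standing in for documents returned by db.Tweets.find().
rows = [
    {'text': 'Čech with a huge save', 'user': 'example_fan',
     'created_at': '2015-08-24 19:45:00', 'geo': None},
]

fieldnames = ['text', 'user', 'created_at', 'geo']

# io.StringIO stands in for
# open('tweets.csv', 'w', newline='', encoding='utf-8').
outfile = io.StringIO(newline='')
writer = csv.DictWriter(outfile, fieldnames=fieldnames)
writer.writeheader()
for row in rows:
    writer.writerow(row)  # a None value is written as an empty field

csv_text = outfile.getvalue()
print(csv_text)
```

With a real file, passing newline='' to open() is what the csv documentation recommends, so that the writer controls line endings itself.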

graph.py

Now that we have exported our data to a csv file, we can start to analyze it by plotting tweet volume over time using pandas and Vincent, a library for building Vega visualizations with Python. First we create a pandas dataframe, then use the created_at column as the index to create a pandas time series.

tweets = p.read_csv('./tweets.csv')
tweets['created_at'] = p.to_datetime(p.Series(tweets['created_at']))
tweets.set_index('created_at', drop=False, inplace=True)

We can then resample the data into one-minute bins, which gives us tweets per minute on the y-axis, and display the plot with Vincent in a color scheme of our choice.

# created_at index is formatted to per minute
tweets_pm = tweets['created_at'].resample('1t', how='count')

# create time series graph via Vincent
vincent.core.initialize_notebook()
area = vincent.Area(tweets_pm)
area.colors(brew='Spectral')
area.display()
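To make the per-minute binning concrete, here is a small standard-library sketch of the same idea, written for Python 3. It is a stand-in for the pandas resample call rather than what graph.py does; per_minute_counts is a hypothetical helper and the timestamps are made up.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps in the same format as the created_at column.
created_at = [
    '2015-08-24 19:04:05',
    '2015-08-24 19:04:40',
    '2015-08-24 19:05:10',
    '2015-08-24 19:04:59',
]


def per_minute_counts(timestamps):
    """Count tweets per minute by truncating each timestamp to the minute."""
    counts = Counter()
    for stamp in timestamps:
        moment = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
        counts[moment.replace(second=0)] += 1
    return dict(sorted(counts.items()))


tweets_per_minute = per_minute_counts(created_at)
for minute, count in tweets_per_minute.items():
    print(minute.strftime('%H:%M'), count)
```

One difference worth noting: this sketch only lists minutes that contain at least one tweet, whereas resampling with pandas also produces the empty minutes in between with a count of 0. Also, later pandas releases dropped the how= argument, so the equivalent call there would be along the lines of .resample('1min').count().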

word_freq.py

We can also find the most commonly used words during the Arsenal-Liverpool game with pandas and NLTK. NLTK makes it very easy to process text, and probably saved me from writing a lot more code in word_freq.py. To parse and process the tweet texts, we first lowercase every word and strip it of punctuation marks, so that words like "Arsenal" and "Arsenal." are not counted as two different words. We also load NLTK's list of English stop words, which are high-frequency words that are often irrelevant, such as articles and prepositions (think "to", "the", "also").

# get english stopwords
stop = stopwords.words('english')
texts = pandas.read_csv('./tweets.csv')['text']

tokens = []

# strip words of punctuation marks
for text in texts.values:
  tokens.extend([word.lower().strip(':,."-') for word in text.split()])

Now we create a new list of filtered words by removing the stop words from the stripped tokens. Again, we have to take Unicode encoding into account, decoding each token from UTF-8 before comparing it with the stop words.

filtered_tokens = [word.decode('utf-8') for word in tokens if not word.decode('utf-8') in stop]

With help from nltk, we can easily plot a frequency distribution of the top 25 words used during the game.

freq_dist = nltk.FreqDist(filtered_tokens)
freq_dist.plot(25)
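The whole pipeline (lowercase, strip punctuation, drop stop words, count) can be sketched with the standard library alone. The example below is written for Python 3 and is only an illustration: STOP_WORDS is a tiny hand-picked set, not NLTK's English list, top_words is a hypothetical helper, and the tweet texts are made up.

```python
from collections import Counter

# Tiny illustrative stop-word set; NLTK's English list is much longer.
STOP_WORDS = {'the', 'to', 'a', 'at', 'and', 'is'}

# Made-up tweet texts.
texts = [
    'RT @example: Coutinho hits the bar.',
    'Cech denies Coutinho at the near post.',
]


def top_words(texts, n):
    """Return the n most common non-stop-word tokens across all texts."""
    tokens = []
    for text in texts:
        # Same normalisation as in word_freq.py: lowercase, then strip
        # punctuation from both ends of each whitespace-separated word.
        tokens.extend(word.lower().strip(':,."-') for word in text.split())
    filtered = [word for word in tokens if word and word not in STOP_WORDS]
    return Counter(filtered).most_common(n)


print(top_words(texts, 3))
```

Counter.most_common plays the role of the frequency distribution here; it returns the counts but does not plot them.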

map.py

In map.py we visualize our Twitter data with Folium, which builds maps on top of the Leaflet.js library. The first step is to get the geo data from our csv file. Since not all rows have a value in the geo column, we keep only those that do.

# get geo data only from rows with non-empty values
locations = pandas.read_csv('./tweets.csv', usecols=[3]).dropna()

Next, we create a list named geos that contains the coordinates of our 612 tweets. This makes the location data easy to work with when mapping it with Folium. Since each geo value was saved to the csv file as the string form of a Python dictionary rather than as JSON, we evaluate it with ast.literal_eval() to get a dictionary back.

geos = []

for location in locations.values:
  # parse the stringified dictionary and keep only its coordinates
  geos.append(ast.literal_eval(location[0])['coordinates'])
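To see why ast.literal_eval() is used here rather than a JSON parser, consider a single cell from the geo column. The sketch below is written for Python 3; raw_geo is a made-up value shaped the way Python 2 would have stringified such a dictionary, not a row from the actual data.

```python
import ast
import json

# Made-up cell from the geo column, as Python 2 would have stringified it.
raw_geo = "{u'type': u'Point', u'coordinates': [53.4308, -2.9608]}"

# literal_eval safely parses Python literals (dicts, lists, strings, numbers)
# without executing arbitrary code.
geo = ast.literal_eval(raw_geo)
coordinates = geo['coordinates']
print(coordinates)

# The same string is not valid JSON (single quotes, u'' prefixes).
try:
    json.loads(raw_geo)
    parsed_as_json = True
except ValueError:
    parsed_as_json = False
print(parsed_as_json)
```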

Finally, we use Folium to instantiate the map, add a marker for each set of coordinates, and write the map to an HTML file called map.html.

# initialize and create map
tweet_map = folium.Map(location=[52.8, -2], tiles='Mapbox Bright', zoom_start=7)

# add markers
for geo in geos:
  tweet_map.circle_marker(location=geo, radius=250)

tweet_map.create_map(path='map.html')

As we can see, it is very straightforward to set marker locations with the geos list.