I have a friend who has a Ph.D in bioinformatics. Over a beer last week, we ended up discussing the H1N1 flu virus, epidemic modeling, and countless other fascinating and somewhat scary things. She told me that epidemiologists have been experimenting with alternate methods of creating transmission models – specifically, she talked about a group that was using data from the Where’s George? project to build a computer model for tracking and predicting the spread of contagions (which I read about again in this NYTimes article two days later).
This got me thinking about the data that is hidden in various social network information streams – Facebook & Twitter updates in particular. People share a lot of information in their tweets – some of it shared intentionally, and some of it which could be uncovered with some rudimentary searching. I wondered if it would be possible to extract travel information from people’s public Twitter streams by searching for the term ‘Just landed in…’.
The idea is simple: Find tweets that contain this phrase, parse out the location they’d just landed in, along with the home location they list on their Twitter profile, and use this to map out travel in the Twittersphere (yes, I just used the phrase ‘Twittersphere’). Twitter’s search API gives us an easy way to get a list of tweets containing the phrase – I am working in Processing so I used Twitter4J to acquire the data from Twitter. The next question was a bit trickier – how would I extract location data from a list of tweets like this?:
Queen_Btch: just landed in London heading to the pub for a drink then im of to bed…so tired who knew hooking up on an airplane would be so tiring =S
jjvirgin: Just landed in Maui and I feel better already … Four days here then off to vegas
checrothers: Just landed in Dakar, Senegal… Another 9 hours n I’ll be in South Africa two entire days after I left … Doodles
It turned out to be a lot easier than I thought. MetaCarta offers 2 different APIs that can extract longitude & latitude information from a query. It can take the tweets above and extract locations:
London, London, United Kingdom – “Latitude” : 51.52, “Longitude” : -0.1
Maui, Hawaii, United States – “Latitude” : 20.5819, “Longitude” : -156.375
Dakar, Dakar, Senegal – “Latitude” : 14.72, “Longitude” : -17.48
This seemed perfect, so I signed up for an API key and set to work hooking the APIs up to Processing. This was a little bit tricky, since the APIs require authentication. After a bit of back and forth, I managed to track down the right libraries to implement Basic Authorization in Processing. I ended up writing a set of Classes to talk to MetaCarta – I’ll share these in a follow-up post later this week.
Now I had a way to take a series of tweets, and extract location data from them. I did the same thing with the location information from the Twitter user’s profile page – I could have gotten this via the Twitter API but it would cost one query per user, and Twitter limits requests to 100/hour, so I went the quick and dirty way and scraped this information from HTML. This gave me a pair of location points that could be placed on a map. This was reasonably easy with some assistance from the very informative map projection pages on Wolfram MathWorld.
I’ll admit it took some time to get the whole thing working the way that I wanted it to, but Processing is a perfect environment for this kind of project – bringing in data, implementing 3D, exporting to video – it’s all relatively easy. Here’s a render from the system, showing about 36 hours of Twitter-harvested travel:
And another, earlier render showing just 4 hours but running a bit slower (I like this pace a lot better – but not the files size of the 36 hour video rendered at this speed!!)
Now, I realize this is a far stretch from a working model to predict epidemics. But, it sure does look cool. I also I think it will be a good base for some more interesting work. Of course, as always, I’d love to hear your feedback and suggestions.