
7 Days of Source Day #6: NYTimes GraphMaker

NYTimes Drug Diptych

Project: NYTimes GraphMaker
Date: Fall, 2009
Language: Processing
Key Concepts: Data visualization, graphing, NYTimes Article Search API

Overview:

The New York Times Article Search API gives us access to a mountain of data: more than 2.6 million indexed articles. There must be countless discoveries waiting to be made in this vast pile of information – we just need more people with shovels! With that in mind, I wanted to release a really simple example of using Processing to access word trend information from the Article Search API. Since I made this project in February, the clever folks at the NYT research lab have released an online tool to explore word trends, but I think it’s useful to have the Processing code released for those of us who want to poke around the data in a slightly deeper way. Indeed, I hope this sketch can act as a starting point for people to take some more involved forays into the dataset – it is ripe to be customized and changed and improved.
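To give a sense of what a query looks like, here is a minimal stand-alone sketch that pulls yearly article counts for a single term. To be clear, this is not the GraphMaker code: it assumes Processing 2 or later (for loadJSONObject), it talks to the current v2 Article Search endpoint rather than the original API this project was built against, and YOUR_API_KEY is a placeholder for a key from the NYT developer site.

String apiKey = "YOUR_API_KEY";   // placeholder - request a key from the NYT developer site
String term = "influenza";

void setup() {
  // count articles matching the term for each year in the range
  for (int year = 2000; year <= 2009; year++) {
    String url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
      + "?q=" + term
      + "&begin_date=" + year + "0101"
      + "&end_date=" + year + "1231"
      + "&api-key=" + apiKey;
    JSONObject json = loadJSONObject(url);
    int hits = json.getJSONObject("response").getJSONObject("meta").getInt("hits");
    println(year + ": " + hits + " articles mention '" + term + "'");
    delay(12000);  // pause between requests; the API is rate-limited
  }
}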

This is the simplest project I’m sharing in this now multi-week source release. It should be a nice starting point for those of you who have some programming experience but haven’t done too much in the way of data visualization. As always, if you have questions, feel free to send me an e-mail or post in the comments section below.

You can see a whole pile of radial and standard bar graphs that I made with this sketch earlier in the year in this Flickr set.

Getting Started:

You’ll need the toxiclibs core, which you can download here. Put the unzipped library into the ‘libraries’ folder in your sketchbook (if there isn’t one already, create one).

Put the folder ‘NYT_GraphMaker’ into your Processing sketch folder. Open Processing and open the sketch from the File > Sketchbook menu. You’ll find detailed instructions in the header of the main tab (the NYT_GraphMaker.pde file).

Thanks:

It’s starting to get a bit repetitive, but once again this file depends on Karsten Schmidt’s toxiclibs. These libraries are so good they should ship with Processing.

Download: GraphMaker.zip (88k)


CC-GNU GPL

This software is licensed under the CC-GNU GPL version 2.0 or later.

GoodMorning!

GoodMorning! Pinheads

GoodMorning! is a Twitter visualization tool that shows about 11,000 ‘good morning’ tweets over a 24-hour period, rendering a simple sample of Twitter activity around the globe. The tweets are colour-coded: green blocks are early tweets, orange ones are around 9am, and red tweets are later in the morning. Black blocks are ‘out of time’ tweets which said good morning (or a non-English equivalent) at a strange time of day. Click on the image above to see North America in detail (click on the ‘all sizes’ button to see the high-res version, which is 6400×3600), or watch the video below to see the ‘good morning’ wave travel around the globe:

GoodMorning! Full Render #2 from blprnt on Vimeo.

I’ll admit that this isn’t a particularly useful visualization. It began as a quick idea that emerged out of the discussions following my post about Just Landed, in which several commenters asked to see a global version of that project. This would have been reasonably easy, but I felt that the 2D map made the important information in that visualization a bit easier to see. I wondered what type of data would be interesting to see on a globe, and started to think about something time based; more specifically, something in which you might see a kind of a wave traveling around the earth. I have been neck-deep in the Twitter API for a couple of months now, and eventually the idea trickled up to look at ‘good morning’ tweets.

The first task was to gather 24 hours’ worth of good morning tweets. Querying the Twitter API is easy enough – I posted a simple tutorial about doing this with Processing and Twitter4J a couple of weeks ago. The issue with gathering this many tweets is that any search request only returns the most recent 1,500 results. I needed many more than that – there are about 11,000 in the video, and those were only the ones from users with a valid location in their Twitter profile. All told, I ended up receiving upwards of 50,000 tweets. The only way to get this many results was to leave a ‘gathering’ client running for 24 hours. I should have put this on a server somewhere, but I didn’t, so my laptop needed to run the client for a full day to get the results. It ended up taking 5 days, thanks to a few false starts (the initial scripts choked on some strange iPhone locations) and a couple of bone-headed errors on my part. Finally, I ended up with a JSON file holding the messages, users, dates and locations for a day’s worth of morning greetings.
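If you want to build a similar gathering client, the skeleton is pretty simple. The code below is a rough, updated sketch of the idea rather than the code I actually ran: it assumes a recent version of Twitter4J with credentials set up in a twitter4j.properties file, polls the search API every couple of minutes, and appends anything new to a tab-separated log (rather than the JSON file I ended up with).

import twitter4j.*;
import java.util.List;

Twitter twitter = TwitterFactory.getSingleton();
PrintWriter log;
long lastSeenId = 0;
int pollInterval = 2 * 60 * 1000;   // poll every two minutes
int lastPoll = -pollInterval;

void setup() {
  log = createWriter("goodmorning_tweets.txt");
}

void draw() {
  if (millis() - lastPoll < pollInterval) return;
  lastPoll = millis();
  try {
    Query query = new Query("\"good morning\"");
    query.setCount(100);                            // maximum tweets per request
    if (lastSeenId > 0) query.setSinceId(lastSeenId);
    QueryResult result = twitter.search(query);
    List<Status> tweets = result.getTweets();
    for (Status s : tweets) {
      if (s.getId() > lastSeenId) lastSeenId = s.getId();
      log.println(s.getCreatedAt() + "\t" + s.getUser().getLocation() + "\t" + s.getText());
    }
    log.flush();
    println("Fetched " + tweets.size() + " new tweets.");
  }
  catch (TwitterException e) {
    println("Search failed: " + e.getMessage());
  }
}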

I’d already decided to embrace the visual cliché that is the spinning globe. It’s reasonably easy to place points on a sphere once you know the latitude and longitude values – the first thing I did was to place all 11,000 points on the globe to see what kind of distribution I ended up with. Not surprisingly, the points don’t cover the whole globe. I tried to include some non-English languages to encourage a more even distribution, but I don’t think I did the best job that I could have (if you have ideas for what to search for in other languages – particularly Asian ones – please leave a note in the comments). Still, I thought there should be enough to get my ‘good morning wave’ going.
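For reference, putting a point on a sphere from a latitude/longitude pair is just the standard spherical-to-Cartesian conversion. A minimal version, with the Y axis running through the poles to suit Processing’s 3D coordinate system, looks something like this:

// convert a lat/lon pair (in degrees) to a point on a sphere of radius r
PVector latLonToXYZ(float lat, float lon, float r) {
  float theta = radians(lat);
  float phi = radians(lon);
  float x = r * cos(theta) * cos(phi);
  float y = -r * sin(theta);            // negative so that north is 'up' on screen
  float z = r * cos(theta) * sin(phi);
  return new PVector(x, y, z);
}

void setup() {
  size(400, 400, P3D);
}

void draw() {
  background(0);
  translate(width / 2, height / 2);
  rotateY(frameCount * 0.01);           // slowly spin the globe
  // an example point: Vancouver, roughly 49.3 N, 123.1 W
  PVector p = latLonToXYZ(49.3, -123.1, 150);
  stroke(255);
  strokeWeight(5);
  point(p.x, p.y, p.z);
}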

In my first attempts, I coloured the tweet blocks according to the size of the tweet. In these versions, you can see the wave, but it’s not very distinct:

GoodMorning! First Render from blprnt on Vimeo.

I needed some kind of colouring that would show the front of the wave – I ended up setting the colour of the blocks according to how far the time block was away from 9am, local time. This gave me the colour spectrum that you see in the latest versions.
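In code, that colouring comes down to a simple interpolation. Here is a sketch of the idea; the exact window and colour values are guesses rather than the numbers used in the final render:

// colour a tweet block by how far its local posting time falls from 9am.
// 'localHour' is a decimal hour, e.g. 8.5 for 8:30am. Anything well outside
// the morning window gets the black 'out of time' colour.
color colourForHour(float localHour) {
  color early = color(50, 200, 50);    // green: early tweets
  color nine = color(255, 150, 0);     // orange: around 9am
  color late = color(220, 30, 30);     // red: later in the morning
  if (localHour < 4 || localHour > 12) return color(0);   // 'out of time'
  if (localHour <= 9) return lerpColor(early, nine, map(localHour, 4, 9, 0, 1));
  return lerpColor(nine, late, map(localHour, 9, 12, 0, 1));
}

void setup() {
  size(300, 100);
  noStroke();
  // draw a swatch for each hour from 4am to noon
  for (int i = 0; i < 8; i++) {
    fill(colourForHour(4 + i));
    rect(i * width / 8, 0, width / 8, height);
  }
}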

Originally, I wanted to include the text in the tweets, but after a few very messy renders, I dropped that idea. I still think it might be possible to incorporate text in some way that doesn’t look like a pile of pick-up-sticks – I just haven’t found it yet. Here’s a render from the text-mess:

GoodMorning!

There are some inherent problems with this visualization. As mentioned earlier, it’s certainly not a complete representation of global Twitter users. Also, I’m relying on the location that the user lists in their Twitter profile to plot the points on the globe. It’s very likely that a high proportion of these locations could be inaccurate. Even if the location is correct, it might not be accurate enough to be useful. If you look at the bottom right of the images above, you’ll see a big plume of blocks in the middle of South America. This isn’t some undiscovered Twitter city in the middle of the jungle (El Tworado? Sorry.) – it’s the result of many users listing their location as simply ‘South America’. There’s one of these in every country and continent (this explains the cluster of Canadians tweeting from somewhere near Baker Lake).

On the other hand, it provides a model for how similar visualizations might be made – propagation maps of trending topics, plotting of followers over time, etc. Even in its current form, the tool does provide some interesting data – for example, it seems that East Coasters tweet earlier than West Coasters (there’s more green in the East than in the West). I’m guessing that in the hands of people with more than my rudimentary statistics skills, these kinds of data sets could tell us some interesting – and (heaven forbid) useful – things.

Too much information? Twitter, Privacy, and Lawrence Waterhouse

A few months ago, Twitter was in the news over concerns that people might be sharing a little bit too much information in their social networking broadcasts. Arizona resident Israel Hyman left for a vacation – but not before sending out a Tweet telling his friends and followers that he’d be out of town. While he was away, Hyman’s house was burgled. He suspected that his status update could have motivated the thieves to give his house a try.

Personally, I’m not convinced there are too many Twitter-savvy burglars (TweetBurglars?) out there. It’s probably much more likely that the would-be thief saw Mr. Hyman drive off in his station wagon with a canoe on top. In any case, I should be safe: I don’t broadcast much personal information in my Twitter feed.

At least not too much implicit personal information.

How much ‘hidden’ information am I sharing in my Tweets? This is a question that has come up recently as I’ve been digging around with the Twitter API. I think that curious parties, armed with the Twitter API and some rudimentary programming skills, might be able to find out more about you than you’d expect.

Here is a simple graphic showing all of my 1,212 Tweets since October of last year:

Twittergrams: All Tweets

The graph is ordered by day horizontally and by time vertically – tweets near the top are close to midnight, while those in the lower half were broadcast in the morning. I’ve also indicated the length of each tweet with the size of the circles – longer tweets show up larger and darker. You can see some trends from this really simple graph. First, it’s clear that my tweets have gotten longer and more frequent since I started using Twitter. Also, my first Tweets of the day (the bottom-most ones) seem to start on average at about the same time – with a few notable anomalies. To investigate this a bit further, let’s highlight just the first tweets of the day (I’ve ignored days with 3 or fewer tweets):

Twittergrams: Morning Tweets

You can see from this plot that my morning messages tend to fall around 9:30am. There are a few outliers where I (heaven forbid) haven’t been around the computer – but there are also some deviations from the 9:30 norm that aren’t just statistical anomalies. If we plot some lines through the morning points starting in January this year, we’ll see the three areas where my twitter behaviour is not ‘normal’:

Twittergrams: Morning Tweet Graph

The yellow line in this graph is the 9:30 mean. The red line shows (with a bit of cushioning) the progression of first tweet times over the 8 months in question. At the marked points, my first tweets deviate from the average. Why? In two of the three cases, I’ve changed time zones. In March, I was in Munich, and in May/June I was in Boston, New York, and Minneapolis (you can actually see the time shift between EST & CST). In the zone marked with a 1, I was commuting in the morning to a residency in a Vancouver suburb – hence the later starts until February.
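For the curious, the ‘cushioning’ on that red line is nothing exotic: a simple moving average over each day’s first-tweet time gets you most of the way there. Here is a sketch of that smoothing, with an arbitrary window size:

// smooth a series of first-tweet times (decimal hours, one per day) with a
// centred moving average; 'window' is the number of days on either side.
// usage: float[] redLine = movingAverage(firstTweetHours, 7);
float[] movingAverage(float[] firstTweetHours, int window) {
  float[] smoothed = new float[firstTweetHours.length];
  for (int i = 0; i < firstTweetHours.length; i++) {
    float sum = 0;
    int count = 0;
    for (int j = max(0, i - window); j <= min(firstTweetHours.length - 1, i + window); j++) {
      sum += firstTweetHours[j];
      count++;
    }
    smoothed[i] = sum / count;
  }
  return smoothed;
}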

This is a very simple example – certainly there isn’t a lot of useful (or incriminating) information to be found. But in the hands of a more capable investigator, it’s possible that the information underneath all of the Tweets, Facebook updates, Flickr comments, etc. that I am broadcasting every day could reveal a lot more than I would want to share. Some nefarious party could quite easily set up a ‘tracker’ to watch my public broadcasts, and to be notified if my daily behaviour deviates from the norm. TweetBurglars, are you listening?

Of course, all of this information can be useful for the good guys, too. With millions of people active on Twitter, the store of data – and what it can reveal – gets more and more interesting every day. We are already seeing scientists using web data to measure public happiness, but I think we have just scraped the surface of what could be uncovered. (To see a model of how Twitter updates could be used to track travel and disease spread, see my post on Just Landed).

In Cryptonomicon, Neal Stephenson describes a scenario in which an accurate map of London might be generated by tracking the heights of people’s heads as they step on and off of curbs:

The curbs are sharp and perpendicular, not like the American smoothly molded sigmoid-cross-section curves. The transition between the side walk and the street is a crisp vertical. If you put a green lightbulb on Waterhouse’s head and watched him from the side during the blackout, his trajectory would look just like a square wave traced out on the face of a single-beam oscilloscope: up, down, up, down. If he were doing this at home, the curbs would be evenly spaced, about twelve to the mile, because his home town is neatly laid out on a grid.

Here in London, the street pattern is irregular and so the transitions in the square wave come at random-seeming times, sometimes very close together, sometimes very far apart.

A scientist watching the wave would probably despair of finding any pattern; it would look like a random circuit, driven by noise, triggered perhaps by the arrival of cosmic rays from deep space, or the decay of radioactive isotopes.

But if he had depth and ingenuity, it would be a different matter.

Depth could be obtained by putting a green light bulb on the head of every person in London and then recording their tracings for a few nights. The result would be a thick pile of graph-paper tracings, each one as seemingly random as the others. The thicker the pile, the greater the depth.

Ingenuity is a completely different matter. There is no systematic way to get it. One person could look at the pile of square wave tracings and see nothing but noise. Another might find a source of fascination there, an irrational feeling impossible to explain to anyone who did not share it. Some deep part of the mind, adept at noticing patterns (or the existence of a pattern) would stir awake and frantically signal the dull quotidian parts of the brain to keep looking at the pile of graph paper. The signal is dim and not always heeded, but it would instruct the recipient to stand there for days if necessary, shuffling through the pile of graphs like an autist, spreading them out over a large floor, stacking them in piles according to some inscrutable system, pencilling numbers, and letters from dead alphabets, into the corners, cross-referencing them, finding patterns, cross-checking them against others.

One day this person would walk out of that room carrying a highly accurate street map of London, reconstructed from the information in all of those square wave plots.

Stephenson tells us that success in such an endeavour requires depth and ingenuity. Depth, the internet has in spades. Millions of Twitter users are adding to a public dataset every second of every day. Ingenuity may prove a bit tougher, but with open APIs and thousands of clever, curious investigators, it will be interesting to see what kinds of maps will be made – and to what ends they will be put.

– I’ll be following up this post over the weekend with a preview of a new Twitter visualization that I have been working on – stay tuned!

Open Science, H1N1, Processing, and the Google Spreadsheet API

Flu Genome Data Visualizer

I’ve recently been working on a project with my friend Jennifer Gardy, whose insights into epidemiology and data modeling led me to build Just Landed. Jennifer is currently working at the BC Centre for Disease Control where, among other things, she’s been looking at data related to swine flu genomics. She came to me with an interesting idea for visualizing data related to historical flu strains, and I thought it might be an interesting project for several reasons. First, I’ve been doing a lot of reading and thinking around the concept of open science and open research, and thought that this project might be a good chance to test out some ideas. Second, I am very interested in the chance to use Processing in a scientific context (rather than an artistic one), and I hope this might be a way to introduce my favourite tool to a broader audience. Finally, there is the chance that a good visualization tool might uncover some interesting things about the influenza virus and its nefarious ways.

The project is just getting started, so I don’t have a lot of results to share (a screenshot of the initial stages of the tool is above). But I would like to talk about the approach that I have been taking, and to share some code which might enable similar efforts to happen using Processing & Google Spreadsheets.

Michael Nielsen is the author of the most cited physics publication of the last 25 years. He’s also a frequent blogger, writing on such disparate topics as chess & quantum mechanics. He has written several excellent posts about open science, including this article about the future of science and this one about doing science online. In both articles, he argues that scientists should be utilizing web-based technologies in a much more efficient manner than they have thus far. In doing so, he believes (as do I) that the efficiency of science as a whole can be greatly improved. In his articles, Michael concentrates both on specialized services such as Science Advisor and the physics preprint arXiv, as well as on more general web entities like Wikipedia and FriendFeed. I agree that these services and others (specifically, I am going to look at Google Docs) can play a useful role in building working models for open science. I’ll argue as well that open-source ‘programming for the people’ initiatives such as Processing and OpenFrameworks could be useful in fostering collaborative efforts in a scientific context.

For the flu genomics project, we are working with a reasonably large data set – about 38,000 data points. Typically, I would work with this file locally, parsing it with Processing and using it as I see fit. This approach, however, has a couple of failings. First, if the data set changes, I am required to update my file to ensure I am working with the latest version. Likewise, if the data is being worked with by several people, Jennifer would have to send each of us updated versions of the data every time there is a significant change. Both of these concerns can be solved by hosting the data online, and by having anyone working with the data subscribe to a continually updated version. This is very easily managed by a number of ‘cloud-based’ web services – the most convenient and most prevalent being Google Docs – specifically Google Spreadsheets.

Most of us are familiar with using Google Spreadsheets – we can create documents online, and then share them with other people. Spreadsheets can be created, added to, edited and deleted, all the while being available to anyone who is subscribed to the document. What isn’t common knowledge is that Google has released an API for Spreadsheets – meaning that we can do all of those things (creating, adding, editing, deleting) remotely using a program like Processing. We can manage our Google-hosted databases with the same programs that we are using to process and visualize our data. It also means that multiple people can be working with a central store of data at the same time. In this way, Google Spreadsheets becomes a kind of publicly-editable database (with a GUI!).

Google Spreadsheets have already been put to good use by the Guardian Data Store, where some clever British folks have compiled interesting data like university drop-out rates, MPs’ expenses, and even a full list of swine flu cases by country. Using the API, we can access all of the information from our own spreadsheets and from public spreadsheets and use it to do whatever we’d like. The Google Spreadsheets API has some reasonably advanced features that allow you to construct custom tables, and use structured queries to extract specific data from spreadsheets (see the Developer’s Guide), but for now I want to concentrate on doing the simplest possible thing – extracting data from individual table cells. Let’s walk through a quick example using Processing.

I’ve created an empty sketch, which you can download here. This sketch includes all of the .jar files that we need to get started with the Spreadsheet API, saving you the trouble of having to import them yourself (the Java Client Library for the Google Data APIs is here – note that the most recent versions are compiled in Java 1.6 and aren’t compatible with the latest version of Processing). I’ve also wrapped up some very basic functionality into a class called SimpleSpreadsheetManager – have a look at the code in that tab if you want to get a better idea of how the guts of this example function. For now, I’ll just show you how to use the pre-built Class to access spreadsheet data.

First, we create a new instance of the SimpleSpreadsheetManager class, and initialize it with our Google username and password:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
}

void draw() {

}

Now we need to load in our spreadsheet – or more specifically, our worksheet. Google Spreadsheets are collections of individual worksheets. Each spreadsheet has a unique ID which we can use to retrieve it. We can then ask for individual worksheets within that spreadsheet. If I visit the swine flu data spreadsheet from the Guardian Data Store in my browser, I can see that the URL looks like this:

http://spreadsheets.google.com/pub?key=rFUwm_vmW6WWBA5bXNNN6ug&gid=1

This URL shows me the spreadsheet key (rFUwm_vmW6WWBA5bXNNN6ug). I can also see from the tabs at the top that the worksheet I want (“COUNTRY TOTALS”) is the first worksheet in the list. I can now load this worksheet using my spreadsheet manager:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);
}

void draw() {

}

To get data out of the individual cells, I have two options with the SimpleSpreadsheetManager. I can request a cell by its column and row indexes, or I can request a cell by its column name and row index:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);

  // get the value of the third cell in the first column
  println(sm.getCellValue(0, 2));                          // returns 'Australia'

  // get the value of the third cell in the column labelled 'Deaths, confirmed swine flu'
  println(sm.getCellValue("deathsconfirmedswineflu", 2));  // returns '9'
}

void draw() {

}

If we wanted to find out which countries had more than 10 confirmed swine flu deaths, we could do this:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);

  // get all of the countries with more than 10 deaths
  for (int i = 0; i < sm.currentTotalRows; i++) {
    String country = sm.getCellValue(0, i);
    String deaths = sm.getCellValue("deathsconfirmedswineflu", i);
    if (deaths != null && Integer.valueOf(deaths) > 10) println(country + " : " + deaths);
  }
}

void draw() {

}

With a bit more work (this took about 10 minutes), we can create a sketch to build an infographic linked to the spreadsheet – making it very easy to output new versions as the data is updated:

Swine Flu Deaths
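If you want to try something similar, a rough sketch along those lines might look like the following. This isn’t the code behind the graphic above; it reuses the same spreadsheet key and column name as the earlier snippets, and the bar scale is arbitrary:

void setup() {
  size(500, 800);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);

  float y = 20;
  for (int i = 0; i < sm.currentTotalRows; i++) {
    String country = sm.getCellValue(0, i);
    String deaths = sm.getCellValue("deathsconfirmedswineflu", i);
    if (deaths == null || Integer.valueOf(deaths) == 0) continue;

    // one labelled bar per country, scaled by the death count
    float barWidth = map(Integer.valueOf(deaths), 0, 150, 0, width - 160);
    noStroke();
    fill(200, 60, 60);
    rect(150, y, barWidth, 12);
    fill(0);
    textAlign(RIGHT, TOP);
    text(country + " (" + deaths + ")", 140, y);
    y += 18;
  }
}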

Not a particularly exciting demo – but it opens a lot of doors for working with data remotely and collaboratively. Rather than depending on generic visualization tools like those built into typical spreadsheet applications, we can use Processing (or a similar platform like OpenFrameworks or Field) to create customized tools suited to a specific dataset. For my flu genomics project, we’re able to create a very specialized applet that examines how the genetic sequences for antigenic regions change over time – certainly not a function that comes standard with Microsoft Excel.

Combining Processing with Google Spreadsheets provides an easy way to bring almost any kind of data into Processing, and at the same time gives us a good way to store and manage that data. I’d definitely like to add some functionality to this really simple starting point. It would be reasonably easy to allow for creation of spreadsheets and worksheets, and I’d also like to look at implementing table record feeds and their associated structured queries. Ultimately, it would be good to package this all up into a Processing library – if someone has the time to take it on, I think it would be a very useful addition for the Processing community.

The single biggest benefit of Processing is that it is easy to learn and use. Over the years, I have taught dozens of designers and artists how to leverage Processing to enter a world that many had thought was reserved for programmers. I suspect that a similar gulf tends to exist in science between those that gather the data and those that process and visualize it. I am interested to see if Processing can help to close that gap as well.

The Guardian Data Store serves as a good model for a how a shared repository for scientific data might work. Such a project would be useful for scientists. But it would also be open to artists, hackers, and the generally curious, who might be able to use the available data in novel (and hopefully useful) ways.

Just Landed: Processing, Twitter, MetaCarta & Hidden Data

Just Landed - Screenshot

I have a friend who has a Ph.D in bioinformatics. Over a beer last week, we ended up discussing the H1N1 flu virus, epidemic modeling, and countless other fascinating and somewhat scary things. She told me that epidemiologists have been experimenting with alternate methods of creating transmission models – specifically, she talked about a group that was using data from the Where’s George? project to build a computer model for tracking and predicting the spread of contagions (which I read about again in this NYTimes article two days later).

Just Landed - Screenshot

This got me thinking about the data that is hidden in various social network information streams – Facebook & Twitter updates in particular. People share a lot of information in their tweets – some of it shared intentionally, and some that could be uncovered with a bit of rudimentary searching. I wondered if it would be possible to extract travel information from people’s public Twitter streams by searching for the phrase ‘Just landed in…’.

Just Landed - Screenshot

The idea is simple: Find tweets that contain this phrase, parse out the location they’d just landed in, along with the home location they list on their Twitter profile, and use this to map out travel in the Twittersphere (yes, I just used the phrase ‘Twittersphere’). Twitter’s search API gives us an easy way to get a list of tweets containing the phrase – I am working in Processing so I used Twitter4J to acquire the data from Twitter. The next question was a bit trickier – how would I extract location data from a list of tweets like this?:

Queen_Btch: just landed in London heading to the pub for a drink then im of to bed…so tired who knew hooking up on an airplane would be so tiring =S
jjvirgin: Just landed in Maui and I feel better already … Four days here then off to vegas
checrothers: Just landed in Dakar, Senegal… Another 9 hours n I’ll be in South Africa two entire days after I left … Doodles

It turned out to be a lot easier than I thought. MetaCarta offers two different APIs that can extract longitude & latitude information from a query. They can take the tweets above and extract locations:

London, London, United Kingdom – “Latitude” : 51.52, “Longitude” : -0.1
Maui, Hawaii, United States – “Latitude” : 20.5819, “Longitude” : -156.375
Dakar, Dakar, Senegal – “Latitude” : 14.72, “Longitude” : -17.48

This seemed perfect, so I signed up for an API key and set to work hooking the APIs up to Processing. This was a little bit tricky, since the APIs require authentication. After a bit of back and forth, I managed to track down the right libraries to implement Basic Authorization in Processing. I ended up writing a set of Classes to talk to MetaCarta – I’ll share these in a follow-up post later this week.

Now I had a way to take a series of tweets, and extract location data from them. I did the same thing with the location information from the Twitter user’s profile page – I could have gotten this via the Twitter API but it would cost one query per user, and Twitter limits requests to 100/hour, so I went the quick and dirty way and scraped this information from HTML. This gave me a pair of location points that could be placed on a map. This was reasonably easy with some assistance from the very informative map projection pages on Wolfram MathWorld.
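For the mapping itself, the simplest option is an equirectangular projection, where longitude maps directly to x and latitude to y. I’m not claiming this is the exact projection used here, but a minimal version looks like this:

// project a lat/lon pair (in degrees) onto a flat map of the given size
PVector latLonToMap(float lat, float lon, float mapW, float mapH) {
  float x = map(lon, -180, 180, 0, mapW);
  float y = map(lat, 90, -90, 0, mapH);   // +90 (north) at the top of the map
  return new PVector(x, y);
}

void setup() {
  size(720, 360);
  background(20);
  // an example home/destination pair: Vancouver to London (approximate coordinates)
  PVector home = latLonToMap(49.3, -123.1, width, height);
  PVector dest = latLonToMap(51.5, -0.1, width, height);
  stroke(255, 180, 0);
  line(home.x, home.y, dest.x, dest.y);
  noStroke();
  fill(255);
  ellipse(home.x, home.y, 6, 6);
  ellipse(dest.x, dest.y, 6, 6);
}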

I’ll admit it took some time to get the whole thing working the way that I wanted it to, but Processing is a perfect environment for this kind of project – bringing in data, implementing 3D, exporting to video – it’s all relatively easy. Here’s a render from the system, showing about 36 hours of Twitter-harvested travel:

Just Landed – 36 Hours from blprnt on Vimeo.

And another, earlier render showing just 4 hours but running a bit slower (I like this pace a lot better – but not the file size of the 36-hour video rendered at this speed!):

Just Landed – Test Render (4 hrs) from blprnt on Vimeo.

Now, I realize this is a long way from a working model for predicting epidemics. But it sure does look cool. I also think it will be a good base for some more interesting work. Of course, as always, I’d love to hear your feedback and suggestions.