Open Science, H1N1, Processing, and the Google Spreadsheet API

Flu Genome Data Visualizer

I’ve recently been working on a project with my friend Jennifer Gardy, whose insights into epidemiology and data modeling led me to build Just Landed. Jennifer is currently working at the BC Centre for Disease Control where, among other things, she’s been looking at data related to swine flu genomics. She came to me with an interesting idea for visualizing data on historical flu strains, and I thought it would be a worthwhile project for several reasons. First, I’ve been doing a lot of reading and thinking around the concept of open science and open research, and this project seemed like a good chance to test out some of those ideas. Second, I’m very interested in using Processing in a scientific context (rather than an artistic one), and I hope this might be a way to introduce my favourite tool to a broader audience. Finally, there’s a chance that a good visualization tool might uncover some interesting things about the influenza virus and its nefarious ways.

The project is just getting started, so I don’t have a lot of results to share (a screenshot of the initial stages of the tool is above). But I would like to talk about the approach that I have been taking, and to share some code which might enable similar efforts to happen using Processing & Google Spreadsheets.

Michael Nielsen is the author of the most cited physics publication of the last 25 years. He’s also a frequent blogger, writing on such disparate topics as chess & quantum mechanics. He has written several excellent posts about open science, including this article about the future of science and this one about doing science online. In both articles, he argues that scientists should be using web-based technologies far more efficiently than they have thus far. In doing so, he believes (as do I) that the efficiency of science as a whole can be greatly improved. In his articles, Michael concentrates on specialized services such as Science Advisor and the physics preprint arXiv, as well as on more general web entities like Wikipedia and FriendFeed. I agree that these services and others (specifically, I am going to look at Google Docs) can play a useful role in building working models for open science. I’ll argue as well that open-source ‘programming for the people’ initiatives such as Processing and openFrameworks could be useful in fostering collaborative efforts in a scientific context.

For the flu genomics project, we are working with a reasonably large data set – about 38,000 data points. Typically, I would work with this file locally, parsing it with Processing and using it as I see fit. This approach, however, has a couple of failings. First, if the data set changes, I have to update my local file to make sure I’m working with the latest version. Second, if several people are working with the data, Jennifer would have to send each of us an updated copy every time there is a significant change. Both of these concerns can be solved by hosting the data online, and by having everyone who works with the data subscribe to a continually updated version. This is very easily managed by a number of ‘cloud-based’ web services – the most convenient and most prevalent being Google Docs – specifically Google Spreadsheets.

Most of us are familiar with using Google Spreadsheets – we can create documents online, and then share them with other people. Spreadsheets can be created, added to, edited and deleted, all the while being available to anyone who is subscribed to the document. What isn’t common knowledge is that Google has released an API for Spreadsheets – meaning that we can do all of those things (creating, adding, editing, deleting) remotely, using a program like Processing. We can manage our Google-hosted databases with the same programs that we are using to process and visualize our data. It also means that multiple people can work with a central store of data at the same time. In this way, Google Spreadsheets becomes a kind of publicly-editable database (with a GUI!).
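To give a sense of what’s happening under the hood, here’s a rough sketch of how the Java Client Library for the Google Data APIs can authenticate and list the spreadsheets attached to an account. This isn’t part of the example sketch below – just a minimal illustration of the raw API, with placeholder credentials:

import com.google.gdata.client.spreadsheet.SpreadsheetService;
import com.google.gdata.client.spreadsheet.FeedURLFactory;
import com.google.gdata.data.spreadsheet.SpreadsheetFeed;
import com.google.gdata.data.spreadsheet.SpreadsheetEntry;
import java.net.URL;
import java.util.List;

void setup() {
  try {
    // authenticate against the Spreadsheets service with a Google account
    SpreadsheetService service = new SpreadsheetService("myProjectName");
    service.setUserCredentials("me@myemail.com", "mypassword");

    // request the feed listing every spreadsheet this account can see
    URL feedUrl = FeedURLFactory.getDefault().getSpreadsheetsFeedUrl();
    SpreadsheetFeed feed = service.getFeed(feedUrl, SpreadsheetFeed.class);

    // print the title of each spreadsheet
    List sheets = feed.getEntries();
    for (int i = 0; i < sheets.size(); i++) {
      SpreadsheetEntry entry = (SpreadsheetEntry) sheets.get(i);
      println(entry.getTitle().getPlainText());
    }
  } catch (Exception e) {
    e.printStackTrace();
  }
}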

Google Spreadsheets have already been put to good use by the Guardian Data Store, where some clever British folks have compiled interesting data like university drop-out rates, MPs’ expenses, and even a full list of swine flu cases by country. Using the API, we can access all of the information from our own spreadsheets and from public spreadsheets and use it to do whatever we’d like. The Google Spreadsheets API has some reasonably advanced features that allow you to construct custom tables and use structured queries to extract specific data from spreadsheets (see the Developer’s Guide), but for now I want to concentrate on doing the simplest possible thing – extracting data from individual table cells. Let’s walk through a quick example using Processing.

I’ve created an empty sketch, which you can download here. This sketch includes all of the .jar files that we need to get started with the Spreadsheet API, saving you the trouble of having to import them yourself (the Java Client Library for the Google Data APIs is here – note that the most recent versions are compiled in Java 1.6 and aren’t compatible with the latest version of Processing). I’ve also wrapped up some very basic functionality into a class called SimpleSpreadsheetManager – have a look at the code in that tab if you want to get a better idea of how the guts of this example function. For now, I’ll just show you how to use the pre-built Class to access spreadsheet data.
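If you’d rather not dig through the download, the skeleton below gives a rough idea of how a wrapper like this can be built on top of the API’s list feed. It’s a simplified sketch rather than the actual class from the example – in particular, it only shows lookup by column name, and it assumes the spreadsheet is published publicly (the "public"/"values" visibility and projection strings would change for private sheets):

import com.google.gdata.client.spreadsheet.SpreadsheetService;
import com.google.gdata.client.spreadsheet.FeedURLFactory;
import com.google.gdata.data.spreadsheet.WorksheetFeed;
import com.google.gdata.data.spreadsheet.WorksheetEntry;
import com.google.gdata.data.spreadsheet.ListFeed;
import com.google.gdata.data.spreadsheet.ListEntry;
import java.net.URL;
import java.util.List;

class SimpleSpreadsheetManager {
  SpreadsheetService service;
  List rows;                // one ListEntry per data row (the header row is not included)
  int currentTotalRows;

  void init(String appName, String user, String pass) {
    try {
      service = new SpreadsheetService(appName);
      service.setUserCredentials(user, pass);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  void fetchSheetByKey(String key, int worksheetIndex) {
    try {
      // fetch the spreadsheet's list of worksheets, then pick one by index
      URL wsUrl = FeedURLFactory.getDefault().getWorksheetFeedUrl(key, "public", "values");
      WorksheetFeed wsFeed = service.getFeed(wsUrl, WorksheetFeed.class);
      WorksheetEntry worksheet = (WorksheetEntry) wsFeed.getEntries().get(worksheetIndex);

      // the list feed gives us the worksheet's contents row by row
      ListFeed listFeed = service.getFeed(worksheet.getListFeedUrl(), ListFeed.class);
      rows = listFeed.getEntries();
      currentTotalRows = rows.size();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  // look up a cell by column header; headers are lowercased with spaces and punctuation
  // stripped, so 'Deaths, confirmed swine flu' becomes 'deathsconfirmedswineflu'
  String getCellValue(String columnName, int rowIndex) {
    ListEntry row = (ListEntry) rows.get(rowIndex);
    return row.getCustomElements().getValue(columnName);
  }
}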

First, we create a new instance of the SimpleSpreadsheetManager class, and initialize it with a project name along with our Google username and password:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
}

void draw() {
}

Now we need to load in our spreadsheet – or more specifically, our worksheet. Google Spreadsheets are collections of individual worksheets. Each spreadsheet has a unique ID which we can use to retrieve it. We can then ask for individual worksheets within that spreadsheet. If I visit the swine flu data spreadsheet from the Guardian Data Store in my browser, I can see that the URL looks like this:

http://spreadsheets.google.com/pub?key=rFUwm_vmW6WWBA5bXNNN6ug&gid=1

This URL shows me the spreadsheet ID (rFUwm_vmW6WWBA5bXNNN6ug). I can also see from the tabs at the top that the worksheet that I want (“COUNTRY TOTALS”) is the first worksheet in the list. I can now load this worksheet using my spreadsheet manager:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);
}

void draw() {
}

To get data out of individual cells, I have two options with the SimpleSpreadsheetManager. I can request a cell by its column and row indices, or by its column name and row index:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);

  // get the value of the third cell in the first column
  println(sm.getCellValue(0, 2));                          // returns 'Australia'

  // get the value of the third cell in the column labelled 'Deaths, confirmed swine flu'
  println(sm.getCellValue("deathsconfirmedswineflu", 2));  // returns '9'
}

void draw() {
}

If we wanted to find out which countries had more than 10 confirmed swine flu deaths, we could do this:

void setup() {
  size(500, 500);
  background(255);

  SimpleSpreadsheetManager sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);

  // print all of the countries with more than 10 confirmed deaths
  for (int i = 0; i < sm.currentTotalRows; i++) {
    String country = sm.getCellValue(0, i);
    String deaths = sm.getCellValue("deathsconfirmedswineflu", i);
    if (deaths != null && Integer.valueOf(deaths) > 10) println(country + " : " + deaths);
  }
}

void draw() {
}

With a bit more work (this took about 10 minutes), we can create a sketch to build an infographic linked to the spreadsheet – making it very easy to output new versions as the data is updated:

Swine Flu Deaths
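For reference, a stripped-down version of that sketch might look something like the code below – the bar scale, font, and layout here are arbitrary choices for illustration, not the exact code behind the image above:

SimpleSpreadsheetManager sm;

void setup() {
  size(500, 500);
  background(255);
  fill(0);
  textFont(createFont("Arial", 12));

  sm = new SimpleSpreadsheetManager();
  sm.init("myProjectName", "me@myemail.com", "mypassword");
  sm.fetchSheetByKey("rFUwm_vmW6WWBA5bXNNN6ug", 0);

  // draw a labelled horizontal bar for each country with confirmed deaths
  int y = 20;
  for (int i = 0; i < sm.currentTotalRows; i++) {
    String country = sm.getCellValue(0, i);
    String deaths = sm.getCellValue("deathsconfirmedswineflu", i);
    if (deaths != null && Integer.valueOf(deaths) > 0) {
      text(country, 10, y);
      rect(150, y - 8, Integer.valueOf(deaths) * 2, 10);  // two pixels per death
      y += 15;
    }
  }
}

void draw() {
}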

Not a particularly exciting demo – but it opens a lot of doors for working with data remotely and collaboratively. Rather than depending on the generic visualization tools built into typical spreadsheet applications, we can use Processing (or a similar platform like openFrameworks or Field) to create customized tools that are suited to a specific dataset. For my flu genomics project, we’re able to create a very specialized applet that examines how the genetic sequences for antigenic regions change over time – certainly not a function that comes standard with Microsoft Excel.

Combining Processing with Google Spreadsheets provides an easy way to bring almost any kind of data into Processing, and at the same time gives us a good way to store and manage that data. I’d definitely like to add some functionality to this really simple starting point. It would be reasonably easy to allow for creation of spreadsheets and worksheets, and I’d also like to look at implementing table record feeds and their associated structured queries. Ultimately, it would be good to package this all up into a Processing library – if someone has the time to take it on, I think it would be a very useful addition for the Processing community.
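As a teaser, a server-side structured query against the list feed might look roughly like the snippet below. This is a sketch of the underlying Java client rather than something SimpleSpreadsheetManager supports yet, and it assumes the Guardian sheet is reachable through the public "values" feed – the visibility/projection strings and the 'country' column name are guesses that would need to match the actual sheet:

import com.google.gdata.client.spreadsheet.SpreadsheetService;
import com.google.gdata.client.spreadsheet.FeedURLFactory;
import com.google.gdata.client.spreadsheet.ListQuery;
import com.google.gdata.data.spreadsheet.WorksheetFeed;
import com.google.gdata.data.spreadsheet.WorksheetEntry;
import com.google.gdata.data.spreadsheet.ListFeed;
import com.google.gdata.data.spreadsheet.ListEntry;
import java.net.URL;
import java.util.List;

void setup() {
  try {
    SpreadsheetService service = new SpreadsheetService("myProjectName");

    // find the worksheet's list feed URL, just as in the earlier examples
    URL wsUrl = FeedURLFactory.getDefault().getWorksheetFeedUrl("rFUwm_vmW6WWBA5bXNNN6ug", "public", "values");
    WorksheetFeed wsFeed = service.getFeed(wsUrl, WorksheetFeed.class);
    WorksheetEntry worksheet = (WorksheetEntry) wsFeed.getEntries().get(0);

    // let the server do the filtering instead of looping over every row ourselves
    ListQuery query = new ListQuery(worksheet.getListFeedUrl());
    query.setSpreadsheetQuery("deathsconfirmedswineflu > 10");
    ListFeed results = service.query(query, ListFeed.class);

    List rows = results.getEntries();
    for (int i = 0; i < rows.size(); i++) {
      ListEntry row = (ListEntry) rows.get(i);
      // 'country' assumes the first column's header is 'Country'
      println(row.getCustomElements().getValue("country"));
    }
  } catch (Exception e) {
    e.printStackTrace();
  }
}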

The single biggest benefit of Processing is that it is easy to learn and use. Over the years, I have taught dozens of designers and artists how to use Processing to enter a world that many had thought was reserved for programmers. I suspect that a similar gulf exists in science between those who gather the data and those who process and visualize it. I am interested to see if Processing can help to close that gap as well.

The Guardian Data Store serves as a good model for how a shared repository of scientific data might work. Such a project would be useful for scientists. But it would also be open to artists, hackers, and the generally curious, who might be able to use the available data in novel (and hopefully useful) ways.