Tag Archives: data

Wired UK, July ’09 – Visualizing a Nation’s DNA

Wired UK - NDNAD Spread (July, 2009)

In the spring, I was asked by Wired UK if I would be interested in producing something for the two-page ‘infoporn’ spread that runs in every issue. They had seen my experiments with the NYTimes APIs, and were interested in the idea of non-conventional data visualizations. After a bit of research, I proposed a piece about the UK’s National DNA Database. It was a subject that interested me, and I felt there would be some interesting political territory to cover. Luckily, Wired agreed.

By searching through parliamentary minutes and sifting through annual reports, I was able to put together a fair amount of information about the NDNAD, and I settled on a few key points that I wanted to convey with the piece. First, I wanted to demonstrate how large the database is – with over 4.5M individuals profiled, it’s the largest DNA database in the world, holding profiles for more than 7% of the UK’s population. As well as the size of the database, I wanted to show how it breaks down – by racial group, by age group, and in terms of those who have been charged versus those who are ‘innocent’. Finally, I wanted to talk about the difference between the UK’s population demographics and the demographics represented by the profiles in the NDNAD.

The central graphic, then, is a DNA strand with one dot for each of the profiles in the database – more than 5M! Of course, I didn’t do this by hand. I wrote a program in Processing that would generate a single, continuous strand that filled a given area. I was inspired by electron microscope images I had seen of real DNA, in which it looks like a loop of thread.

The nice looping threads were rendered using Perlin noise – I had a few parameters inside the program which allowed me to control how ‘messy’ the tangle became, and how much variation in thickness each strand had. While I was at it, I colour-coded each DNA dot to indicate the database’s ethnic breakdown. The result was a giant tangle, which was pretty much exactly what I wanted:

Wired UK - NDNAD Infographic

Here, you can see the individual dots, and the colour breakdown:

Wired NDNAD Graphic - detail
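In skeleton form, the noise-driven walk looks something like this – the parameters here are placeholders, and the production sketch added the colour-coding, the real dot counts, and the page-fitting layout:

float messiness = 2.5;  // placeholder: how sharply the strand can turn
float t = 0;            // position along the noise field
float x, y;

void setup() {
  size(800, 800);
  background(255);
  noStroke();
  fill(0, 160);
  x = width/2;
  y = height/2;
}

void draw() {
  for (int i = 0; i < 100; i++) {
    // Perlin noise gives a smoothly wandering heading, so consecutive
    // dots form a continuous, tangling thread rather than random scatter
    float angle = noise(t) * TWO_PI * messiness;
    x += cos(angle) * 2;
    y += sin(angle) * 2;
    // a gentle pull toward the centre keeps the tangle in a bounded area
    x += (width/2 - x) * 0.002;
    y += (height/2 - y) * 0.002;
    t += 0.005;
    ellipse(x, y, 3, 3);  // one dot per profile
  }
}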

The next step was to break the big tangle into three parts – one representing the bulk of the database, one representing the 948,535 profiles taken from people under the age of 18, and one representing the ~500,000 profiles from people who had never been charged, convicted, or warned by police. The original image had a static centre-point for the DNA loop; to break the tangle apart, I modified the program so that the centre-point could move to pre-determined points once certain counts had been reached. The final graphic changes centre-points three times. What was nice about this set-up was that it was easy to move and adjust the positioning of the graphic to fit the page layout. Rendering out a new version of the main image took just a few minutes.
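The centre-switching logic, again in skeleton form – the thresholds and centre positions below are placeholders, not the real NDNAD counts:

int[] thresholds = {3000, 5000};  // placeholder dot counts for the splits
float[][] centres = {{200, 400}, {400, 200}, {600, 400}};  // one per segment
int count = 0;
int segment = 0;
float x = 200, y = 400, t = 0;

void setup() {
  size(800, 600);
  background(255);
  noStroke();
  fill(0, 160);
}

void draw() {
  for (int i = 0; i < 100; i++) {
    // once enough dots have been drawn, jump to the next centre-point
    if (segment < thresholds.length && count >= thresholds[segment]) segment++;
    float angle = noise(t) * TWO_PI * 2.5;
    x += cos(angle) * 2 + (centres[segment][0] - x) * 0.002;
    y += sin(angle) * 2 + (centres[segment][1] - y) * 0.002;
    t += 0.005;
    ellipse(x, y, 3, 3);
    count++;
  }
}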

Wired UK - NDNAD Infographic

Working with these kinds of generative strategies meant that I could explore many variations. As you can see from the graphics posted here, I went through a variety of compositional and colour changes, all of which were relatively painless. Using Processing, I built a mini-application whose entire purpose was to create these DNA systems. I also built a second mini-app, which rendered out a set of pie charts used to display related information alongside the main graphic in the spread. I wanted these pie charts to fit in visually with the main graphic, so I created a very simple sketch to output charts from any set of data:

Wired NDNAD Pie Chart
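A bare-bones version of that sketch might look like this – the values and colours here are placeholders:

float[] values = {62, 21, 10, 7};  // placeholder data
color[] colours = {#3B8686, #79BD9A, #CFF09E, #0B486B};  // placeholder palette

void setup() {
  size(300, 300);
  background(255);
  noStroke();
  float total = 0;
  for (float v : values) total += v;
  float start = -HALF_PI;  // begin at 12 o'clock
  for (int i = 0; i < values.length; i++) {
    // each wedge's sweep is proportional to its share of the total
    float sweep = TWO_PI * values[i] / total;
    fill(colours[i]);
    arc(width/2, height/2, 250, 250, start, start + sweep);
    start += sweep;
  }
}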

There ended up being 11 of these little pie charts accompanying the main graphic. Again, by building tools, I was able to do some interesting things while avoiding large amounts of manual labour. Just how I like it! You can see the final result in the image at the top of this post, and of course, in Wired UK – the July issue hit newsstands a couple of weeks ago. If you are in the UK, go out and buy a copy!

Perhaps the most exciting thing that has come out of this process is that I have been asked to be a contributing editor for Wired UK. I’ll be creating some more pieces centred around data & information over the coming months (look for a Just Landed spread next month), and will also be getting the chance to showcase some work by various brilliant designers & artists in the UK and around the world.

So, stay tuned…

Just Landed: Processing, Twitter, MetaCarta & Hidden Data

Just Landed - Screenshot

I have a friend who has a Ph.D. in bioinformatics. Over a beer last week, we ended up discussing the H1N1 flu virus, epidemic modeling, and countless other fascinating and somewhat scary things. She told me that epidemiologists have been experimenting with alternate methods of creating transmission models – specifically, she talked about a group that was using data from the Where’s George? project to build a computer model for tracking and predicting the spread of contagions (which I read about again in this NYTimes article two days later).

Just Landed - Screenshot

This got me thinking about the data that is hidden in various social network information streams – Facebook & Twitter updates in particular. People share a lot of information in their tweets – some of it intentionally, and some of it recoverable with rudimentary searching. I wondered if it would be possible to extract travel information from people’s public Twitter streams by searching for the phrase ‘Just landed in…’.

Just Landed - Screenshot

The idea is simple: find tweets that contain this phrase, parse out the location they’d just landed in along with the home location listed on the author’s Twitter profile, and use the pair to map out travel in the Twittersphere (yes, I just used the phrase ‘Twittersphere’). Twitter’s search API gives us an easy way to get a list of tweets containing the phrase – I was working in Processing, so I used Twitter4J to acquire the data (a minimal version of that fetch is sketched after the examples below). The next question was trickier: how would I extract location data from tweets like these?

Queen_Btch: just landed in London heading to the pub for a drink then im of to bed…so tired who knew hooking up on an airplane would be so tiring =S
jjvirgin: Just landed in Maui and I feel better already … Four days here then off to vegas
checrothers: Just landed in Dakar, Senegal… Another 9 hours n I’ll be in South Africa two entire days after I left … Doodles
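Acquiring tweets like these is the easy part. A minimal sketch of the Twitter4J fetch, assuming a recent release of the library (method names vary between versions, and search now requires OAuth credentials in a twitter4j.properties file):

import twitter4j.*;

void setup() {
  Twitter twitter = TwitterFactory.getSingleton();
  try {
    QueryResult result = twitter.search(new Query("\"just landed in\""));
    for (Status status : result.getTweets()) {
      // the tweet text goes to the geo-parser; the profile location
      // field gives the traveller's (self-reported) home point
      println(status.getUser().getLocation() + " -> " + status.getText());
    }
  } catch (TwitterException e) {
    e.printStackTrace();
  }
}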

Extracting the locations turned out to be a lot easier than I thought. MetaCarta offers two different APIs that can pull latitude & longitude information out of a text query. They can take the tweets above and extract locations:

London, London, United Kingdom – “Latitude” : 51.52, “Longitude” : -0.1
Maui, Hawaii, United States – “Latitude” : 20.5819, “Longitude” : -156.375
Dakar, Dakar, Senegal – “Latitude” : 14.72, “Longitude” : -17.48

This seemed perfect, so I signed up for an API key and set to work hooking the APIs up to Processing. This was a little bit tricky, since the APIs require authentication. After a bit of back and forth, I managed to track down the right libraries to implement Basic Authorization in Processing. I ended up writing a set of classes to talk to MetaCarta – I’ll share these in a follow-up post later this week.
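The heart of it is simple enough: base64-encode ‘user:password’ and send it along as an Authorization header. A minimal sketch, using java.util.Base64 (a convenience that wasn’t available back then – hence the library hunt):

import java.net.*;
import java.io.*;
import java.util.Base64;

String fetchAuthorized(String urlString, String user, String pass) {
  try {
    HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
    // Basic Authorization: the encoded credentials ride along as a header
    String credentials = Base64.getEncoder().encodeToString((user + ":" + pass).getBytes());
    conn.setRequestProperty("Authorization", "Basic " + credentials);
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    StringBuilder response = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) response.append(line);
    in.close();
    return response.toString();
  } catch (Exception e) {
    e.printStackTrace();
    return null;
  }
}

Calling fetchAuthorized() with a query URL returns the raw response, ready for parsing.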

Now I had a way to take a series of tweets and extract location data from them. I did the same thing with the location information from each Twitter user’s profile page – I could have gotten this via the Twitter API, but it would cost one query per user, and Twitter limits requests to 100/hour, so I went the quick-and-dirty way and scraped this information from HTML. This gave me a pair of location points for every traveller, and placing those points on a map was reasonably easy with some assistance from the very informative map projection pages on Wolfram MathWorld.
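Each point needs just one conversion step. For a 3D globe like the one in the renders below, that’s the standard latitude/longitude-to-sphere formula – the axis conventions here are a choice, not the only option:

PVector latLonToXYZ(float lat, float lon, float radius) {
  float phi = radians(90 - lat);     // polar angle, measured from the north pole
  float theta = radians(lon + 180);  // azimuth
  return new PVector(radius * sin(phi) * cos(theta),
                     -radius * cos(phi),              // Processing's y axis points down
                     radius * sin(phi) * sin(theta));
}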

I’ll admit it took some time to get the whole thing working the way that I wanted it to, but Processing is a perfect environment for this kind of project – bringing in data, implementing 3D, exporting to video – it’s all relatively easy. Here’s a render from the system, showing about 36 hours of Twitter-harvested travel:

Just Landed – 36 Hours from blprnt on Vimeo.

And another, earlier render showing just 4 hours but running a bit slower (I like this pace a lot better – but not the file size of the 36-hour video rendered at this speed!):

Just Landed – Test Render (4 hrs) from blprnt on Vimeo.

Now, I realize this is a long way from a working model for predicting epidemics. But it sure does look cool, and I think it will be a good base for some more interesting work. Of course, as always, I’d love to hear your feedback and suggestions.

The Truth is In There: Research & Discovery with The Guardian Content API

Mulder & Scully

An article I wrote for The Guardian‘s Open Platform Blog was published earlier this week. It looks at some simple ways to use Processing to access information from the Guardian’s Content API. You can read the whole article and follow along with a short tutorial here.

The Guardian Open Platform

This morning, The Guardian announced the launch of The Guardian Open Platform, a suite of services designed to give developers access to Guardian content. Of course, this follows hot on the heels of The New York Times’ API releases, which I have discussed in detail on this blog and explored through a series of visualizations.

Visualizing the Guardian: Blair & Brown v.2

Comparisons to the NYTimes APIs will be inevitable. Instead of debating the various selling points of both systems, I’ll give a short introduction to what is available from the Guardian and show a few early sketches that I have made with the data.

The most interesting thing about the Guardian Content API is that it offers access to the full text of every article. Unfortunately, the first version of the API doesn’t let you control the verbosity of the return, so each call can bring back a lot of content to process, and making ‘simple’ visualizations of keyword frequency can take a lot longer. On the bright side, it also means we have much more data to work with. Though I’ve started by building some simple graphing tools, I am excited about digging into the full body text of the articles – I think there are a lot of possibilities there for linguistic analysis and the like.
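To give a sense of the scale issue: the frequency-counting step itself is trivial once the body text is in hand – a crude tally along these lines is all a simple keyword graph needs (the tokenizing here is deliberately naive):

import java.util.HashMap;

HashMap<String, Integer> countWords(String bodyText) {
  HashMap<String, Integer> counts = new HashMap<String, Integer>();
  // split on anything that isn't a word character – crude, but enough
  // for rough frequency graphs
  for (String word : bodyText.toLowerCase().split("\\W+")) {
    if (word.length() == 0) continue;
    Integer c = counts.get(word);
    counts.put(word, c == null ? 1 : c + 1);
  }
  return counts;
}

The slow part is simply pulling all of that full text down, article by article.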

Visualizing the Guardian:  Beckham and Rooney

The Content API doesn’t allow for faceted searching in the same way as the NYT Article Search API, but it does give us some fairly easy ways to refine and control our searches. For example, if I wanted to find out how many times the Guardian mentioned David Beckham, I might use a query like this:

http://api.guardianapis.com/content/search?q=beckham&format=xml&api-key=

I can narrow down on a specific chunk of time using the before and after parameters:

http://api.guardianapis.com/content/search?q=beckham&before=20080101&after=20070101&format=xml&api-key=

I can further refine the results of this search by using filters, which are at the core of how the API works, and can be very useful in locating specific sets of information. For example, this search would result in stories about Beckham and football:

http://api.guardianapis.com/content/search?q=beckham&filter=/football&before=20080101&after=20070101&format=xml&api-key=

Whereas this search would result in stories which had a cultural angle:

http://api.guardianapis.com/content/search?q=beckham&filter=/culture&before=20080101&after=20070101&format=xml&api-key=

A full list of filters can be retrieved through the API – and every content piece returned by a search also includes a list of its related tags & filter codes.

The return from these calls can be retrieved as XML, JSON, or Atom by changing the format parameter. Full documentation is available on the Guardian site.
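Hooking this up in Processing is a one-line fetch. The sketch below just prints the raw XML return – the key is a placeholder, and how you parse the result depends on which format you ask for:

String apiKey = "your-api-key";

void setup() {
  String url = "http://api.guardianapis.com/content/search"
    + "?q=beckham&filter=/football"
    + "&after=20070101&before=20080101"
    + "&format=xml&api-key=" + apiKey;
  String[] lines = loadStrings(url);  // loadStrings handles http URLs directly
  if (lines != null) {
    for (String line : lines) println(line);
  }
}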

Visualizing the Guardian: Surveillance & Privacy

Along with the Content API, the Guardian has also launched their Data Store – a collection of curated data that has been used in the past by the Guardian’s editorial staff when researching articles and producing projects. I will be writing more about this later this week, but for now you can check out the offerings on the Data Store site, and read a bit about it on the Guardian’s Data Blog.