Over the last year, I’ve produced five data-driven pieces for Wired UK. Four of them have been for the two-page infoporn spread that can be found in every issue. I’ve looked at the UK’s National DNA Database, used mined Twitter data to find people’s travel paths, and mapped traffic in some of the world’s busiest sea ports.
In the August issue, out on newsstands right now, I had a chance to work with some spectacular data and extremely talented people. The piece looks at a very, very big data set – cellular phone records from a pool of 10 million users in an anonymous European country. This data came (under a very strict layer of confidentiality) from Barabási Lab in Boston, where they have been using this information to find out some fascinating things about human mobility patterns.
In this post, I’ll walk through the process of creating this piece. Along the way, I’ll show some draft images and unused experiments that eventually evolved into the final project.
Working With Big Data
I can’t get into a lot of detail about the specifics of the data set, but phone records for 10 million individuals take up a lot of space. All told, the data for this project consisted of more than 5.5GB of flattened text files. I should say, at this point, that I don’t work on a supercomputer – I churn out all of my work from an often overheated 2.33GHz MacBook Pro. Since the deadline on this project was reasonably tight, I decided to rule out a distributed computing approach to get at all of this data, and instead chose to work with a subset of the full list of records. Working in Processing, I built a simple script that could filter out a smaller dataset from the complete files. I built several of these subsets at varying file sizes, giving me a nice set of data to work with in both the prototyping and production stages. This is a strategy that I often employ, even with smaller datasets – save the heavy lifting until the final render.
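A minimal sketch of this kind of filtering step, written here in Python rather than Processing. The real file layout is confidential, so everything about the format is an assumption: one record per line, tab-separated, with the caller's ID in the first column. The function names and the hash-based sampling are illustrative choices, not the original script.

```python
import zlib
from typing import Iterable, Iterator


def keep_user(user_id: str, fraction: float) -> bool:
    """Decide deterministically whether a user belongs to the sample.

    Hashing the ID (instead of drawing a random number per line) keeps
    every record of a sampled user, so call histories stay complete.
    """
    bucket = zlib.crc32(user_id.encode("utf-8")) % 10_000
    return bucket < fraction * 10_000


def filter_records(lines: Iterable[str], fraction: float,
                   sep: str = "\t") -> Iterator[str]:
    """Yield only the records whose caller (assumed first column) is sampled."""
    for line in lines:
        caller = line.split(sep, 1)[0]
        if keep_user(caller, fraction):
            yield line


def write_subset(src_path: str, dst_path: str, fraction: float) -> None:
    """Stream a large file line by line, writing the sampled records."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_records(src, fraction))
```

Calling `write_subset` with several values of `fraction` (say 0.001, 0.01 and 0.1) would produce subsets of different sizes from the same source file. Because the file is streamed one line at a time, memory use stays flat no matter how large the input is.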
The first thing I did with the trimmed-down data was to construct ‘call histories’ for each user in the set. I rendered out these histories as stacked bars of individual calls, which could then be placed into a histogram. Here’s a graph of about 10,000 users, sorted by their total time spent on the phone:
Here we see a very obvious power law distribution, with a few people talking a lot (really, a lot – 28.3 hours a week), and most callers talking relatively little (there is also a tail of text-only users at the very end). The problem here, of course, is that on a computer screen – or even in print – it’s hard to get into the data to learn anything useful. When I zoom into the graph, we can start to see the individual call histories (I’ve enlarged a few columns for detail). Here, long calls are rendered yellow, short calls are rendered red, and text messages are flat blue rectangles:
I took the same graph as above, and added another set of columns extending below – here the white bars show us how many ‘friends’ the individual callers had – i.e. how many people they are regularly talking to over the week:
If I sort this graph by number of friends (rather than total call time), we can see that the two measures (talkativeness, and number of friends) don’t seem to be strongly correlated:
It’s interesting to note here as well, that the data set includes linkage information – so I can also visualize who is calling who within our group of individuals:
There is some interesting information to be dug up in here, but the long aspect of the graph and the sheer amount of detail make it hard to use – particularly for a magazine piece.
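The bookkeeping behind these graphs (per-user call histories, the ‘friends’ count, and the comparison between the two measures) can be sketched in Python. This is an illustration under stated assumptions, not the Processing code used for the piece: records are assumed to be `(caller, callee, seconds)` tuples, a text message is assumed to have a duration of 0, and a ‘friend’ is assumed to mean a contact reached at least `min_contacts` times.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class CallHistory:
    total_seconds: int = 0
    # callee ID -> number of calls or texts sent to that person
    contacts: dict = field(default_factory=lambda: defaultdict(int))

    def friends(self, min_contacts: int = 2) -> int:
        """Count contacts reached at least min_contacts times (assumed definition)."""
        return sum(1 for n in self.contacts.values() if n >= min_contacts)


def build_histories(records):
    """Group (caller, callee, seconds) records into one history per caller."""
    histories = defaultdict(CallHistory)
    for caller, callee, seconds in records:
        history = histories[caller]
        history.total_seconds += seconds
        history.contacts[callee] += 1
    return histories


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 where it is undefined (zero variance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Made-up sample records, for illustration only.
records = [("a", "b", 600), ("a", "b", 300), ("a", "c", 60), ("b", "a", 0),
           ("c", "a", 30), ("c", "a", 45), ("c", "b", 20), ("c", "b", 10)]
histories = build_histories(records)

# Users ordered by total talk time, as in the first graph.
ranked = sorted(histories, key=lambda u: histories[u].total_seconds, reverse=True)

talk = [histories[u].total_seconds for u in ranked]
friend_counts = [histories[u].friends() for u in ranked]
r = pearson(talk, friend_counts)  # near 0 would suggest a weak linear relationship
```

Sorting by `friends()` instead of `total_seconds` gives the ordering used in the second graph, and the `contacts` dictionary holds the linkage information needed to draw who is calling whom.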
Ooh, and then Aaah.
The Infoporn section in Wired is a two page spread; I always think of it as needing to serve two separate purposes for two different kinds of readers. First, it needs to be visually pleasing. I want people to say ‘Oooh…!’ when they turn the page to it. Once they’re hooked, though, I want them to learn something – the ‘Aaah!’ moment.
The data used in the graphs above seemed too complex to do anything truly revealing with – so perhaps it could be built into something sexy enough to draw an ‘Oooh!’ or two? In order to fit the long tails of these graphs onto the page, I wondered if I could add a bit of a curl to them. To make this structural change evident, I turned the graphs on a slight angle and rendered them in 3D. Here, we see five of these graphs, totaling about a million individual users, arranged into a single, tower-like shape:
While these structures took a little while to render, I could quite easily generate a unique set of them, which I assembled as a line trailing off to the page edge on the left:
So far, the visuals for this project only tell a part of the story: that our individual calling habits fall into predictable patterns when placed within the larger whole (some excellent text from Michael Dumiak helps clarify this in the final piece). There’s another crucial piece, though. Cell phone usage data is inherently locative, since our provider always knows from which of their cell towers we are placing the call.
This is where the fun starts – we can use this locative data to track the mobility patterns of individual people (it’s worth saying here that all of the data that I worked with was anonymized). To do this, I created a tool (again, in Processing) to make ‘mobility cubes’ – which show a history of an individual’s movements over time:
The individual above, for example, travels around an area less than a square kilometer over a period of just under three days. If I flatten this graph, we can see that this person travels mostly between two locations:
From the data, we can identify a lot of individuals like this – commuters – who travel short distances between two places (home, and work). We can also find travelers (people who cover a long distance in a short period of time):
And others who seem to follow more elaborate (but often still regular) mobility patterns:
We can assemble a ‘mobility cube’ for each individual in the database – and very quickly gain a mechanism for recognizing patterns amongst these people:
Which brings us to the underlying point of the piece – we are all leaving digital trails behind us, as we make our way around our individual lives. These trails are largely considered individual – even ethereal – yet technology is making these trails more visible and more readable every day.
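To make the pattern recognition described above a little more concrete, here is a rough Python sketch of how a track might be ‘flattened’ and sorted into commuter, traveler, or other. It is not the tool built for the piece. The track format (`(hours, x_km, y_km)` per observed cell tower), the function names and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from math import hypot


def flatten(track):
    """Drop the time axis: count observations at each (x, y) tower position."""
    return Counter((x, y) for _, x, y in track)


def top_two_share(track):
    """Fraction of observations that fall on the two most visited positions."""
    top = sum(n for _, n in flatten(track).most_common(2))
    return top / len(track)


def span_km(track):
    """Diagonal of the bounding box around every observed position."""
    xs = [x for _, x, _ in track]
    ys = [y for _, _, y in track]
    return hypot(max(xs) - min(xs), max(ys) - min(ys))


def classify(track, commuter_share=0.8, traveler_km=100.0):
    """Label a track using two hypothetical thresholds."""
    if not track:
        return "other"
    if span_km(track) >= traveler_km:
        return "traveler"
    if len(flatten(track)) >= 2 and top_two_share(track) >= commuter_share:
        return "commuter"
    return "other"


# Made-up tracks, for illustration only.
commuter = [(0, 0.0, 0.0), (9, 0.6, 0.4), (17, 0.0, 0.0), (33, 0.6, 0.4),
            (41, 0.0, 0.0), (57, 0.6, 0.4), (65, 0.0, 0.0)]
traveler = [(0, 0.0, 0.0), (2, 80.0, 10.0), (5, 250.0, 40.0)]
wanderer = [(0, 0.0, 0.0), (5, 1.0, 2.0), (9, 3.0, 1.0),
            (20, 2.0, 4.0), (30, 5.0, 5.0)]

labels = [classify(t) for t in (commuter, traveler, wanderer)]
```

The thresholds here are deliberately crude. Real tower data would need some handling of noise, such as a phone bouncing between neighbouring towers while its owner sits still.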
Of course, to see the final piece – the polished assembly of some of the drafts and artifacts you’ve seen in this post – you’ll have to buy the magazine. Wired UK is available on newsstands in the UK, and to all of our clever subscribers.
If you want to read more about this – and you should – I’d highly recommend Albert-László Barabási’s Bursts, which goes into much more detail about human mobility & predictability.
Finally, huge thanks have to go out to László and his team at the lab, without whom this piece would have never made it to print!