Tag Archives: infoviz

Wired UK, Barabási Lab and BIG data

Over the last year, I’ve produced five data-driven pieces for Wired UK. Four of them have been for the two-page infoporn spread that can be found in every issue. I’ve looked at the UK’s National DNA Database, used mined Twitter data to find people’s travel paths, and mapped traffic in some of the world’s busiest sea ports.

In the August issue, out on newsstands right now, I had a chance to work with some spectacular data and extremely talented people. The piece looks at a very, very big data set – cellular phone records from a pool of 10 million users in an anonymous European country. This data came (under a very strict layer of confidentiality) from Barabási Lab in Boston, where they have been using this information to find out some fascinating things about human mobility patterns.

In this post, I’ll walk through the process of creating this piece. Along the way, I’ll show some draft images and unused experiments that eventually evolved into the final project.

Working With Big Data

I can’t get into a lot of detail about the specifics of the data set, but needless to say, phone records for 10 million individuals take up a lot of space. All told, the data for this project consisted of more than 5.5GB of flattened text files. I should say, at this point, that I don’t work on a supercomputer – I churn out all of my work from an often overheated 2.33GHZ MacBook Pro. Since the deadline was reasonably tight on this project, I decided to rule out a distributed computing approach to get at all of this data, and instead chose to work with a subset of the full list of records. Working in Processing, I built a simple script that could filter out a smaller dataset from the complete files. I built several of these at varying file sizes, giving me a nice set of data to work with both in prototyping and in production stages. This is a strategy that I often employ, even with more minimal datasets – save the heavy lifting until the final render.

The first thing I did with the trimmed-down data was to construct ‘call histories’ for each user in the set. I rendered out these histories as stacked bars of individual calls, which could then be placed into a histogram. Here’s a graph of about 10,000 users, sorted by their total time spent on the phone :

Wired UK & Barabási Lab: Process

Here we see a very obvious power law distribution, with a few people talking a lot (really, a lot – 28.3 hours a week), and most callers talking relatively little (these is also a tail of text-only users at the very end). The problem here, of course, is that on a computer screen – or even in print – it’s hard to get into the data to learn anything useful. When I zoom into the graph, we can start to see the individual call histories (I’ve enlarged a few columns for detail). Here, long calls are rendered yellow, short calls are rendered red, and text messages are flat blue rectangles:

Wired UK & Barabási Lab: Process

I took the same graph as above, and added another set of columns extending below – here the white bars show us how many ‘friends’ the individual callers had – ie. how many people they are regularly talking to over the week:

Wired UK & Barabási Lab: Process

If I sort this graph by number of friends (rather than total call time), we can see that the two measures (talkativeness, and number of friends) don’t seem to be strongly correlated:

Wired UK & Barabási Lab: Process

It’s interesting to note here as well, that the data set includes linkage information – so I can also visualize who is calling who within our group of individuals:

Wired UK & Barabási Lab: Process

There is some interesting information to be dug up in here, but the long aspect of the graph and the general over-detail involved makes it not very usable – particularly for a magazine piece.

Ooh, and then Aaah.

The Infoporn section in Wired is a two page spread;  I always think of it as needing to serve two separate purposes for two different kinds of readers. First, it needs to be visually pleasing. I want people to say ‘Oooh…!’ when they turn the page to it. Once they’re hooked, though, I want them to learn something – the ‘Aaah!’ moment.

The data used in the graphs above seemed too complex to do anything truly revealing with – so perhaps it could be built into something sexy enough to draw an ‘Oooh!’ or two? In order to fit the long tails of these graphs onto the page, I wondered if I could add a bit of a curl to them. To make this structural change evident, I turned the graphs on a slight angle and rendered them in 3D. Here, we see five of these graphs, totaling about a million individual users, arranged into a single, tower-like shape:

Wired UK & Barabási Lab: Process

While these structures took a little while to render, I could quite easily generate a unique set of them, which I assembled as a line trailing off to the page edge on the left:

Wired UK & Barabási Lab: Process

Getting Personal

So far, the visuals for this project only tell a part of the story: that our individual calling habits fall into predictable patterns when placed with the larger whole (some excellent text from Michael Dumiak helps clarify this in the final piece). There’s another crucial piece, though. Cel phone usage data is inherently locative, since our provider always knows from which of their cel towers we are placing the call.

This is where the fun starts – we can use this locative data to track the mobility patterns of individual people (it’s worth saying here that all of the data the I worked with was anonymized). To do this, I created a tool (again, in Processing) to make ‘mobility cubes’ – which show a history of an individual’s movements over time:

Wired UK & Barabási Lab: Process

The individual above, for example, travels around an area less than a square kilometer over a period of just under three days. If I flatten this graph, we can see that this person travels mostly between two locations:

Wired UK & Barabási Lab: Process

From the data, we can identify a lot of individuals like this – commuters – who travel short distances between two places (home, and work). We can also find travelers (people who cover a long distance in a short period of time):

Wired UK & Barabási Lab: Process

And others who seem to follow more elaborate (but often still regular) mobility patterns:

Wired UK & Barabási Lab: Process

We can assemble a ‘mobility cube’ for each individual in the database – and very quickly gain a mechanism for recognizing patterns amongst these people:

Wired UK & Barabási Lab: Process

Which brings us to the underlying point of the piece – we are all leaving digital trails behind us, as we make our way around our individual lives. These trails are largely considered individual – even ethereal – yet technology is making these trails more visible and more readable everyday.

Of course, to see the final piece – the polished assembly of some of the drafts and artifacts you’ve seen in this post – you’ll have to buy the magazine. Wired UK is available on newsstands in the UK, and to all of our clever subscribers.

If you want to read more about this – and you should – I’d highly recommend Albert-László Barabási’s Bursts, which goes into much more detail about human mobility & predictability.

Finally, huge thanks have to go out to László and his team at the lab, without whom this piece would have never made it to print!

Two Sides of the Same Story: Laskas & Gladwell on CTE & the NFL

Laskas / Gladwell

In October, I read a fascinating article on GQ.com about head injuries among former NFL players. Written by Jeanne Marie Laskas, the article was a forensic detective story, documenting a little known doctor’s efforts to bring the brain trauma issue to the attention of the medical community, the NFL, and the general public. It is a great read – an in-depth investigative piece with engaging personalities and plenty of intrigue.

A few weeks later, I picked up a copy of The New Yorker on my way home from Pittsburgh. I was surprised to see, on the cover, a promo for an article by Malcolm Gladwell about – you guessed it – brain trauma and the NFL. After having read both articles, I was surprised by how much these two investigative pieces differed. At the time I thought about doing a visualization to investigate, but somehow the idea slipped out of my head.

Until this weekend. I spent a few (okay, more like eight) hours putting together a tool with Processing that would examine some of the similarities and differences between the two articles. The most interesting data ended up coming from word usage analysis (I looked at sentences and phrases as well, but with not much luck). The base interface for the tool is a XY chart of the words – they are positioned vertically by their average position in the articles, and horizontally by which article they occur in more. The words in the centre are shared by both articles. Total usage affects the scale of the words, so we can see quite quickly which words are used most, and in which articles.

By focusing our attention on the big words which lie more or less in the center, we can see what the two articles have in common: brains, football, dementia, and a disease called CTE. What is perhaps more interesting is what lies on the outer edges; the subjects and topics that were covered by one author and not by the other.

Laskas’ article is about Dr. Bennet Omalu, dead NFL players (Mike Webster), Omalu’s colleagues (Dr. Julian Bailes & Bob Fitzsimmons) and the NFL (click on the images to see bigger versions):

Laskas / Gladwell

Gladwell’s article, on the other hand, focuses partly on another scientist, Dr. Ann McKee, the sport of football in general, as well as s central metaphor in his piece – a comparison between football and dogfighting (the bridge between the two is Michael Vick):

Laskas / Gladwell

The gulf between the two main scientific personalities profiled in the articles is interesting. Omalu and McKee are both experts in chronic traumatic encephalopathy (CTE) so it makes sense that they each appear in both articles (Omalu was the first to describe the condition; McKee. However, we see when we isolate these names that Laskas references Dr. Omalu almost exclusively (Omalu is mentioned 96 times by Laskas and only 6 times by Gladwell)* – it’s worth noting here that the Laskas article is 11.4% longer than the Gladwell piece – JT:

Laskas / Gladwell

In contrast, Laskas only refers to McKee once (Dr. McKee is mentioned by Gladwell 21 times):

Laskas / Gladwell

What is the relationship between Dr. McKee and Dr. Omalu? McKee is on the advisory board for the Sports Legacy Institute, a group which studies the results of brain trauma on athletes. SLI was founded by four individuals, including Bennet Omalu and the group’s current head, Chris Nowinski, a former professional wrestler. Omalu and the other three founders of SLI have now left the group, but it apparently continues to be a high-profile presence in the CTE field. Laskas writes:

“Indeed, the casual observer who wants to learn more about CTE will be easily led to SLI and the Boston group. There’s an SLI Twitter link, an SLI awards banquet, an SLI Web site with photos of Nowinski and links to videos of him on TV and in the newspapers. Gradually, Omalu’s name slips out of the stories, and Bailes slips out, and Fitzsimmons, and their good fight. As it happens in stories, the telling and retelling simplify and reduce.”

I wonder how much the path of an journalistic piece is affected by who you talk to first? If I had to guess, I’d say Gladwell started with the SLI, whereas Laskas seemed to have began from Dr. Omalu. This single decision could account for many of the differences between the two articles.

Other word-use choices might also give insight into editorial positions. Laskas, for example, uses the term NFL (below, at left) a lot – 57 times to Gladwell’s 11. Gladwell, on the other hand, talks more about the sport in general, using the word ‘football’ (below, at right)  40 times to Laskas’ 23:

Laskas / Gladwell Laskas / Gladwell

According to Laskas, Dr. Omalu has been roundly shunned by the NFL – they have attempted to discredit his research on many occasions (attention that has not been so pointedly focused on Dr. McKee and the SLI). Though both articles are critical of the League, it seems clear both from the article and the data that Laskas and GQ have taken a more severe stance – the addresses the NFL much more often, and with more disdain.

This exercise of quantitatively analyzing a pair of articles may seem like a strange way to spend a weekend, but it helped me to more clearly understand the differences between the two stories and to consider my reactions to each. I uncovered a few things that I hadn’t picked up at first, and at the same time was able to reinforce some of the feelings that I had after reading the two articles.

It was also another opportunity to build a quick, lightweight visualization tool dedicated to a fairly specific topic (though in this case the tool could be used to compare any two bodies of text). This strategy holds a lot of appeal to me and I think deserves attention alongside the generalist approach that we tend to see a lot of on the web and in data visualization. It seems to me that this type of investigative technique could be useful for researchers of various stripes.

I will be releasing source code for this project as well as compiled applications for Mac, Linux & Windows. In the meantime, here’s a short video of how the interface behaves:

Two Sides of the Same Story: Laskas & Gladwell on CTE & the NFL from blprnt on Vimeo.

7 Days of Source Day #2: NYTimes 365/360

NYTimes: 365/360 - 2009 (in color)

Project: NYTimes 365/360
Date: February, 2009
Language: Processing
Key Concepts: Data Visualization, NYTimes Article Search API, HashMaps & ArrayLists

Overview:

Many have you have already seen the series of visualizations that I created early in the year using the newly-released New York Times APIs. The most complex of these were in the 365/360 series in which I tried to distill an entire year of news stories into a single graphic. The resulting visualizations (2009 is picture above) capture the complex relationships – and somewhat tangled mess – that is a year in the news.

This release is a single sketch. I’ll be releasing the Article Search API Processing code as a library later in the week, but I wanted to show this project as it sits, with all of the code intact. The output from this sketch is a set of .PDFs which are suitable for print. Someday I’d like to show the entire series of these as a set of 6′ x 6′ prints – of course, someday I’d also like a solid-gold skateboard and a castle made of cheese.

That said, really nice, archival quality prints from this project (and the one I’ll be releasing tomorrow) are for sale in my Etsy shop. I realize that you’ll all be able to make your own prints now (and you are certainly welcome to do so) – but if you really enjoy the work and want to have a signed print to hang on your wall, you know who to talk to.

Getting Started:

Put the folder ‘NYT_365_360’ into your Processing sketch folder. Open Processing and open the sketch from the File > Sketchbook menu. You’ll find detailed instructions in the header of the main tab (the NYT_365_360.pde file).

Thanks:

Most of the credit for this sketch goes to the clever kids at the NYT who made the amazing Article Search API. This is the gold standard of APIs, and really is a dream to use. As you’ll see if you dig into the code, each of these complicated graphics is made with just 21 calls to the API. I can’t imagine the amount of blood, sweat, and tears that would go into making a graphic like this the old-fashioned way.

Speaking of gold standards, Robert Hodgin got me pointed to ArrayLists in the first place, and has been helpful many times over the last few years as I’ve tried to solve a series of ridiculously simple problems in Processing. Thanks, Robert!

Download: NYT365.zip (140k)


CC-GNU GPL

This software is licensed under the CC-GNU GPL version 2.0 or later.

Wired UK, July ’09 – Visualizing a Nation’s DNA

Wired UK - NDNAD Spread (July, 2009)

In the spring, I was asked by Wired UK if I would be interested in producing something for the two-page ‘infoporn’ spread that runs in every issue. They had seen my experimentations with the NYTimes APIs, and were interested in the idea of non-conventional data visualizations. After a bit of research, I proposed an piece about the UK’s National DNA Database. It was a subject that interested me and I felt that there would be some interesting political territory to cover. Luckily, Wired agreed.

By searching through Parliamentary minutes, and sifting over annual reports, I was able to put together a fair amount of information about the NDNAD and I settled on a few key points that I wanted to convey with the piece. First, I wanted to somehow demonstrate how large the database is – with over 4.5M individuals profiled, it’s the largest DNA database in the world. It holds profiles for more than 7% of the UK’s population. As well as the size of the database, I wanted to show how it broke down – in racial groups, in age groups, and in terms of those who have been charged versus those who are ‘innocent’. Finally, I  wanted to talk about the difference between the UK’s population demographics and the demographics represented by the profiles in the NDNAD.

The central graphic, then, is a DNA strand with one dot for each of the profiles in the database – more than 5M! Of course, I didn’t do this by hand. I wrote a program in Processing that would generate a single, continuous strand that filled up a certain size area. I was inspired by electron microscope images that I had seen of real DNA in which it looks like a loop of thread:

The nice looping threads were rendered using Perlin noise – I had a few parameters inside the program which allowed me to control how ‘messy’ the tangle became, and how much variation in thickness each strand had. While I was at it, I colour-coded each DNA dot to indicate the database’s ethnic breakdown. The result was a giant tangle, which was pretty much exactly what I wanted:

Wired UK - NDNAD Infographic

Here, you can see the individual dots, and the colour breakdown:

Wired NDNAD Graphic - detail

The next step was to break down the big tangle into three parts – one representing the bulk of the database, one representing the 948,535 profiles that were taken from people under the age of 18, and one representing the ~500,000 profiles from people who had never been charged, convicted, or warned by police. The original image had a static centre-point for the DNA loop; to break the tangle apart, I modified the program so that the centrepoint could move to pre-determined points once certain counts had been reached. The final graphic changes centre-points three times. What was nice about this set-up what that it was easy to move and adjust the positioning of the graphic to fit the page layout. Rendering out a new version of the main image took just a few minutes.

Wired UK - NDNAD Infographic

Working with these kinds of generative strategies meant that I could explore many variations. As you can see from the graphics posted here, I went through a variety of compositional and colour changes, all of which were relatively painless. Using Processing, I built a mini-application whose entire purpose was to create these DNA systems. I also built a second min-app, which rendered out a set of pie-charts that were used to display related information along with the main graphic in the spread. I wanted these pie charts to fit in visually with the main graphic, so I created a very simple sketch to output charts from any set of data:

Wired NDNAD Pie Chart

There ended up being 11 of these little pie-charts that accompanied the main graphic. Again, by building tools, I was able to do some interesting things, while at the same time avoiding large amounts of manual labour. Just how I like it! You can see the final result in the image at the top of this post, and of course, in Wired UK – the July issue hit newsstands a couple of weeks ago. If you are in the UK, go out and buy a copy!

Perhaps the most exciting thing that has came out of this process is that I have been asked to be a contributing editor for Wired UK. I’ll be creating some more pieces centred around data & information over the coming months (look for a Just Landed spread next month), and will also be getting the chance to showcase some work by various brilliant designers & artists in the UK and around the world.

So, stay tuned…

NYT: This was 1984

NYTimes: 365/360 - 1985

This series of images uses the faceted searching abilities of the NYTimes Article Search API to construct maps of the top organizations & people mentioned in articles for a given news year. Connections between these entities are drawn, so that relationships can be found and followed.

NYTimes: 365/360 - 2001

NYTimes: 365/360 - 2009

The maps posted so far in the Flickr set are general ones – but these can also be generated for any refined keyword search. Similarly, while the current maps are for individual years, a map could be made for any given period of time (the Bush presidency, the Gulf War, September 2001), or indeed for the whole period of time available through the API (1981-present).

One of the best things about using Processing for these types of projects is that the final result can be output to different formats. These, for example, can be output very easily as .PDFs, and I do think they’d look particularly striking as wall-sized prints. 25 years * 10 feet… does anyone have a 250-foot long wall they can lend me?