Category Archives: Information Visualization

Before Us is the Salesman’s House


When the dust settles on the 21st century, and all of the GIFs have finished animating, the most important cultural artifacts left from the digital age may very well be databases.

How will the societies of the future read these colossal stores of information?

Consider the eBay databases, which contain records of every transaction that happens, and has ever happened, on the world’s biggest marketplace. $2,094 worth of goods are sold on eBay every second. The records kept about this buying and selling go far beyond dollars and cents. Time, location and identity come together with text and images to leave a record that documents both individual events and collective trends across history and geography.

This summer, Mark Hansen and I created an artwork that investigates this idea of the eBay database as a cultural artifact. Produced in cooperation with eBay, Inc., and the ZERO1 Biennial, the piece was installed outside of the eBay headquarters in San Jose and ran from dusk to midnight, September 11th to October 12th.

As a conceptual foundation for the piece, we chose a much more traditional creative form than the database: the novel. The piece unfolds in movements, and each movement begins with a selection of text. The first one every day was a stage direction from the beginning of Death of a Salesman, which reads:

A melody is heard, played upon a flute. It is small and fine, telling of grass and trees and the horizon. The curtain rises.
Before us is the Salesman’s house. We are aware of towering, angular shapes behind it, surrounding it on all sides. Only the blue light of the sky falls upon the house and forestage; the surrounding area shows an angry glow of orange. As more light appears, we see a solid vault of apartment houses around the small, fragile-seeming home. An air of the dream clings to the place, a dream rising out of reality. The kitchen at center seems actual enough, for there is a kitchen table with three chairs, and a refrigerator. But no other fixtures are seen. At the back of the kitchen there is a draped entrance, which leads to the living room. To the right of the kitchen, on a level raised two feet, is a bedroom furnished only with a brass bedstead and a straight chair. On a shelf over the bed a silver athletic trophy stands. A window opens onto the apartment house at the side.

From this text, we begin by extracting items¹ that might be bought on eBay:


Flute, grass, trees, curtain, table, chairs, refrigerator. This list now serves as a kind of inventory, each item explored in a small set of data sketches that examine distribution: Where are these objects being sold right now? How much are they being sold for? What does the aggregate of all of the refrigerators sold in the USA look like?
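
As footnote 1 below explains, the real extraction combines a text-analysis algorithm with human processing; purely as an illustration of the idea, here’s a toy version of that first step in Processing (2.0’s StringList makes this tidy), matching a passage against a hand-made whitelist of sellable nouns. The whitelist and the passage variable are mine, not the piece’s:

String[] sellable = { "flute", "grass", "trees", "curtain", "table", "chairs", "refrigerator" };
String passage = "A melody is heard, played upon a flute. It is small and fine, telling of grass and trees and the horizon.";

//Break the passage into lowercase words and keep the ones on our whitelist
String[] words = splitTokens(passage.toLowerCase(), " ,.;:");
StringList inventory = new StringList();
for (int i = 0; i < words.length; i++) {
  for (int j = 0; j < sellable.length; j++) {
    if (words[i].equals(sellable[j]) && !inventory.hasValue(sellable[j])) {
      inventory.append(sellable[j]); //keep each matched item once
    }
  }
}
println(inventory);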


From this map of objects for sale, the program selects one at random to act as a seed. For example, a refrigerator being sold for $695 in Milford, New Hampshire, will switch the focus of the piece to this town of fifteen thousand on the Souhegan River. The residents of Milford have sold many things on eBay over the years – but what about books? Using historical data, we investigate the flow of books in and out of the town, both sold and bought by residents.


Finally, the program selects a book from this list² and restarts the cycle, this time with a new extracted passage, new objects, new locations, and new stories. Over the course of an evening, about a hundred cycles are completed, visualizing thousands of current and historic exchanges of objects.

Ultimately, the size of a database like eBay’s makes a complete, close reading impossible – at least for humans. Rather than an exhaustive tour of the data, then, our piece can be thought of as a distant reading³, a kind of a fly-over of this rich data landscape. It is an aerial view of the cultural artifact that is eBay.

A motion sample of three movements from the piece can be seen in this video.

Before Us is the Salesman’s House was projected on a 30′ x 20′ semi-transparent screen, suspended in the entryway to the main building (I’m afraid lighting conditions were far from ideal for photography). It was built using Processing 2.0, MongoDB & Python. Special thanks to Jaime Austin, Phoram Meta, Jagdish Rishayur, David Szlasa and Sean Riley.

  1. Items are extracted through a combination of a text-analysis algorithm and, where needed, processing by helpful folks on Mechanical Turk.
  2. All text used comes from Project Gutenberg, a database of more than 40,000 free eBooks.
  3. For more about distant reading, read this essay by Franco Moretti or, for a summary, this article from the NYTimes.

Text Comparison Tool: SOURCE CODE

About this time last year, I built a lightweight tool in Processing to compare two articles that I had read about head injuries in the NFL. Later, I extended the tool to compare any two texts, and promised a source release.

Well, it only took 12 months, but I’ve finally cleaned up the code to a point where I think it will be reasonably easy to use and helpful for those who might want to learn a bit more about the code.

I always find myself in a tricky position with source releases. Often, as in the case of this tool, I have ideas about what the project should look like before it gets released. Here, I wanted to build an interface to allow people to select different text sources within the app, so that people could use the app without having to compile it from Processing. This is what delayed the release of the code for a year – I was waiting for the time to get this last piece done.

Two weeks ago, though, I had a chance to speak at an event with Tahir Hemphill. A few months back, I had sent Tahir a messy version of the project for use with his Hip Hop Word Count initiative, and he used it to analyze a famous rap battle between Nas and Jay-Z:

Text Correlation Tool: Nas Versus Jay / Ether Versus The Takeover from Staple Crops on Vimeo.

This reminded me of a fairly valuable lesson as far as source code is concerned: If it works, it’s probably good enough to release.

So, without too much further ado, here’s a link to download the Text Comparison Tool:

Text Comparison Tool (1.1MB)

It’s pretty easy to use. First, you’ll need Processing to open and work with the sketch. Also, you’ll need the toxiclibs core library installed. Assuming you have those two things, these are the steps:

1. Drag the unzipped folder into your sketchbook.
2. Place your text files in the sketch’s data folder.
3. Open the sketch.
4. Look for the code block at the top of the main tab where the article information is set. It’s pretty clearly marked, and looks like this:

String title1 = "Tokyo";
String url1 = "asia.txt";
String desc1 = "Suntory Hall";
String date1 = "November 14th, 2009";
color color1 = #140047;

String title2 = "Cairo";
String url2 = "cairo.txt";
String desc2 = "Cairo University";
String date2 = "June 4th, 2009";
color color2 = #680014;

5. Replace the information here with appropriate information for your files.
6. Run the sketch!

That’s it. If you are getting strange results, you can tweak the clean() and cleanBody() methods at the bottom of the main tab to control how your text is filtered.
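
If you’re curious what that filtering might look like, here’s the general shape of a clean() method – an illustration of the kind of thing you could write there, not necessarily the exact code that ships in the download:

//One possible clean(): lowercase the word and strip anything that
//isn't a letter, so "Head-first!" and "head first" match up
String clean(String s) {
  s = s.toLowerCase();
  s = s.replaceAll("[^a-z]", "");
  return s;
}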

Hopefully I’ll still find the time to package this thing up in a bit more of a user-friendly form. In the meantime, though, I hope people will find this useful as an exploratory tool. Note that at any time you can press the ‘s’ key to save out an image – if you find some interesting texts to compare, let me know!

Wired UK, Barabási Lab and BIG data

Over the last year, I’ve produced five data-driven pieces for Wired UK. Four of them have been for the two-page infoporn spread that can be found in every issue. I’ve looked at the UK’s National DNA Database, mined Twitter data to trace people’s travel paths, and mapped traffic in some of the world’s busiest sea ports.

In the August issue, out on newsstands right now, I had a chance to work with some spectacular data and extremely talented people. The piece looks at a very, very big data set – cellular phone records from a pool of 10 million users in an anonymous European country. This data came (under a very strict layer of confidentiality) from Barabási Lab in Boston, where they have been using this information to find out some fascinating things about human mobility patterns.

In this post, I’ll walk through the process of creating this piece. Along the way, I’ll show some draft images and unused experiments that eventually evolved into the final project.

Working With Big Data

I can’t get into a lot of detail about the specifics of the data set, but suffice it to say, phone records for 10 million individuals take up a lot of space. All told, the data for this project consisted of more than 5.5GB of flattened text files. I should say, at this point, that I don’t work on a supercomputer – I churn out all of my work from an often overheated 2.33GHz MacBook Pro. Since the deadline on this project was reasonably tight, I ruled out a distributed computing approach, and instead chose to work with a subset of the full list of records. Working in Processing, I built a simple script that could filter a smaller dataset out of the complete files. I built several of these subsets at varying sizes, giving me a nice range of data to work with in both the prototyping and production stages. This is a strategy I often employ, even with smaller datasets – save the heavy lifting for the final render.
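
The filtering script itself is nothing fancy. A minimal version of the idea – assuming one record per line, and with made-up file names and sampling rate – looks something like this:

//Write every Nth record from the huge flat file into a smaller working file
//"calls_full.txt", "calls_sample.txt" and the sampling rate are placeholders
int keepEvery = 100; //a 1% sample

BufferedReader reader = createReader("calls_full.txt");
PrintWriter writer = createWriter("calls_sample.txt");
String line;
int count = 0;
try {
  while ((line = reader.readLine()) != null) {
    if (count % keepEvery == 0) {
      writer.println(line);
    }
    count++;
  }
  reader.close();
}
catch (IOException e) {
  e.printStackTrace();
}
writer.flush();
writer.close();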

The first thing I did with the trimmed-down data was to construct ‘call histories’ for each user in the set. I rendered these histories as stacked bars of individual calls, which could then be placed into a histogram. Here’s a graph of about 10,000 users, sorted by their total time spent on the phone:


Here we see a very obvious power law distribution, with a few people talking a lot (really, a lot – 28.3 hours a week), and most callers talking relatively little (there is also a tail of text-only users at the very end). The problem here, of course, is that on a computer screen – or even in print – it’s hard to get into the data to learn anything useful. When we zoom into the graph, we can start to see the individual call histories (I’ve enlarged a few columns for detail). Here, long calls are rendered yellow, short calls are rendered red, and text messages are flat blue rectangles:

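The color rule driving those bars is simple enough to sketch. Something along these lines – with the one-hour cutoff and the exact color values as my stand-ins, not the production numbers:

//Map a single event to a fill color: texts are flat blue,
//calls fade from red (short) to yellow (long)
color callColor(int seconds, boolean isText) {
  if (isText) {
    return color(40, 80, 255);
  }
  float t = constrain(seconds / 3600.0f, 0, 1); //0..1 across the first hour
  return lerpColor(color(255, 30, 0), color(255, 220, 0), t);
}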

I took the same graph as above and added another set of columns extending below – here, the white bars show how many ‘friends’ the individual callers had, i.e. how many people they talked to regularly over the week:


If I sort this graph by number of friends (rather than total call time), we can see that the two measures (talkativeness and number of friends) don’t seem to be strongly correlated:

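Re-sorting like this is cheap once each user’s summary numbers have been computed – it’s just one comparator swapped for another. A sketch of the mechanism, with a made-up Caller class standing in for my user records:

//Each user's summary stats live in a little class like this (a stand-in)
class Caller {
  int totalSeconds; //total time spent on the phone
  int friends;      //number of people called regularly
}

//Sort by whichever measure we're currently interested in
void sortByFriends(ArrayList<Caller> callers) {
  java.util.Collections.sort(callers, new java.util.Comparator<Caller>() {
    public int compare(Caller a, Caller b) {
      return b.friends - a.friends; //descending
    }
  });
}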

It’s interesting to note, as well, that the data set includes linkage information – so I can also visualize who is calling whom within our group of individuals:


There is some interesting information to be dug up in here, but the extreme aspect ratio of the graph and the sheer density of detail make it not very usable – particularly for a magazine piece.

Ooh, and then Aaah.

The Infoporn section in Wired is a two-page spread; I always think of it as needing to serve two separate purposes for two different kinds of readers. First, it needs to be visually pleasing – I want people to say ‘Oooh…!’ when they turn the page to it. Once they’re hooked, though, I want them to learn something – the ‘Aaah!’ moment.

The data used in the graphs above seemed too complex to do anything truly revealing with – so perhaps it could be built into something sexy enough to draw an ‘Oooh!’ or two? In order to fit the long tails of these graphs onto the page, I wondered if I could add a bit of a curl to them. To make this structural change evident, I turned the graphs on a slight angle and rendered them in 3D. Here, we see five of these graphs, totaling about a million individual users, arranged into a single, tower-like shape:

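The mechanics of the curl are easier to show than to describe: step the coordinate system forward after each column and rotate it a tiny (and slowly growing) amount, and the tail of the graph wraps in on itself. Here’s a stripped-down 2D illustration with random stand-in heights – the real piece used the actual call histories and was rendered in 3D:

//Draw a long series of bars, turning slightly more after each one,
//so the tail curls inward. Column heights are random stand-ins.
size(800, 800);
background(0);
noStroke();
fill(255, 70);

translate(120, height/2);
for (int i = 0; i < 1500; i++) {
  float h = pow(random(1), 4) * 100; //a fake long-tail distribution
  rect(0, -h, 2, h);                 //one column, drawn upwards
  translate(2, 0);                   //step along...
  rotate(radians(0.1 + i * 0.001));  //...and turn a little more each time
}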

While these structures took a little while to render, I could quite easily generate a unique set of them, which I assembled as a line trailing off to the page edge on the left:


Getting Personal

So far, the visuals for this project tell only a part of the story: that our individual calling habits fall into predictable patterns when placed within the larger whole (some excellent text from Michael Dumiak helps clarify this in the final piece). There’s another crucial piece, though. Cell phone usage data is inherently locative, since our provider always knows which of its cell towers we are placing a call through.

This is where the fun starts – we can use this locative data to track the mobility patterns of individual people (it’s worth saying here that all of the data that I worked with was anonymized). To do this, I created a tool (again, in Processing) to make ‘mobility cubes’, which show a history of an individual’s movements over time:


The individual above, for example, travels around an area of less than a square kilometer over a period of just under three days. If we flatten this graph, we can see that this person travels mostly between two locations:


From the data, we can identify a lot of individuals like this – commuters – who travel short distances between two places (home, and work). We can also find travelers (people who cover a long distance in a short period of time):


And others who seem to follow more elaborate (but often still regular) mobility patterns:


We can assemble a ‘mobility cube’ for each individual in the database – and very quickly gain a mechanism for recognizing patterns amongst these people:

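The cube itself is a straightforward mapping: the ground plane holds the position of the tower that handled each call, and the vertical axis is time. Here’s a minimal version of the idea, with a faked two-location ‘commuter’ (like the ones described above) standing in for the real, anonymized records:

//A minimal 'mobility cube': each call becomes a point whose x/z come
//from the tower's ground position and whose height is the call's time.
int n = 300;
float[] px = new float[n];
float[] pz = new float[n];

void setup() {
  size(600, 600, P3D);
  for (int i = 0; i < n; i++) {
    boolean atHome = (i / 25) % 2 == 0; //switch locations every 25 calls
    px[i] = (atHome ? -80 : 90) + random(-10, 10);
    pz[i] = (atHome ? -60 : 70) + random(-10, 10);
  }
}

void draw() {
  background(0);
  translate(width/2, height/2);
  rotateY(frameCount * 0.01); //slowly spin the cube
  for (int i = 1; i < n; i++) {
    float y0 = map(i - 1, 0, n - 1, 150, -150); //time runs up the cube
    float y1 = map(i, 0, n - 1, 150, -150);
    stroke(255, 60);
    line(px[i - 1], y0, pz[i - 1], px[i], y1, pz[i]);
    stroke(255);
    point(px[i], y1, pz[i]);
  }
}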

Which brings us to the underlying point of the piece – we are all leaving digital trails behind us as we make our way through our individual lives. These trails are largely considered individual – even ethereal – yet technology is making them more visible and more readable every day.

Of course, to see the final piece – the polished assembly of some of the drafts and artifacts you’ve seen in this post – you’ll have to buy the magazine. Wired UK is available on newsstands in the UK, and to all of our clever subscribers.

If you want to read more about this – and you should – I’d highly recommend Albert-László Barabási’s Bursts, which goes into much more detail about human mobility & predictability.

Finally, huge thanks have to go out to László and his team at the lab, without whom this piece would have never made it to print!

Your Random Numbers – Getting Started with Processing and Data Visualization

Over the last year or so, I’ve spent almost as much time thinking about how to teach data visualization as I’ve spent working with data. I’ve been a teacher for 10 years – for better or for worse this means that as I learn new techniques and concepts, I’m usually thinking about pedagogy at the same time. Lately, I’ve also become convinced that this massive ‘open data’ movement that we are currently in the midst of is sorely lacking in educational components. The amount of available data, I think, is quickly outpacing our ability to use it in useful and novel ways. How can basic data visualization techniques be taught in an easy, engaging manner?

This post, then, is a first sketch of what a lesson plan for teaching Processing and data visualization might look like. I’m going to start from scratch, work through some examples, and (hopefully) make some interesting stuff. One of the nice things about this process, I think, is that we’re starting with fresh, new data – I’m not sure what we’re going to find once we get our hands dirty. This is what is really exciting about data visualization: the chance to find answers to your own, possibly novel, questions.

Let’s Start With the Data

We’re not going to work with an old, dusty data set here. Nor are we going to attempt to bash our heads against an unnecessarily complex pile of numbers. Instead, we’re going to start with a data set that I made up – with the help of a couple of hundred of my Twitter followers. Yesterday morning, I posted this request:

Even on a Saturday, a lot of helpful folks pitched in, and I ended up with about 225 numbers. And so, we have the easiest possible dataset to work with – a single list of whole numbers. I’m hoping that, as well as being simple, this dataset will turn out to be quite interesting – maybe telling us something about how the human brain thinks about numbers.

I wrote a quick Processing sketch to scrape out the numbers from the post, and then to put them into a Google Spreadsheet. You can see the whole dataset here: http://spreadsheets.google.com/pub?key=t6mq_WLV5c5uj6mUNSryBIA&output=html
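
The scraping sketch was throwaway code, but the heart of it – pulling the first whole number out of each tweet – is a one-regex job. Something like this, assuming the tweets have already been saved one per line to a local file (the file name here is my own stand-in):

//Pull the first run of digits out of each saved tweet
//"tweets.txt" (one tweet per line) is a stand-in file name
String[] tweets = loadStrings("tweets.txt");
int[] numbers = new int[tweets.length];
int found = 0;
for (int i = 0; i < tweets.length; i++) {
  String[] m = match(tweets[i], "(\\d+)");
  if (m != null) {
    numbers[found++] = int(m[1]);
  }
}
numbers = subset(numbers, 0, found); //trim to the numbers actually found
println(found + " numbers collected");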

I chose to start from a Google Spreadsheet in this tutorial, because I wanted people to be able to generate their own datasets to work with. Teachers – you can set up a spreadsheet of your own, and get your students to collect numbers by any means you’d like. The ‘User’ and ‘Tweet’ columns are not necessary; you just need to have a column called ‘Number’.

It’s about time to get down to some coding. The only tricky part in this whole process will be connecting to the Google Spreadsheet. Rather than bog down the tutorial with a lot of confusing semi-advanced code, I’ll let you download this sample sketch which has the Google Spreadsheet machinery in place.

Got it? Great. Open that sketch in Processing, and let’s get started. Just to make sure we’re all in the same place, you should see a screen that looks like this:

At the top of the sketch, you’ll see three String values that you can change. You’ll definitely have to enter your own Google username and password. If you have your own spreadsheet of number data, you can enter the key for your spreadsheet as well. You can find the key right in the URL of any spreadsheet – in the dataset URL above, for example, it’s the long string after ‘key=’.

The first thing we’ll do is change the size of our sketch to give us some room to move, set the background color, and turn on smoothing to make things pretty. We do all of this in the setup enclosure:

void setup() {
  //This code happens once, right when our sketch is launched
 size(800,800);
 background(0);
 smooth();
};

Now we need to get our data from the spreadsheet. One of the advantages of accessing the data from a shared remote file is that the remote data can change and we don’t have to worry about replacing files or changing our code.

We’re going to ask for a list of the ‘random’ numbers that are stored in the spreadsheet. The easiest way to store lists of things in Processing is in an Array. In this case, we’re looking for an array of whole numbers – integers. I’ve written a function that gets an integer array from Google – you can take a look at the code on the ‘GoogleCode’ tab if you’d like to see how that is done. What we need to know here is that this function – called getNumbers – will return, or send us back, a list of whole numbers. Let’s ask for that list:

void setup() {
  //This code happens once, right when our sketch is launched
 size(800,800);
 background(0);
 smooth();

 //Ask for the list of numbers
 int[] numbers = getNumbers();
};

OK.

World’s easiest data visualization!

 fill(255,40);
 noStroke();
 for (int i = 0; i < numbers.length; i++) {
   ellipse(numbers[i] * 8, height/2, 8,8);
 };

What this does is draw a row of dots across the screen, one for each number that occurs in our Google list. The dots are drawn with a low alpha (40/255, or about 16%), so when numbers are picked more than once, they get brighter. The result is a strip of dots across the screen that looks like this:

Right away, we can see a couple of things about the distribution of our ‘random’ numbers. First, there are two or three very bright spots where numbers get picked several times. Also, there are some pretty evident gaps (one right in the middle) where certain numbers don’t get picked at all.

This could be normal though, right? To see if this distribution is typical, let’s draw a line of ‘real’ random numbers below our line, and see if we can notice a difference:

fill(255,40);
 noStroke();
 //Our line of Google numbers
 for (int i = 0; i < numbers.length; i++) {
   ellipse(numbers[i] * 8, height/2, 8,8);
 };
 //A line of random numbers
 for (int i = 0; i < numbers.length; i++) {
   ellipse(ceil(random(0,99)) * 8, height/2 + 20, 8,8);
 };

Now we see the two compared:

The bottom, random line doesn’t seem to have as many bright spots or as many evident gaps as our human-picked line. Still, the difference isn’t that obvious. Can you tell right away which line is ours from the group below?

OK. I’ll admit it – I was hoping that the human-picked number set would be more obviously divergent from the sets of numbers that were generated by a computer. It’s possible that humans are better at picking random numbers than I had thought. Or maybe our sample set is too small to show any kind of real difference. It’s also possible that this quick visualization method isn’t doing the trick. Let’s stay on the track of number distribution for a few minutes and see if we can find out any more.

Our system of dots was easy, and readable, but not very useful for empirical comparisons. For the next step, let’s stick with the classics and

Build a bar graph.

Right now, we have a list of numbers. Ours range from 1-99, but let’s imagine for a second that we had a set of numbers that ranged from 0-10:

[5,8,5,2,4,1,6,3,9,0,1,3,5,7]

What we need to build a bar graph for these numbers is a list of counts – how many times each number occurs:

[1,2,1,2,1,3,1,1,1,1]

We can look at the list above and see that there were two 1s and three 5s.

Let’s do the same thing with our big list of numbers – we’re going to generate a list that holds the counts for each of the possible numbers in our set (we’ll make it 100 entries long, so that we can use the numbers themselves as indexes). But, we’re going to be a bit smarter about it this time around and package our code into a function – so that we can use it again and again without having to re-write it. In this case the function will (eventually) draw a bar graph – so we’ll call it (cleverly) barGraph:

void barGraph( int[] nums ) {
  //Make a list of number counts
 int[] counts = new int[100];
 //Fill it with zeros
 for (int i = 0; i < 100; i++) {
   counts[i] = 0;
 };
 //Tally the counts
 for (int i = 0; i < nums.length; i++) {
   counts[nums[i]] ++;
 };
};

This function constructs an array of counts from whatever list of numbers we pass into it (that list is a list of integers, and we refer to it within the function as ‘nums’, a name which I made up). Now, let’s add the code to draw the graph (I’ve added another parameter to go along with the numbers – the y position of the graph):


void barGraph(int[] nums, float y) {
  //Make a list of number counts
 int[] counts = new int[100];
 //Fill it with zeros
 for (int i = 0; i < 100; i++) {
   counts[i] = 0;
 };
 //Tally the counts
 for (int i = 0; i < nums.length; i++) {
   counts[nums[i]] ++;
 };

 //Draw the bar graph
 for (int i = 0; i < counts.length; i++) {
   rect(i * 8, y, 8, -counts[i] * 10);
 };
};

We’ve added a function – a set of instructions – to our file, which we can use to draw a bar graph from a set of numbers. To actually draw the graph, we need to call the function, which we can do in the setup enclosure. Here’s the code, all together:


/*

 #myrandomnumber Tutorial
 blprnt@blprnt.com
 April, 2010

 */

//This is the Google spreadsheet manager and the key of the spreadsheet that we want to read, along with our Google username & password
SimpleSpreadsheetManager sm;
String sUrl = "t6mq_WLV5c5uj6mUNSryBIA";
String googleUser = "YOUR USERNAME";
String googlePass = "YOUR PASSWORD";

void setup() {
  //This code happens once, right when our sketch is launched
 size(800,800);
 background(0);
 smooth();

 //Ask for the list of numbers
 int[] numbers = getNumbers();
 //Draw the graph
 barGraph(numbers, 400);
};

void barGraph(int[] nums, float y) {
  //Make a list of number counts
 int[] counts = new int[100];
 //Fill it with zeros
 for (int i = 0; i < 100; i++) {
   counts[i] = 0;
 };
 //Tally the counts
 for (int i = 0; i < nums.length; i++) {
   counts[nums[i]] ++;
 };

 //Draw the bar graph
 for (int i = 0; i < counts.length; i++) {
   rect(i * 8, y, 8, -counts[i] * 10);
 };
};

void draw() {
  //This code happens once every frame.
};

If you run your code, you should get a nice minimal bar graph which looks like this:

We can help distinguish the very high values (and the very low ones) by adding some color to the graph. In Processing’s standard RGB color mode, we can change one of our color channels (in this case, green) with our count values to give the bars some differentiation:


 //Draw the bar graph
 for (int i = 0; i < counts.length; i++) {
   fill(255, counts[i] * 30, 0);
   rect(i * 8, y, 8, -counts[i] * 10);
 };

Which gives us this:

Or, we could switch to Hue/Saturation/Brightness mode, and use our count values to cycle through the available hues:

//Draw the bar graph
 for (int i = 0; i < counts.length; i++) {
   colorMode(HSB);
   fill(counts[i] * 30, 255, 255);
   rect(i * 8, y, 8, -counts[i] * 10);
 };

Which gives us this graph:

Now would be a good time to do some comparisons to a real random sample again, to see if the new coloring makes a difference. Because we defined our bar graph instructions as a function, we can do this fairly easily (I built in an easy function to generate a random list of integers called getRandomNumbers – you can see the code on the ‘GoogleCode’ tab):

void setup() {
  //This code happens once, right when our sketch is launched
 size(800,800);
 background(0);
 smooth();

 //Ask for the list of numbers
 int[] numbers = getNumbers();
 //Draw the graph
 barGraph(numbers, 100);

 for (int i = 1; i < 7; i++) {
   int[] randoms = getRandomNumbers(225);
   barGraph(randoms, 100 + (i * 130));
 };
};

I know, I know. Bar graphs. Yay. Looking at the graphic above, though, we can see more clearly that our humanoid number set is unlike the machine-generated sets. However, I actually think that the color is more valuable than the dimensions of the bars. Since we’re dealing with 99 numbers, maybe we can display these colours in a grid and see if any patterns emerge? A really important thing to be able to do with data visualization is to

Look at datasets from multiple angles.

Let’s see if the grid gets us anywhere. Luckily, a function to make a grid is pretty much the same as the one to make a graph (I’m adding two more parameters – an x position for the grid, and a size for the individual blocks):

void colorGrid(int[] nums, float x, float y, float s) {
 //Make a list of number counts
 int[] counts = new int[100];
 //Fill it with zeros
 for (int i = 0; i < 100; i++) {
   counts[i] = 0;
 };
 //Tally the counts
 for (int i = 0; i < nums.length; i++) {
   counts[nums[i]] ++;
 };

//Move the drawing coordinates to the x,y position specified in the parameters
 pushMatrix();
 translate(x,y);
 //Draw the grid
 for (int i = 0; i < counts.length; i++) {
   colorMode(HSB);
   fill(counts[i] * 30, 255, 255, counts[i] * 30);
   rect((i % 10) * s, floor(i/10) * s, s, s);

 };
 popMatrix();
};

We can now do this to draw a nice big grid:

 //Ask for the list of numbers
 int[] numbers = getNumbers();
 //Draw the graph
 colorGrid(numbers, 50, 50, 70);

I can see some definite patterns in this grid – so let’s bring the actual numbers back into play so that we can talk about what seems to be going on. Here’s the full code, one last time:


/*

 #myrandomnumber Tutorial
 blprnt@blprnt.com
 April, 2010

 */

//This is the Google spreadsheet manager and the key of the spreadsheet that we want to read, along with our Google username & password
SimpleSpreadsheetManager sm;
String sUrl = "t6mq_WLV5c5uj6mUNSryBIA";
String googleUser = "YOUR USERNAME";
String googlePass = "YOUR PASSWORD";

//This is the font object
PFont label;

void setup() {
  //This code happens once, right when our sketch is launched
 size(800,800);
 background(0);
 smooth();

 //Create the font object to make text with
 label = createFont("Helvetica", 24);

 //Ask for the list of numbers
 int[] numbers = getNumbers();
 //Draw the graph
 colorGrid(numbers, 50, 50, 70);
};

void barGraph(int[] nums, float y) {
  //Make a list of number counts
 int[] counts = new int[100];
 //Fill it with zeros
 for (int i = 0; i < 100; i++) {
   counts[i] = 0;
 };
 //Tally the counts
 for (int i = 0; i < nums.length; i++) {
   counts[nums[i]] ++;
 };

 //Draw the bar graph
 for (int i = 0; i < counts.length; i++) {
   colorMode(HSB);
   fill(counts[i] * 30, 255, 255);
   rect(i * 8, y, 8, -counts[i] * 10);
 };
};

void colorGrid(int[] nums, float x, float y, float s) {
 //Make a list of number counts
 int[] counts = new int[100];
 //Fill it with zeros
 for (int i = 0; i < 100; i++) {
   counts[i] = 0;
 };
 //Tally the counts
 for (int i = 0; i < nums.length; i++) {
   counts[nums[i]] ++;
 };

 pushMatrix();
 translate(x,y);
 //Draw the grid
 for (int i = 0; i < counts.length; i++) {
   colorMode(HSB);
   fill(counts[i] * 30, 255, 255, counts[i] * 30);
   textAlign(CENTER);
   textFont(label);
   textSize(s/2);
   text(i, (i % 10) * s, floor(i/10) * s);
 };
 popMatrix();
};

void draw() {
  //This code happens once every frame.

};

And, our nice looking number grid:

BINGO!

No, really. If this were a bingo card, and I were a 70-year-old, I’d be rich. Look at that nice line going down the X7 column – 17, 27, 37, 47, 57, 67, 77, 87, and 97 are all appearing with good frequency. If we rule out the Douglas Adams effect on 42, it is likely that most of the top 10 most frequently occurring numbers would have a 7 on the end. Do numbers ending in 7 ‘feel’ more random to us? Or is there something about the number 7 that we just plain like?

In contrast, if I had played the X0 column, I’d be out of luck. It seems that numbers ending with a zero don’t feel very random to us at all. This could also explain the black hole around the number 50 – which, in a range from 0 to 100, appears to be the ‘least random’ of all.

Well, there we have it. A start-to-finish example of how we can use Processing to visualize simple data, with the goal of exposing underlying patterns and anomalies. The techniques we used in this project were fairly simple – but they are useful tools that can be applied in a huge variety of data situations (I use them myself, all the time).

Hopefully this tutorial is (was?) useful for some of you. And, if there are any teachers out there who would like to try this out with their classrooms, I’d love to hear how it goes.

The Missing Piece of the OpenData / OpenGov Puzzle: Education

Yesterday, I tweeted a quick thought that I had, while walking the dog:


A few people asked me to expand on this, so let’s give it a try:

We are facing a very different data-related problem today than we were facing only a few years ago. Back then, the call was solely for more information. Since then, corporations and governments have started to answer this call and the result has been a flood of data of all shapes and sizes. While it’s important to remain on track with the goal of making data available, we are now faced with a parallel and perhaps more perplexing problem: What do we do with it all?

Of course, an industry has developed around all of this data; start-ups around the world are coming up with new ideas and data-related products every day. At the same time, open-sourcers are releasing helpful tools and clever apps by the dozen. Still, in large part, these groups are looking at the data with fiscal utility in mind. It seems to me that if we are going to make the most of this information resource, it’s important to bring more people in on the game – and to do that requires education.

At the post-secondary level, efforts should be made to educate academics for whom this new pile of data could be useful: journalists, social scientists, historians, contemporary artists, archivists, etc. I could imagine cross-disciplinary workshops teaching the basics:

  1. A survey of what kind of data is available, and how to find it.
  2. A brief overview of common data formats (CSV, JSON, XML, etc).
  3. An introduction to user-friendly exploration tools like ManyEyes & Tableau.
  4. A primer in Processing and how it can be used to quickly prototype and build specialized visualization tools.

The last step seems particularly important to me, as it encourages people to think about new ways to engage with information. In many cases, datasets that are becoming available are novel in their content, structure, and complexity – encouraging innovation in an academic framework is essential. Yes, we do need to teach people how to make bar graphs and scatter charts; but let’s also facilitate exploration and experimentation.

Why workshops? While this type of teaching could certainly be done through tutorials, or with a well-written textbook, it’s my experience that teaching these subjects is much more effective one-on-one. This is particularly true for students who come at data from a non-scientific perspective (and these people are the ones we need the most).

The long-term goal of such an initiative would be to increase data literacy. In a perfect world, this would occur even earlier – at the high school level. Here’s where I put on my utopian hat: teaching data literacy to young people would mean that they could find answers to their own questions, rather than waiting for the media to answer those questions for them. It also teaches them, in a practical way, about transparency and accountability in government. The education system is already producing a generation of bloggers and citizen journalists – let’s make sure they have the skills they need to be dangerous. Veering a bit to the right, these are hugely valuable skills for workers in an ‘idea economy’ – a nation with a data-literate workforce is a force to be reckoned with.

Ideally, this educational component would be built into government projects like data.gov or data.hmg.gov.uk (are you listening, Canada?). More than that, it would be woven into the education mandate of governments at the federal and local levels. Of course, I’m not holding my breath.

Instead, I’ve started to plan a bit of a project for the summer. Last year, I taught a series of workshops at my studio in Vancouver that were open to people of all skill levels. This year, I’m going to extend my reach a bit and offer a couple of free, online presentations covering some of the things that I’ve talked about in this post. One of these workshops will be specifically targeted to youth. At the same time, I’ll be publishing course outlines and sample materials for my sessions so that others can host similar events.

Stay tuned for details – and if you have any questions or would like to lend a hand, feel free to leave a comment or get in touch.