
WordPlay: A Tool for Freeform Language Exploration

When text becomes data it opens up a phenomenal amount of possibility for insight and creative exploration. The problem is that most Natural Language Processing (NLP) tools are hard to use unless you have a good foundation in programming to begin with. We use a lot of NLP in our work at The Office for Creative Research and I’ve often wondered what it would mean to make a language tool designed for open-ended exploration. I’ve also thought a lot about the role a tool like this could play in the classroom. English teachers could use it to explore both basic language (what is an adverb?) and more complex questions (how did Shakespeare use hyphenates?). History classrooms could examine patterns in language over time (how does Twitter compare to Moby Dick?). Because the tool would be open source and would have an API, Computer Science instructors could use it as a basis to teach about computation and language.

With all of this in mind I built WordPlay, a freeform NLP tool that lets you search across various bodies of text without having to know how to code or to learn any kind of strange syntax.

It’s really easy to use. Whatever you type into the query box will be used as a pattern to search across the current corpus. For example, we might search for text similar to ‘a terrible fate’ across Shakespeare’s collected works:

[Screenshot: search results for ‘a terrible fate’ in Shakespeare’s collected works]

Here is the same search, performed across the Bible:

[Screenshot: search results for ‘a terrible fate’ in the Bible]

And within Sigmund Freud’s A General Introduction to Psychoanalysis:

[Screenshot: search results for ‘a terrible fate’ in A General Introduction to Psychoanalysis]

The system works by trying to find a match to the query text in two different ways:

First, it tries to match to part-of-speech. ‘A terrible fate’ contains a determiner, followed by an adjective, followed by a noun, so the best results will also have these parts of speech, in the same order.

Next, it attempts to match how the query sounds. Specifically, it considers the stressing pattern of the word or phrase: in the case of ‘a terrible fate’ it’ll look for other phrases with the same syllable counts (1-3-1) and the same stressing pattern.
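To make the two strategies concrete, here is a minimal sketch in Python. It is not WordPlay's actual code: the tiny `LEXICON`, the tag names, the stress notation (1 for a stressed syllable, 0 for an unstressed one) and the equal weighting in `score` are all assumptions made for illustration. A real tool would presumably draw on a tagger and a pronouncing dictionary instead.

```python
# Hypothetical mini-lexicon: word -> (part of speech, stress pattern).
# One digit per syllable: 1 = stressed, 0 = unstressed.
LEXICON = {
    "a": ("DET", "0"),
    "the": ("DET", "0"),
    "terrible": ("ADJ", "100"),
    "horrible": ("ADJ", "100"),
    "gentle": ("ADJ", "10"),
    "fate": ("NOUN", "1"),
    "night": ("NOUN", "1"),
    "sleeps": ("VERB", "1"),
}


def profile(phrase):
    """Return (part-of-speech tags, stress patterns) for a phrase.

    Raises KeyError for words missing from the toy lexicon.
    """
    words = phrase.lower().split()
    tags = [LEXICON[w][0] for w in words]
    stresses = [LEXICON[w][1] for w in words]
    return tags, stresses


def score(query, candidate):
    """Score a candidate from 0.0 to 1.0 against the query.

    Averages three position-by-position checks: same part of speech,
    same syllable count, same stress pattern. Equal weights are an
    assumption, not something the original tool is known to use.
    """
    q_tags, q_stress = profile(query)
    c_tags, c_stress = profile(candidate)
    if len(q_tags) != len(c_tags):
        return 0.0
    n = len(q_tags)
    pos = sum(a == b for a, b in zip(q_tags, c_tags)) / n
    syllables = sum(len(a) == len(b) for a, b in zip(q_stress, c_stress)) / n
    stress = sum(a == b for a, b in zip(q_stress, c_stress)) / n
    return (pos + syllables + stress) / 3


def rank(query, candidates):
    """Sort candidates from best match to worst."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


candidates = ["the night sleeps", "a gentle night", "the horrible night"]
ranked = rank("a terrible fate", candidates)
```

With this toy data, ‘the horrible night’ ranks first because it shares the determiner-adjective-noun order, the 1-3-1 syllable counts and the stress pattern of the query.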

The result of these two strategies is that the top search results in WordPlay should really sound like the query. And it works: go ahead and sing the results to the Shakespearean Danger Zone (Kenny Loggins, eat your heart out!).

WordPlay is meant to be a simple tool, so I’ll end the explanation here. Go play with it.


On Data and Performance

Data live utilitarian lives. From the moment they are conceived, as measurements of some thing or system or person, they are conscripted to the cause of being useful. They are fed into algorithms, clustered and merged, mapped and reduced. They are graphed and charted, plotted and visualized. A rare datum might find itself turned into sound, or, more seldom, manifested as a physical object. Always, though, the measure of the life of data is in its utility. Data that are collected but not used are condemned to a quiet life in a database. They dwell in obscure tables, are quickly discarded, or worse (cue violin) – labelled as ‘exhaust’.

Perhaps this isn’t the only role for a datum? To be operated on? To be useful?

Over the last couple of years, my collaborators Ben Rubin and Mark Hansen and I have been investigating the possibility of using data as a medium for performance. Here, data becomes the script, or the score, and in turn technologies that we typically think of as tools become instruments, and in some cases performers.

The most recent manifestation of these explorations is a performance called A Thousand Exhausted Things, which we recently staged at The Museum of Modern Art, with the experimental theater group Elevator Repair Service. In this performance, the script is MoMA’s collections database, an eighty-year-old, 120k-object-strong archive. The instruments are a variety of custom-written natural language processing algorithms, which are used to turn the text of the database (largely the titles of artworks) into a performable form.

The first version of the performance itself is 15 minutes long. During this entire period, all of the dialogue that is spoken by the actors is either a complete title of an artwork, or a name of an artist. A data visualization, projected above the performers, shows the objects as abstracted forms as each artwork is mentioned.

By using such a non-conventional form to engage with the collections database, we’re asking the audience to think of the database as not just a myriad of rows and columns, but as a cultural artifact. The collection is shown as not only a record of the museum’s history, but of changing trends in contemporary art. It also gives the artworks themselves a way to engage with one another in a fashion which is outside the usual curatorial limitations.

These are the first nineteen lines of the performance:

Gainsboro’ Girl
Young Girl, Back Turned
Girl with a Mandolin (Fanny Tellier)
Interior with a Young Girl (Girl Reading)
Head and Bust of a Woman, Three-Quarters to Left
Head of a Sleeping Woman (Study for Nude with Drapery)
Sleeping Girl
Young Girl with Braids
Young Girl with Long Hair
Françoise with Long Neck. I, IV
Tableau I: Lozenge with Four Lines and Gray
Spanish Girl
Another Girl Another Planet
Designs for an Overpopulated Planet: Foragers

Here we’ve used an algorithm which seeks to build a ‘chain’ of like-sounding titles from the database. The algorithm attempts to make the chain longer and longer, until it can’t find a suitable title, in which case it returns to the seed word (in this case ‘Girl’). It’s a linguistic game, but it serves to curate a selection of works which may not normally be placed side by side. Jacques Villon’s 1908 etching ‘Young Girl, Back Turned’ leads us to Picasso’s ‘Girl with a Mandolin (Fanny Tellier)’, from 1910. John Candelero’s photograph ‘Spanish Girl’ calls out Michael Almereyda’s film ‘Another Girl Another Planet’.
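The chaining idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm used in the performance: the `similarity` function compares spelling with the standard library's `difflib` as a stand-in for ‘like-sounding’, and the `threshold` and `max_len` parameters are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Spelling similarity from 0.0 to 1.0; a stand-in for sound-alikeness."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_chain(titles, seed, threshold=0.3, max_len=20):
    """Greedily chain titles, each one chosen to resemble the previous one.

    When no unused title is similar enough to the current one, fall back
    to the seed word and carry on from there. Stop when even the seed
    word finds nothing, when titles run out, or at max_len.
    """
    remaining = list(titles)
    chain = []
    current = seed
    while remaining and len(chain) < max_len:
        best = max(remaining, key=lambda t: similarity(current, t))
        if similarity(current, best) < threshold:
            if current == seed:
                break  # nothing left that resembles the seed either
            current = seed  # dead end: return to the seed word
            continue
        chain.append(best)
        remaining.remove(best)
        current = best
    return chain


titles = [
    "Sleeping Girl",
    "Young Girl with Braids",
    "Young Girl with Long Hair",
    "Tableau I: Lozenge with Four Lines and Gray",
    "Spanish Girl",
    "Another Girl Another Planet",
]
chain = build_chain(titles, seed="Girl")
```

Each title is used at most once, so the chain can never be longer than the list it draws from. Swapping `similarity` for a phonetic comparison would bring the sketch closer to the description above.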

Perhaps the most exciting part about performance as a medium for data is that it allows for a fluid interpretation at the time of the performance itself. In this case, the skilled actors of Elevator Repair Service turn a dry algorithmic output into a wry dialogue of one-upmanship, allowing the artworks themselves to become pieces in an imagined language game. The possibilities for interpretation are magnified as the relationship moves from data => viewer to data => performer => viewer.

Later in A Thousand Exhausted Things an actor reads, in order, the most frequently occurring first names of artists in the MoMA collection (you can watch the video below). The first 41 of them are men’s names. John leads to Robert and David, through Max and Otto, all the way to Bruce & Carl before we hear from our first woman (Mary). While you might be able to imagine a data visualization which would show this gender imbalance more clearly (some would probably argue for a simple list), it’s difficult to conceive of a print or screen-based form delivering the message with similar impact.
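The list the actor reads is, computationally, about as simple as data processing gets. A minimal sketch, assuming artist names are stored as ‘First Last’ strings; the names below are made up for illustration and are not taken from the MoMA database.

```python
from collections import Counter

# Made-up artist names for illustration only.
ARTISTS = [
    "John Alder",
    "John Birch",
    "John Cedar",
    "Robert Dale",
    "Robert Elm",
    "Mary Fir",
]


def first_names_by_frequency(names):
    """Return first names ordered from most to least frequent."""
    counts = Counter(name.split()[0] for name in names)
    return [first for first, _ in counts.most_common()]


reading_order = first_names_by_frequency(ARTISTS)
# Position in the reading at which a given name is first heard (0-based).
position_of_mary = reading_order.index("Mary")
```

The computation is trivial; the impact comes from hearing the result read aloud, name after name.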

We are not the only ones exploring the possibilities of data and performance. Providence-based artist Brian House has composed and performed several musical works based on data, including ‘YOU’LL JUST HAVE TO TAKE MY WORD FOR IT’, a piece for a small ensemble (two electric guitars and a tenor saxophone) which interprets black box data from Massachusetts Lieutenant Governor Tim Murray’s infamous car crash. Sculptor Nathalie Miebach’s ‘sculptural musical scores’ are physical objects, representing weather data, which are meant to be performed by musicians (other pieces by Miebach are designed to be mounted to the body). In House’s and in Miebach’s work, we see data breaking out of its accepted formal restrictions. By forcing us out of our usual framework, this work offers a new lens into event and experience, vastly different from what we would expect in a so-called ‘data representation’.

As data exerts more and more influence on our lived experience, it is important that artists find ways to work with it outside of decades-old visual means like charts and graphs. Performance provides rich terrain for engagement with data, and perhaps allows for a new paradigm in which data are not as much operated on as they are allowed to operate on us.

Before Us is the Salesman’s House

When the dust settles on the 21st century, and all of the GIFs have finished animating, the most important cultural artifacts left from the digital age may very well be databases.

How will the societies of the future read these colossal stores of information?

Consider the eBay databases, which contain information for every transaction that happens and has happened on the world’s biggest marketplace. $2,094 worth of goods are sold on eBay every second. The records kept about this buying and selling go far beyond dollars and cents. Time, location and identity come together with text and images to leave a record that documents both individual events and collective trends across history and geography.

This summer, Mark Hansen and I created an artwork, installed at the eBay headquarters in San Jose, which investigates this idea of the eBay database as a cultural artifact. Working in cooperation with eBay, Inc., and the ZERO1 Biennial, the piece was installed outside of the eBay headquarters and ran dusk to midnight from September 11th to October 12th.

As a conceptual foundation for the piece, we chose a much more traditional creative form than the database: the novel. Each movement begins with a selection of text. The first one every day was a stage direction from the beginning of Death of a Salesman which reads:

A melody is heard, played upon a flute. It is small and fine, telling of grass and trees and the horizon. The curtain rises.
Before us is the Salesman’s house. We are aware of towering, angular shapes behind it, surrounding it on all sides. Only the blue light of the sky falls upon the house and forestage; the surrounding area shows an angry glow of orange. As more light appears, we see a solid vault of apartment houses around the small, fragile-seeming home. An air of the dream clings to the place, a dream rising out of reality. The kitchen at center seems actual enough, for there is a kitchen table with three chairs, and a refrigerator. But no other fixtures are seen. At the back of the kitchen there is a draped entrance, which leads to the living room. To the right of the kitchen, on a level raised two feet, is a bedroom furnished only with a brass bedstead and a straight chair. On a shelf over the bed a silver athletic trophy stands. A window opens onto the apartment house at the side.

From this text, we begin by extracting items [1] that might be bought on eBay:

Flute, grass, trees, curtain, table, chairs, refrigerator. This list serves now as a kind of inventory, each explored in a small set of data sketches which examine distribution: Where are these objects being sold right now? How much are they being sold for? What does the aggregate of all of the refrigerators sold in the USA look like?
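As a rough illustration of the extraction step, here is a minimal sketch in Python. It assumes a hand-made vocabulary of sellable things (the hypothetical `SELLABLE` set) and simple word matching; the piece itself combined a text-analysis algorithm with human review, which this sketch does not attempt to reproduce.

```python
import re

# Hypothetical vocabulary of things that could plausibly be listed for sale.
SELLABLE = {"flute", "grass", "trees", "curtain", "table", "chairs", "refrigerator"}


def extract_items(text, vocabulary=SELLABLE):
    """Return vocabulary words found in the text, in order of first appearance."""
    found = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in vocabulary and word not in found:
            found.append(word)
    return found


passage = (
    "A melody is heard, played upon a flute. It is small and fine, "
    "telling of grass and trees and the horizon. The curtain rises."
)
items = extract_items(passage)
```

Run on the opening lines of the stage direction, this yields flute, grass, trees and curtain, the start of the inventory listed above.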

From this map of objects for sale, the program selects one at random to act as a seed. For example, a refrigerator being sold for $695 in Milford, New Hampshire, will switch the focus of the piece to this town of fifteen thousand on the Souhegan River. The residents of Milford have sold many things on eBay over the years – but what about books? Using historical data, we investigate the flow of books into the town, both sold and bought by residents.

Finally, the program selects a book from this list [2] and re-starts the cycle, this time with a new extracted passage, new objects, new locations, and new stories. Over the course of an evening, about a hundred cycles are completed, visualizing thousands of current and historic exchanges of objects.
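The shape of one evening's loop can be sketched as follows. Everything here is a hypothetical stand-in: the `PASSAGES`, `LISTINGS` and `BOOKS_BY_TOWN` dictionaries replace the book texts and the eBay data, and the book and town names are placeholders. Only the order of the steps (passage, items, random listing, town, book, new passage) follows the description above.

```python
import random

# Hypothetical stand-ins for the real data sources.
PASSAGES = {
    "Book A": "a flute on the table beside the refrigerator",
    "Book B": "two chairs and a curtain",
}
LISTINGS = {  # item -> [(town, price in dollars)]
    "flute": [("Town X", 40)],
    "table": [("Town Y", 120)],
    "refrigerator": [("Town X", 695)],
    "chairs": [("Town Y", 60)],
    "curtain": [("Town X", 15)],
}
BOOKS_BY_TOWN = {"Town X": ["Book B"], "Town Y": ["Book A", "Book B"]}


def run_cycles(start_book, cycles, rng):
    """Run the passage -> items -> listing -> town -> book loop."""
    log = []
    book = start_book
    for _ in range(cycles):
        # 1. Extract the items mentioned in the current passage.
        items = [w for w in PASSAGES[book].split() if w in LISTINGS]
        # 2. Gather every listing for those items and pick one at random.
        listings = [
            (item, town, price)
            for item in items
            for town, price in LISTINGS[item]
        ]
        item, town, price = rng.choice(listings)
        # 3. The listing's town becomes the focus; pick a book traded there.
        next_book = rng.choice(BOOKS_BY_TOWN[town])
        log.append((book, item, town, next_book))
        # 4. Restart the cycle with a passage from the new book.
        book = next_book
    return log


log = run_cycles("Book A", cycles=5, rng=random.Random(0))
```

Passing in a seeded `random.Random` keeps a run repeatable, which is handy when testing a piece that is otherwise meant to wander.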

Ultimately, the size of a database like eBay’s makes a complete, close reading impossible – at least for humans. Rather than an exhaustive tour of the data, then, our piece can be thought of as a distant reading [3], a kind of fly-over of this rich data landscape. It is an aerial view of the cultural artifact that is eBay.

A motion sample of three movements from the piece can be seen in this video.

Before Us is the Salesman’s House was projected on a 30′ x 20′ semi-transparent screen, suspended in the entry way to the main building (I’m afraid lighting conditions were far from ideal for photography). It was built using Processing 2.0, MongoDB & Python. Special thanks to Jaime Austin, Phoram Meta, Jagdish Rishayur, David Szlasa and Sean Riley.

  1. Items are extracted through a combination of a text-analysis algorithm and, where needed, processing by helpful folks on Mechanical Turk.
  2. All text used comes from Project Gutenberg, a database of more than 40,000 free eBooks.
  3. For more about distant reading, read this essay by Franco Moretti, or, for a summary, this article from the NYTimes.

Data in an Alien Context: Kepler Visualization Source Code

Last year, I released a video visualization of the 1236 exoplanets identified by NASA’s Kepler mission. Since then, there have been another 1091 candidates identified, and I thought it’d be a good time to update my visualization – and release the source code.

So, here it is:

I’ve tried to comment the code as well as possible – and the sketch overall is fairly simple. You will, of course, need Processing to get it running, as well as Karsten Schmidt’s essential toxiclibs.

Your Device: Your data. How to save your iPhone location data (and help researchers make the world a better place)

An hour ago, Apple announced that it has released a patch for iOS and iTunes, which reduces the size of the location cache stored on your machine, and prevents an automatic back-up through iTunes.

Good news, right?

I don’t think so. Apple is still collecting this data, still getting this data from you, and still using it. The only difference is that you can’t use your own data.

Location data is extremely useful. That’s why Apple, Google, and Microsoft are collecting it. Over the last year, Apple has, intentionally or not, created what is likely the largest locational database ever. This is a hugely, massively, ridiculously useful database. And with this new update, Apple are the only ones who will be able to get their hands on it. I believe that our data should be… well, our data. We should be able to store it securely, explore it, and use it for any purposes that we might choose. This data would be extraordinarily useful for researchers – people studying how diseases spread, trying to solve traffic-flow problems, and researching human mobility.

With all of this in mind, some colleagues and I have been working on a project for the last week called It lets you upload your location data from your iDevice, securely store it, explore it via a map interface, and we’ll eventually offer you a system to directly donate your data to well-deserving research projects.

We’re pushing this project out quickly in hopes that we can gather as many location files as we can before people upgrade iOS and iTunes.

Visit now to upload, explore, and securely store your iDevice location data.

We live in a world where data is being collected about us on a massive scale. This data is currently being stored, analyzed and monetized by corporations – there is little or no agency for the people to whom the data actually belongs. I believe that grass-roots initiatives like can provide a framework for how data sovereignty can be established and managed.

In the short term, I am hoping we can collect and store enough locational data to be of use to researchers. So please, before you upgrade iOS and iTunes, visit and make your own data your own data. And please (please) – pass this on.