All posts by Jer

WordPlay: A Tool for Freeform Language Exploration

When text becomes data it opens up a phenomenal amount of possibility for insight and creative exploration. The problem is that most Natural Language Processing (NLP) tools are hard to use unless you have a good foundation in programming to begin with. We use a lot of NLP in our work at The Office for Creative Research and I’ve often wondered what it would mean to make a language tool designed for open-ended exploration. I’ve also thought a lot about the role a tool like this could play in the classroom. English teachers could use it to explore both basic language (what is an adverb?) and more complex questions (how did Shakespeare use hyphenates?). History classrooms could examine patterns in language over time (how does Twitter compare to Moby Dick?). Because the tool would be open source and would have an API, Computer Science instructors could use it as a basis to teach about computation and language.

With all of this in mind I built WordPlay, a freeform NLP tool that lets you search across various bodies of text without having to know how to code or to learn any kind of strange syntax.

It’s really easy to use. Whatever you type into the query box will be used as a pattern to search across the current corpus. For example, we might search for text similar to ‘a terrible fate’ across Shakespeare’s collected works:

[Screenshot: WordPlay results for ‘a terrible fate’ in Shakespeare’s collected works]

Here is the same search, performed across the Bible:

[Screenshot: WordPlay results for ‘a terrible fate’ in the Bible]

And within Sigmund Freud’s A General Introduction to Psychoanalysis:

[Screenshot: WordPlay results for ‘a terrible fate’ in A General Introduction to Psychoanalysis]

The system works by trying to find a match to the query text in two different ways:

First, it tries to match to part-of-speech. ‘A terrible fate’ contains a determiner, followed by an adjective, followed by a noun, so the best results will also have these parts of speech, in the same order.

Next, it attempts to match how the query sounds. Specifically, it considers the stress pattern of the word or phrase: in the case of ‘a terrible fate’ it’ll look for other phrases with the same syllable counts (1-3-1) and the same stress pattern.

The result of these two strategies is that the top search results in WordPlay should really sound like the query. And it works: go ahead and sing the results to the Shakespearean Danger Zone (Kenny Loggins, eat your heart out!).
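
As a rough illustration of the two strategies above, here is a minimal Python sketch. It is not WordPlay's actual code: the tiny lexicon, the tag names, the 0/1 stress notation and the scoring are all assumptions made for the example. A real tool would draw on a part-of-speech tagger and a pronunciation dictionary.

```python
# Toy lexicon: word -> (part of speech, stress per syllable).
# "1" marks a stressed syllable, "0" an unstressed one, so the length
# of each stress string is also the word's syllable count.
# These entries are hand-written for illustration only.
LEXICON = {
    "a": ("DET", "0"),
    "the": ("DET", "0"),
    "terrible": ("ADJ", "100"),
    "fate": ("NOUN", "1"),
    "beautiful": ("ADJ", "100"),
    "day": ("NOUN", "1"),
    "heavy": ("ADJ", "10"),
    "crown": ("NOUN", "1"),
    "run": ("VERB", "1"),
}


def profile(phrase):
    """Return (POS tags, stress patterns) for a phrase of known words."""
    words = phrase.lower().split()
    tags = tuple(LEXICON[w][0] for w in words)
    stresses = tuple(LEXICON[w][1] for w in words)
    return tags, stresses


def score(query, candidate):
    """Score a candidate from 0 to 2: one point per matching strategy."""
    q_tags, q_stress = profile(query)
    c_tags, c_stress = profile(candidate)
    return int(q_tags == c_tags) + int(q_stress == c_stress)


def rank(query, candidates):
    """Sort candidates so the closest matches to the query come first."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


print(rank("a terrible fate", ["the heavy crown", "run the day", "the beautiful day"]))
# → ['the beautiful day', 'the heavy crown', 'run the day']
```

‘The beautiful day’ matches on both parts of speech and stress, ‘the heavy crown’ on parts of speech only, and ‘run the day’ on neither.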

WordPlay is meant to be a simple tool, so I’ll end the explanation here. Go play with it.


On Data and Performance

Data live utilitarian lives. From the moment they are conceived, as measurements of some thing or system or person, they are conscripted to the cause of being useful. They are fed into algorithms, clustered and merged, mapped and reduced. They are graphed and charted, plotted and visualized. A rare datum might find itself turned into sound, or, more seldom, manifested as a physical object. Always, though, the measure of the life of data is in its utility. Data that are collected but not used are condemned to a quiet life in a database. They dwell in obscure tables, are quickly discarded, or worse (cue violin) – labelled as ‘exhaust’.

Perhaps this isn’t the only role for a datum? To be operated on? To be useful?

Over the last couple of years, my collaborators Ben Rubin and Mark Hansen and I have been investigating the possibility of using data as a medium for performance. Here, data becomes the script, or the score, and in turn technologies that we typically think of as tools become instruments, and in some cases performers.

The most recent manifestation of these explorations is a performance called A Thousand Exhausted Things, which we recently staged at The Museum of Modern Art, with the experimental theater group Elevator Repair Service. In this performance, the script is MoMA’s collections database, an eighty-year-old archive of some 120,000 objects. The instruments are a variety of custom-written natural language processing algorithms, which are used to turn the text of the database (largely the titles of artworks) into a performable form.

The first version of the performance itself is 15 minutes long. During this entire period, all of the dialogue that is spoken by the actors is either a complete title of an artwork, or a name of an artist. A data visualization, projected above the performers, shows the objects as abstracted forms as each artwork is mentioned:

By using such a non-conventional form to engage with the collections database, we’re asking the audience to think of the database as not just a myriad of rows and columns, but as a cultural artifact. The collection is shown as not only a record of the museum’s history, but of changing trends in contemporary art. It also allows a way for the artworks themselves to engage with one another in a fashion which is outside the usual curatorial limitations.

These are the first nineteen lines of the performance:

Gainsboro’ Girl
“Young Girl, Back Turned”
Girl with a Mandolin (Fanny Tellier)
Interior with a Young Girl (Girl Reading)
“Head and Bust of a Woman, Three-Quarters to Left”
Head of a Sleeping Woman (Study for Nude with Drapery)
Sleeping Girl
Young Girl with Braids
Young Girl with Long Hair
“Françoise with Long Neck. I, IV”
Tableau I: Lozenge with Four Lines and Gray
Spanish Girl
Another Girl Another Planet
Designs for an Overpopulated Planet: Foragers

Here we’ve used an algorithm which seeks to build a ‘chain’ of like-sounding titles from the database. The algorithm attempts to make the chain longer and longer, until it can’t find a suitable title, in which case it returns to the seed word (in this case ‘Girl’). It’s a linguistic game, but it serves to curate a selection of works which may not normally be placed side by side. Jacques Villon’s 1908 etching ‘Young Girl, Back Turned’ leads us to Picasso’s ‘Girl with a Mandolin (Fanny Tellier)’, from 1910. John Candelero’s photograph ‘Spanish Girl’ calls out Michael Almereyda’s film ‘Another Girl Another Planet’.
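
The chaining idea can be sketched in a few lines of Python. This is an approximation rather than the algorithm used in the performance: the function names are invented for the example, and ‘like-sounding’ is stood in for by a simple count of shared words (including small words such as ‘with’), where the real system presumably uses something richer.

```python
import re


def words(title):
    """Lowercased set of alphabetic words in a title."""
    return set(re.findall(r"[a-z]+", title.lower()))


def similar(a, b):
    """Stand-in similarity: the number of words two titles share."""
    return len(words(a) & words(b))


def build_chain(titles, seed, length):
    """Greedily chain titles; fall back to the seed word when stuck."""
    unused = list(titles)
    chain = []
    current = seed
    while unused and len(chain) < length:
        # Pick the unused title closest to the current link.
        best = max(unused, key=lambda t: similar(current, t))
        if similar(current, best) == 0:
            if current == seed:
                break  # nothing left matches even the seed word
            current = seed  # the chain is stuck: return to the seed
            continue
        chain.append(best)
        unused.remove(best)
        current = best
    return chain


titles = [
    "Young Girl, Back Turned",
    "Girl with a Mandolin (Fanny Tellier)",
    "Tableau I: Lozenge with Four Lines and Gray",
    "Spanish Girl",
    "Another Girl Another Planet",
    "Designs for an Overpopulated Planet: Foragers",
]
for title in build_chain(titles, seed="Girl", length=10):
    print(title)
```

With these six titles the chain runs out after the Mondrian, returns to ‘Girl’, and then continues through ‘Spanish Girl’ to the two ‘Planet’ titles.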

Perhaps the most exciting part about performance as a medium for data is that it allows for a fluid interpretation at the time of the performance itself. In this case, the skilled actors of Elevator Repair Service turn a dry algorithmic output into a wry dialogue of one-upmanship, allowing the artworks themselves to become pieces in an imagined language game. The possibilities for interpretation are magnified as the relationship moves from data => viewer to data => performer => viewer.

Later in A Thousand Exhausted Things an actor reads, in order, the most frequently occurring first names of artists in the MoMA collection (you can watch the video below). The first 41 of them are men’s names. John leads to Robert and David, through Max and Otto, all the way to Bruce & Carl before we hear from our first woman (Mary). While you might be able to imagine a data visualization which would show this gender imbalance more clearly (some would probably argue for a simple list), it’s difficult to conceive of a print or screen-based form delivering the message with similar impact.
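
The list the actor reads is, computationally, a simple frequency ranking. A minimal sketch, assuming artist names arrive as plain ‘First Last’ strings (the sample names below are invented placeholders, not MoMA data):

```python
from collections import Counter


def first_name_ranking(artist_names):
    """Rank first names by how many artists share them, most common first."""
    counts = Counter(name.split()[0] for name in artist_names if name.strip())
    return [first for first, _ in counts.most_common()]


# Invented placeholder names, for illustration only.
sample = ["John A", "John B", "John C", "Robert A", "Robert B", "Mary A"]
ranking = first_name_ranking(sample)
print(ranking)                # → ['John', 'Robert', 'Mary']
print(ranking.index("Mary"))  # → 2 (two names are read before hers)
```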

We are not the only ones who are exploring the possibilities of data and performance. Providence-based artist Brian House has composed and performed several musical works based on data, including ‘YOU’LL JUST HAVE TO TAKE MY WORD FOR IT’, a piece for a small ensemble (two electric guitars and a tenor saxophone) which interprets black box data from Massachusetts Lieutenant Governor Tim Murray’s infamous car crash. Sculptor Nathalie Miebach’s ‘sculptural musical scores’ are physical objects, representing weather data, which are meant to be performed by musicians (other pieces by Miebach are designed to be mounted to the body). In House’s and in Miebach’s work, we see data breaking out of its accepted formal restrictions. By forcing us out of our usual framework, this work offers a new lens into event and experience, vastly different from what we would expect in a so-called ‘data representation’.

As data exerts more and more influence on our lived experience, it is important that artists find ways to work with it outside of decades-old visual means like charts and graphs. Performance provides rich terrain for engagement with data, and perhaps allows for a new paradigm in which data are not as much operated on as they are allowed to operate on us.

Art and the API

In 1968, in his seminal essay Systems Esthetics, Jack Burnham wrote:

The specific function of modern didactic art has been to show that art does not reside in material entities, but in relations between people and between people and the components of their environment.

In 2013, this list of relations can be expanded to include those between people and software, as well as those between people and networks. How can art reside within these modern relations, rather than outside of them?

Enter the API.

API is one of those three-letter acronyms (TLAs) which makes only slightly more sense once you know what it stands for. Application Programming Interface; a generic enough term to be applied to many, many pieces of software, lots of which are operating inside of your computer right now. Really the important part of the definition is ‘interface’. I like to think about an API as a bridge which allows one computer program to talk to another computer program.

APIs have a lot of utility, as they can connect disparate programs running on different devices, even if those devices are running completely different operating systems. It’s not much of a stretch to say that any software company would have a set of internal APIs that allow communications between different parts of their software infrastructure. For every public-facing API at Google or Facebook, there are dozens more that are just used inside of the company. There’s a good parallel here to mail services – while a large company might have a mail room that deals with things being delivered to them from the outside world, they also have a lot of internal machinery which allows for mail to be delivered internally. The majority of your day-to-day interaction with social networks, e-mail applications or mobile apps is facilitated by APIs.

If you’ve heard of an API at all, it’s likely the one you’re thinking of is the Twitter API. It is what lets the apps on your phone, or third-party Twitter applications, communicate with all of the central tweeting machinery at Twitter HQ. Twitter took a risk in the beginning of their business by leaving this API open: they gambled that, by allowing businesses to build products around the main tweeting system, they’d end up with many more projects built on Twitter than they could have built themselves. This ‘open API’ model was so successful that plenty of other companies have since jumped on board and offered open APIs of their own. (Recently, Twitter has severely limited access to its API, to the consternation of the broad community of developers who make use of it.)

Because APIs are associated with big companies like Twitter and Facebook and Google, a lot of weight is often attributed to them. The creation of an API can seem almost a reverential act. “They have an API”, we whisper, in hushed tones. Surely they must be hard to build?

As it turns out, you can build simple APIs very… simply. As an example, I spent about an hour writing an API that lets you query for word counts in this article. Go ahead, and try this link:

That number that gets spit out is how many times the word ‘API’ is mentioned on the page (I built it to include comments, so this will change a bit over time). If you’re interested in the extraordinarily simple guts of all of the demo APIs in this post, you can find the code in a GitHub repository. This example shows us that APIs can be built very easily. While you certainly can put a lot of work and weight into an API, making one can also be a quick and expressive way to create bridges and tools between until-now unrelated bodies of content and applications.
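
To give a sense of how little is involved, here is a minimal word-count API written with Python's standard library. It is a sketch, not the code from the repository mentioned above: the `/count?word=...` URL shape, the port and the placeholder `ARTICLE` text are all assumptions made for the example.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Placeholder text; a real version would load the article and its comments.
ARTICLE = "Enter the API. An API is a bridge. They have an API."


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word in text."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


class WordCountHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the word from a query string such as /count?word=API
        query = parse_qs(urlparse(self.path).query)
        word = query.get("word", [""])[0]
        count = count_word(ARTICLE, word) if word else 0
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(str(count).encode("utf-8"))


def serve(port=8000):
    """Run the API locally until interrupted."""
    HTTPServer(("localhost", port), WordCountHandler).serve_forever()
```

Calling `serve()` and then visiting `http://localhost:8000/count?word=API` in a browser would return the count as plain text.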

This act of bridging, enabled by an API, can be a political one. Josh Begley, a data artist living in New York, has recently created an API which allows access to information on every US drone strike, using data from The Bureau of Investigative Journalism. Updated as new strikes are reported and confirmed, the API allows others access to verified and aggregated data on drone activity. I built a small wrapper API for it which returns just the most recent strike:
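
The core of such a wrapper is a single selection step. The sketch below assumes the upstream API returns a list of records, each with an ISO-formatted `date` field; that field name, the `id` field and the sample records are placeholders invented for illustration, not the real API's schema or data.

```python
from datetime import date


def most_recent(records):
    """Return the record with the latest date, or None if there are none."""
    return max(records, key=lambda r: date.fromisoformat(r["date"]), default=None)


# Placeholder records, for illustration only.
sample = [
    {"id": "a", "date": "2012-11-30"},
    {"id": "b", "date": "2013-01-08"},
    {"id": "c", "date": "2012-12-21"},
]
print(most_recent(sample))  # → {'id': 'b', 'date': '2013-01-08'}
```

A wrapper API would fetch the full list from the upstream service, apply this function, and return the single record to the caller.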

Josh’s API is already a useful tool; journalists can use it to feed stories, apps can use it to display updated counts. It could be used for many conceivable art purposes. A good example is Begley’s own recent project, Dronestream, a Twitter stream of every known US drone attack. Pitch Interactive’s recent ‘Out of Sight, out of Mind’ project uses the API to update its interactive timeline of drone strikes. Here, through the use of an API, intentionally secret data becomes exposed: open data from the most closed of sources.

Recall that the basic function of an API is to bridge one piece of software to another. In this way, APIs are conduits for the mash-up, long a preferred creative tool for media artists. Instead of producing a single mash-up, though, a functional API makes a permanent link between two applications, one whose pitch and timbre can change as the data themselves are updated.

The API can act as a clear connection, simply relaying data from one place to another. However, it can also operate on these data, shifting modes and meaning as information is requested and relayed. Instead of returning a single number of people killed in a drone strike, what if we returned a list of names? What if those names were extracted from a US zip code, allowing us to think about how media attention and personal perspective would change if these were Americans dying, instead of Afghans or Pakistanis? More easily, the names could be extracted from our social media feeds. Here is an example API that returns a group of users from my own feed, equal to the number of people killed in the last US drone strike:

This is a heavy-handed, quickly drawn example. But it suggests an interesting idea: the conceptual API. A piece of software architecture intended not only to bridge but also to question. The API as a software art mechanism, intended to be consumed not only by humans, but by other pieces of software. (Promisingly, the API also offers a medium in which software artists can work entirely apart from visual esthetic.)

Burnham wrote in 1968 that ‘the significant artist strives to reduce the technical and psychical distance between [their] artistic output and the productive means of society’. In an age of Facebook, Twitter & Google, that productive means consists largely of networked software systems. The API presents a mechanism for artistic work to operate very close to, or in fact to live within these influential systems.

New Year, New Company: Introducing The Office for Creative Research

In the fall of 2010, my friend Mike Young invited me to come to the New York Times R&D Lab, to discuss a new visualization project that was just starting to get off of the ground. That project became Cascade, and that meeting led to my two-and-a-half year stay at the R&D Lab, as the first Data Artist in Residence. Yesterday, my residency at the New York Times came to an end. This morning, I’m thrilled to announce the official launch of my new company: The Office For Creative Research.

My 28 months (the residency was originally set for four months) at the New York Times was transformational in many, many ways. Cascade, which I initiated with Mark Hansen as a conceptual prototype, became a full-fledged project supported by an entire team of designers, developers and engineers. Along with Jake Porway, Brian House, and Matt Boggie, we built OpenPaths, which continues to be an exciting model for personal engagement with data. Mark and I, working with Alexis Lloyd, also made Memory Maps, a prototype for archive exploration, in which news stories are interwoven with the personal history of the user.

These successful projects were of course accompanied by unfinished sketches, necessary failures and inevitable dead ends. I built a visualization tool for household power usage that went nowhere, a few failed archive exploration tools, and one particularly bad interface for visualizing personal connections on Twitter. The R&D group, conceived and led by Michael Zimbalist, is very much a place that encourages real exploration – and the inevitable failures that result. This freedom to explore and to push boundaries is what has made, and will continue to make NYTLabs fertile ground for ideas and innovation.

Which brings me back to The Office for Creative Research, the new company I’ve founded with Mark Hansen and Ben Rubin. OCR is a multidisciplinary research group focusing on new modes of engagement with data. We’re looking to partner with companies, institutions, scientists, museums – any individual, group or organization who is facing novel problems with data. A browse through our collective portfolio will show our range of approach, from visualization to algorithm design to performance and installation. Our unique range of skills, drawing from both the arts and sciences, gives us the ability to tackle almost any problem, from the laboratory to the gallery, and everywhere in between.

We’ve outlined the mission of The Office for Creative Research in this memorandum, released today, and you can see more of our work on OCR’s freshly-launched website. While we already have a set of fascinating projects on the go for 2013, we are looking for innovative new partners. Please get in touch if you’d like to explore the possibility of working with OCR. Also, we’ll be looking to hire talented people in the spring, so if you’d like to work in New York City, exploring the borders between data, technology & culture, send us a message.

It’s going to be an exciting year. We’ll be running a series of workshops at OCR starting next month, and we’ll be publishing a journal at the end of 2013 documenting the progress of our research. For regular news and data-related commentary, you can follow The Office For Creative Research on Twitter – @The_O_C_R.

I’d be remiss not to end this post with a thank-you to the many talented people at the New York Times who made my time there so tremendously enjoyable. It’s a world-class organization, filled with world-class human beings, and I’ll always be grateful for having had the chance to spend time there.

Happy New Year,


Before Us is the Salesman’s House

When the dust settles on the 21st century, and all of the GIFs have finished animating, the most important cultural artifacts left from the digital age may very well be databases.

How will the societies of the future read these colossal stores of information?

Consider the eBay databases, which contain information for every transaction that happens and has happened on the world’s biggest marketplace. $2,094 worth of goods are sold on eBay every second. The records kept about this buying and selling go far beyond dollars and cents. Time, location and identity come together with text and images to leave a record that documents both individual events, as well as collective trends across history and geography.

This summer, Mark Hansen and I created an artwork, installed at the eBay headquarters in San Jose, which investigates this idea of the eBay database as a cultural artifact. Working in cooperation with eBay, Inc., and the ZERO1 Biennial, the piece was installed outside of the eBay headquarters and ran dusk to midnight from September 11th to October 12th.

As a conceptual foundation for the piece, we chose a much more traditional creative form than the database: the novel. Each movement begins with a selection of text. The first one every day was a stage direction from the beginning of Death of a Salesman which reads:

A melody is heard, played upon a flute. It is small and fine, telling of grass and trees and the horizon. The curtain rises.
Before us is the Salesman’s house. We are aware of towering, angular shapes behind it, surrounding it on all sides. Only the blue light of the sky falls upon the house and forestage; the surrounding area shows an angry glow of orange. As more light appears, we see a solid vault of apartment houses around the small, fragile-seeming home. An air of the dream clings to the place, a dream rising out of reality. The kitchen at center seems actual enough, for there is a kitchen table with three chairs, and a refrigerator. But no other fixtures are seen. At the back of the kitchen there is a draped entrance, which leads to the living room. To the right of the kitchen, on a level raised two feet, is a bedroom furnished only with a brass bedstead and a straight chair. On a shelf over the bed a silver athletic trophy stands. A window opens onto the apartment house at the side.

From this text, we begin by extracting items[1] that might be bought on eBay:

Flute, grass, trees, curtain, table, chairs, refrigerator. This list serves now as a kind of inventory, each explored in a small set of data sketches which examine distribution: Where are these objects being sold right now? How much are they being sold for? What does the aggregate of all of the refrigerators sold in the USA look like?
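
A very reduced version of the extraction step might look like the following Python sketch. It is not the method used in the piece, which combined a text-analysis algorithm with human review; here a hand-written vocabulary (`SELLABLE`, an assumption made for the example) stands in for both, and plurals are handled with a deliberately crude rule.

```python
import re

# Hypothetical vocabulary of things that could be listed for sale.
SELLABLE = {"flute", "grass", "tree", "curtain", "table", "chair", "refrigerator"}


def singular(word):
    """Very rough singularization: strip one trailing 's', but not from 'ss'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def extract_items(text, vocabulary=SELLABLE):
    """Return sellable items in order of first appearance, without repeats."""
    found = []
    for token in re.findall(r"[a-z]+", text.lower()):
        item = singular(token)
        if item in vocabulary and item not in found:
            found.append(item)
    return found


passage = (
    "A melody is heard, played upon a flute. It is small and fine, "
    "telling of grass and trees and the horizon. The curtain rises."
)
print(extract_items(passage))
# → ['flute', 'grass', 'tree', 'curtain']
```

Each extracted word could then be used as a search term against current listings to produce the inventory described above.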

From this map of objects for sale, the program selects one at random to act as a seed. For example, a refrigerator being sold for $695 in Milford, New Hampshire, will switch the focus of the piece to this town of fifteen thousand on the Souhegan river. The residents of Milford have sold many things on eBay over the years – but what about books? Using historical data, we investigate the flow of books into the town, both sold and bought by residents.

Finally, the program selects a book from this list[2] and re-starts the cycle, this time with a new extracted passage, new objects, new locations, and new stories. Over the course of an evening, about a hundred cycles are completed, visualizing thousands of current and historic exchanges of objects.

Ultimately, the size of a database like eBay’s makes a complete, close reading impossible – at least for humans. Rather than an exhaustive tour of the data, then, our piece can be thought of as a distant reading[3], a kind of a fly-over of this rich data landscape. It is an aerial view of the cultural artifact that is eBay.

A motion sample of three movements from the piece can be seen in this video.

Before Us is the Salesman’s House was projected on a 30′ x 20′ semi-transparent screen, suspended in the entry way to the main building (I’m afraid lighting conditions were far from ideal for photography). It was built using Processing 2.0, MongoDB & Python. Special thanks to Jaime Austin, Phoram Meta, Jagdish Rishayur, David Szlasa and Sean Riley.

  1. Items are extracted through a combination of a text-analysis algorithm and, where needed, processing by helpful folks on Mechanical Turk.
  2. All text used comes from Project Gutenberg, a database of more than 40,000 free eBooks
  3. For more about distant reading, read this essay by Franco Moretti, or, for a summary, this article from the NYTimes