This morning, The Guardian announced the launch of The Guardian Open Platform, a suite of services designed to give developers access to Guardian content. This follows hot on the heels of The New York Times’ API releases, which I have discussed in detail on this blog and explored in a series of visualizations.
Comparisons to the NYTimes APIs will be inevitable. Instead of debating the various selling points of both systems, I’ll give a short introduction to what is available from the Guardian and show a few early sketches that I have made with the data.
The most interesting thing about the Guardian Content API is that it offers access to full text for every article. This is good, in that it gives us a much bigger set of data to work with. Unfortunately, the first version of the API doesn’t let you control the verbosity of the return, so each call can bring back a lot of content to process. This means that making ‘simple’ visualizations of keyword frequency can take a lot longer. Though I’ve started by building some simple graphing tools, I am excited about being able to dig into the full body text of the articles. I think there are a lot of possibilities there for linguistic analysis and the like.
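As a rough illustration of the keyword-frequency idea, here is a minimal Python sketch that counts how often given words appear in a block of article text. The function name and the sample sentence are made up for illustration; real body text would come from the API responses.

```python
import re
from collections import Counter

def keyword_counts(body_text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword."""
    # Lowercase the text and split it into alphabetic words.
    words = Counter(re.findall(r"[a-z]+", body_text.lower()))
    # A Counter returns 0 for words that never appear.
    return {keyword: words[keyword.lower()] for keyword in keywords}

# Invented sample text, standing in for the body of a real article.
sample = "Beckham scored twice as Beckham and the team won the match."
print(keyword_counts(sample, ["Beckham", "match", "goal"]))
```

Since every call returns full article text, a count like this has to run over far more words than a headline-only search would, which is where the extra processing time goes.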
The Content API doesn’t allow for faceted searching in the same way as the NYT Article Search API, but it does give us some fairly easy ways to refine and control our searches. For example, if I wanted to find out how many times the Guardian mentioned David Beckham, I might use a query like this:
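A query of this kind could be put together in Python roughly as follows. This is only a sketch: the base URL is a placeholder, and the `q` parameter name is my assumption, so check the Guardian documentation for the real endpoint, parameter names, and any API key requirement.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real search URL from the Guardian docs.
BASE_URL = "https://example.com/content/search"

def search_url(term, **params):
    """Build a search URL for a term plus any extra query parameters."""
    query = {"q": term}  # "q" is an assumed name for the search-term parameter
    query.update(params)
    return BASE_URL + "?" + urlencode(query)

print(search_url("david beckham"))
# prints https://example.com/content/search?q=david+beckham
```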
I can narrow down on a specific chunk of time using the before and after parameters:
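A date-limited search might be sketched like this in Python. The `before` and `after` parameters are the ones mentioned above; the placeholder base URL, the `q` parameter name, and the YYYYMMDD date format are assumptions on my part.

```python
from datetime import date
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real search URL from the Guardian docs.
BASE_URL = "https://example.com/content/search"

def search_between(term, after, before):
    """Build a search URL restricted to the window between two dates."""
    query = {
        "q": term,                            # assumed parameter name
        "after": after.strftime("%Y%m%d"),    # assumed date format
        "before": before.strftime("%Y%m%d"),  # assumed date format
    }
    return BASE_URL + "?" + urlencode(query)

print(search_between("beckham", date(2007, 1, 1), date(2007, 12, 31)))
# prints https://example.com/content/search?q=beckham&after=20070101&before=20071231
```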
I can further refine the results of this search by using filters, which are at the core of how the API works, and can be very useful in locating specific sets of information. For example, this search would result in stories about Beckham and football:
Whereas this search would result in stories which had a cultural angle:
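The two filtered searches described above could be built along these lines. Treat everything here as hypothetical: the base URL is a placeholder, and the `filter` parameter name and the `/football` and `/culture` filter codes are guesses, so look up the real codes in the filter list the API provides.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real search URL from the Guardian docs.
BASE_URL = "https://example.com/content/search"

def filtered_search(term, filter_code):
    """Build a search URL that narrows a term search with one filter code."""
    query = {"q": term, "filter": filter_code}  # both parameter names are assumed
    # safe="/" keeps the slash in the filter code readable instead of %2F.
    return BASE_URL + "?" + urlencode(query, safe="/")

# Hypothetical filter codes, used only to show the shape of the two queries.
print(filtered_search("beckham", "/football"))
print(filtered_search("beckham", "/culture"))
```

The search term stays the same in both calls; only the filter changes which stories come back.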
A full list of filters can be retrieved through the API endpoint, and every piece of content returned by a search also includes a list of its related tags and filter codes.
The return from these calls can be retrieved as XML, JSON, or Atom by changing the format parameter. Full documentation is available on the Guardian site.
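Here is a small sketch of how the format choice might be handled on the client side in Python. The `format` parameter is the one described above; the placeholder base URL, the `q` parameter name, and the lowercase format values are assumptions. The parsing side relies only on the fact that Atom is itself an XML format, so one XML parser covers both.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real search URL from the Guardian docs.
BASE_URL = "https://example.com/content/search"
FORMATS = ("xml", "json", "atom")  # assumed spellings of the format values

def search_url(term, fmt="json"):
    """Build a search URL that asks for a specific response format."""
    if fmt not in FORMATS:
        raise ValueError("unsupported format: " + fmt)
    return BASE_URL + "?" + urlencode({"q": term, "format": fmt})

def parse_response(body, fmt):
    """Parse a response body according to the format that was requested."""
    if fmt == "json":
        return json.loads(body)
    # XML and Atom responses are both XML documents.
    return ET.fromstring(body)

print(search_url("beckham", "atom"))
# prints https://example.com/content/search?q=beckham&format=atom
```

JSON is probably the least work if you just want counts for a graph, while the XML formats may suit tools that already consume feeds.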
Along with the Content API, the Guardian has also launched their Data Store – a collection of curated data that has been used in the past by the Guardian’s editorial staff when researching articles and producing projects. I will be writing more about this later this week, but for now you can check out the offerings on the Data Store site, and read a bit about it on the Guardian’s Data Blog.