Tag Archives: google

genealogy of a braeburn

It has been observed that when systems break down we get to actually see how they operate. I wonder what this breakage below says about the use of Freebase and Wikipedia data in Google’s Knowledge Graph.

Yes, that’s an image of Braeburn from My Little Pony to the right, and text about the apple to the left. Interestingly it’s fine at Wikipedia:

And it’s not even there in Freebase (according to a search).

I don’t know if this reveals what’s going on in the flow of entities between Wikipedia, Freebase and Google. But I thought it was interesting. I wonder where to report such an anomaly. Is there a place?

Thanks to Jeff Godin in #code4lib for noticing the breakage in Knowledge Graph.

See also Hilary Mason’s post about how her identity got mixed up on Bing. (Thanks Chris).

Update: 2012-02-04

I thought to check a week later, and the Knowledge Graph results got even funnier. Now it’s a collage of apples and My Little Pony:

level 0 linked archival data

TLDR; let’s see if we can share structured archival data better by adding HTML <link> elements that point at our EAD XML files.

A few weeks ago I attended a small meeting of DC museums, archives and libraries that were discussing what Linked Data means for Archives. Hillel Arnold and I took collaborative notes in Pirate Pad. For a good part of the time we went around the room talking about how we describe archival collections with various workflows using Encoded Archival Description (EAD), and how this was mostly working (or not).

Some good work has already been done imagining how Linked Data can transform archival description by the LOCAH (now Linking Lives) project as well as the Social Networks and Archival Context project. I think tools like Editors’ Notes, CWRC Writer, and Google’s Research Pane could provide really useful models for how the work of an archivist could benefit from linking to external resources such as Wikipedia, dbpedia, VIAF, etc. But we really didn’t talk about that in too much detail. The focus instead was on various tools people used in their EAD workflows: Archivists’ Toolkit, Oxygen, ExistDB, Access databases, etc … and the hope that Archives Space could possibly improve matters. We did touch briefly on what it means to make finding aids available on the Web, but not in a very satisfactory way.

I was really struck by how everyone was using EAD, even if their tools were different. I was also left with the lingering suspicion that not much of this EAD data was linked to from the HTML presentation of the finding aid. After some conversations it was also my understanding that even after 20 years of work on EAD, there is not a listing of websites that make EAD finding aids available. It seems particularly sad that institutions have invested a lot of time and effort in putting EAD into practice, and yet we still aren’t really sharing them very well with each other.

So in a bit of a fit of frustration I did some hacking to see if I could use Google and ArchiveGrid to identify websites that serve up finding aids either as HTML or as EAD XML. I wanted to:

  1. Get a list of websites that made HTML and EAD XML finding aids available. We can rely on Google to index the Web, but maybe we could index the archival web a bit better ourselves if we had a better understanding of where the EAD data was available. The idea is that this initial list could be used to bootstrap a list of websites making EAD finding aids available in the Wikipedia entry for EAD.
  2. See which websites have HTML representations that link to an EAD XML representation. The rationale here is to encourage a very simple best practice for linking to structured archival data when it is available. More on that below.

I was able to identify 201 hosts that served up finding aids either as HTML or XML. You should be able to see them here in this spreadsheet. I also collected URLs for finding aids (both HTML and XML) that I was able to locate, which can be seen in this JSON file.

With the URLs in hand I wrote a little script to examine which of the 156 hosts serving up HTML representations of finding aids had a link to an XML EAD document. I looked for a very simple kind of link that was popularized by the RSS and Atom syndication community for autodiscovery of blog feeds: a <link> tag that has a rel attribute of alternate and a type attribute set to application/xml. Out of the 156 websites serving up HTML representations of finding aids I could only find two websites that used this link pattern: Princeton University and Emory University.

For example if you view the HTML source for the Einstein Collection finding aid at Princeton you’ll see this link:

<link rel="alternate" type="application/xml" href="http://findingaids.princeton.edu/collections/C1022.xml" />

Similarly the finding aid for the Salman Rushdie collection at Emory University has this link:

<link rel="alternate" type="application/xml" href="/documents/rushdie1000/EAD/" />

As the title of this blog post suggests, I’m calling this pattern level 0 linked data. Linked Data purists would probably say this isn’t Linked Data at all since it doesn’t involve an RDF serialization. And I guess they would be right. But it does express a graph of HTML and EAD data that is linked, and it serves a real need. If you are interested in Linked Data and archives I encourage you to add these links to your HTML finding aids today.
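Here is a minimal sketch of the kind of check described above: given the HTML of a finding aid, look for <link> elements with a rel of alternate and a type of application/xml. It is not the script I actually ran. The names AlternateXMLLinkFinder and find_ead_links are made up for illustration, and example.org is a placeholder host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateXMLLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate" type="application/xml"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags like <link ... />
        if tag != "link":
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        mime = (attrs.get("type") or "").lower()
        href = attrs.get("href")
        if "alternate" in rels and mime == "application/xml" and href:
            # Resolve relative hrefs against the URL of the HTML page
            self.links.append(urljoin(self.base_url, href))


def find_ead_links(html, base_url):
    """Return absolute URLs for any alternate XML links found in html."""
    finder = AlternateXMLLinkFinder(base_url)
    finder.feed(html)
    return finder.links


if __name__ == "__main__":
    page = (
        '<html><head><link rel="alternate" type="application/xml" '
        'href="/documents/rushdie1000/EAD/" /></head></html>'
    )
    # example.org stands in for the real host of the finding aid
    print(find_ead_links(page, "http://example.org/documents/rushdie1000/"))
```

Resolving the href against the page URL matters because some sites use a relative link, as in the Emory example, while others use an absolute one, as in the Princeton example.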

So why are these links important?

The main reason is that they are found in HTML documents, which are the representations that matter most on the Web. HTML documents are read by people. They are hypertext documents that link to and from other places on an archive’s website and elsewhere on the Web at large. They are well understood technically by the Web development community…if you hire a developer they might have strong feelings about using PHP or Ruby, but they will know HTML backwards and forwards. They are crawled and indexed by search engine bots so that researchers around the world can discover our collections. They are cited in social environments like Twitter, Facebook, blog posts, etc. We have a responsibility to create stable homes (URLs) for our archival descriptions that fit into the Web.

The other reason these links are important is that they make our investment in EAD visible on the Web for anyone who is looking. Nobody but ArchiveGrid actively crawls EAD XML data. They are the only ones that can find the documents, because they have been told where they are. If we did a better job of advertising the availability of our EAD documents I think we would see more tools and services around them. ArchiveGrid is a good example of the sort of tool that could be built on top of a web of EAD data. But what about archival collections in your local area? Perhaps it would be useful to have a service that let you look across the archival holdings of institutions in a consortium you belong to. Or perhaps you might want to create an alerting service that lets researchers know what new archival collections are being made available. Or maybe you need to collaborate with archives in a specific domain, and need tools that provide a custom experience for that distributed collection. I imagine there would be lots of ideas for apps if there were just a teensy bit more thought put into how finding aids (both the HTML and the XML) are put on the Web, and how we shared information about their availability.

Going forward I think HTML5 microdata and RDFa present some excellent opportunities for Linked Data representations of finding aids, especially when you consider some of the vocabulary development being done around them, as well as some of the work being done by Tim Sherratt on using linked data to create new user experiences around archival data. But if your institution has already invested in creating EAD documents I think trying this link pattern with data you already have could be a good first step towards introducing linked data into your archive. I hope it is a first baby step that archives can take in merging some of the structured data found in the EAD XML document into the HTML they publish about their collections.

I’m planning on getting the list of EAD publishers into the Wikipedia article for EAD, and putting out a call for others to add their website if it is missing. I also think that a simple crawling and aggregation service that uses the links in some fashion could encourage more linking. A lot of this blog post has been mental preparation for my involvement in an IMLS funded project run out of Tufts that will be looking at Linked Archival Metadata, which is about to be kicked off this winter. If you’ve read this far, and have any thoughts or suggestions about this I’d enjoy hearing them either here, on Twitter or via email.

NoDB

Last week Liam Wyatt emailed me asking if I could add The National Museum of Australia to Linkypedia, which tracks external links from Wikipedia articles to specific websites. Specifically Liam was interested in seeing the list of articles that reference the National Museum, sorted by how much they are viewed at Wikipedia. This presented me with two problems:

  1. I turned Linkypedia off a few months ago, since the site hadn’t been seeing much traffic, and I have not yet figured out how to keep the site going on the paltry Linode VPS I’m using for other things like this blog.
  2. I hadn’t incorporated Wikipedia page view statistics into Linkypedia, because I didn’t know they were available, and even if I had I didn’t have Liam’s idea of using them in this way.

#1 was easily rectified since I still had the database lying around and had just disabled the Linkypedia Apache vhost. I brought it up, and added www.nma.gov.au to the list of hosts to monitor. #2 was surprisingly easy to solve as well, with a somewhat hastily written Django management command that pulls down the hourly stats and rifles through them looking for links to sites that Linkypedia manages. Incidentally the Python requests library makes efficiently iterating through large amounts of gzip compressed text at a URL very simple indeed.
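To give a flavor of the streaming, here is a rough sketch rather than the actual management command. It assumes each line of the hourly file has four space delimited fields (project, title, views, bytes), and the function names count_views and count_views_at are hypothetical. The parsing works on any gzip compressed file-like object, so the requests part is kept separate.

```python
import gzip
import io


def count_views(gz_stream, titles, project="en"):
    """Sum hourly views for the given article titles.

    gz_stream is a binary file-like object holding gzip compressed lines
    that are assumed to look like: project title views bytes
    """
    wanted = set(titles)
    counts = {}
    with gzip.GzipFile(fileobj=gz_stream) as gz:
        for raw in gz:  # decompressed incrementally, one line at a time
            parts = raw.decode("utf-8", errors="replace").split(" ")
            if len(parts) != 4 or parts[0] != project:
                continue
            title, views = parts[1], parts[2]
            if title in wanted and views.isdigit():
                counts[title] = counts.get(title, 0) + int(views)
    return counts


def count_views_at(url, titles):
    """Stream the file at url with requests instead of downloading it first."""
    # Imported here so the parsing above can be used without requests installed
    import requests

    response = requests.get(url, stream=True)
    response.raise_for_status()
    # response.raw is the undecoded byte stream, which GzipFile can read from
    return count_views(response.raw, titles)


if __name__ == "__main__":
    sample = gzip.compress(b"en Ned_Kelly 12 3456\nen Emu 7 100\n")
    print(count_views(io.BytesIO(sample), ["Emu"]))
```

Passing stream=True means the response body is read as it is needed, so the whole hourly file never has to sit in memory at once.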

After the management command had run for a few days and the stats for 2012 had been added to the Linkypedia database, I was able to see that the top 10 most accessed Wikipedia articles that linked to the National Museum were: Mick Jagger, Cricket, Kangaroo, Victorian Era, Ned Kelly, James Cook, Thylacine, Indigenous Australians, Playing Card and Emu … if you are curious the full list is available with counts, as well as similar lists for the Museum of Modern Art, the Biodiversity Heritage Library, Wikileaks, Thomas and other sites which Linkypedia also monitors somewhat sporadically.

So this was interesting…but it’s not actually what I set out to write about today.

While I had seen aggregate reports of Wikipedia page view data, prior to Liam’s email I didn’t know that hourly dumps of page view statistics existed. I recently did a series of experiments with realtime Wikipedia data, so naturally I wondered what might be do-able with the page-view stats. The data is gzip compressed and space delimited, which made it a perfect fit for noodling around in the Unix shell with curl, cut, sort, etc. Before much time passed I had a bash script that could run from cron every hour and dump out the top 1000 accessed English Wikipedia pages as a JSON file:

#!/bin/bash
 
# fetch.sh is a shell script to run from cron that will download the latest 
# hour's page view statistics and write out the top 1000 to a JSON file
# you will probably want to run it sufficiently after the top of the hour
# so that the file is likely to be there, e.g.
#
# 30 * * * * cd /home/ed/Projects/wikitrends/; ./fetch.sh
 
# first, get some date info
 
year=`date -u +%Y`
month=`date -u +%m`
day=`date -u +%d`
hour=`date -u +%H`
 
# this is the expected URL for the pagecount dump file
 
url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0000.gz" 
 
# sometimes the filenames have a timestamp of 1 second instead of 0 
# so if 0000.gz isn't there try using 0001.gz instead
 
curl -f -s -I $url > /dev/null
retval=$?
if [ $retval -ne 0 ]; then
    url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0001.gz" 
fi
 
# create a directory and filename for the JSON
 
json_dir="data/$year/$month/$day"
json_file="$json_dir/$hour.json"
mkdir -p $json_dir
 
# fetch the data and write out the stats!
 
curl --silent $url | \
    gunzip -c | \
    egrep '^en ' | \
    perl -ne '@cols=split/ /; print "$cols[2] $cols[1]\n";' | \
    sort -rn | \
    head -n 1000 | \
    ./jsonify.py > $json_file
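
The jsonify.py script at the end of the pipeline isn't shown here. A minimal sketch of what such a script could look like follows, assuming it reads "count title" lines on stdin and writes a JSON list on stdout. The key names page and count are made up for illustration, and the real script may well differ.

```python
#!/usr/bin/env python
import json
import sys


def jsonify(lines):
    """Turn "count title" lines into a list of dicts, keeping their order."""
    pages = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        count, _, title = line.partition(" ")
        # Skip anything that does not start with a numeric count
        if not count.isdigit() or not title:
            continue
        pages.append({"page": title, "count": int(count)})
    return pages


if __name__ == "__main__":
    json.dump(jsonify(sys.stdin), sys.stdout, indent=2)
```

Since sort -rn and head have already done the ranking, the script only has to preserve the order it is given.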

A short time after that I had some pretty simple HTML and JavaScript that could use the JSON to display the top 25 accessed Wikipedia articles from the last hour.

I put it up on GitHub as Wikitrends. It was kind of surprising at first to see how prominently mainstream media (mostly television) figures in the statistics. Perhaps it was a function of me working on the code at night when “normal” people are typically watching TV and looking up stuff related to the shows they are watching. I did notice some odd things pop up in the list occasionally and found myself googling to see if there was recent news on the topic.

To help provide some context I added flyovers so that when you hover over the article title you will see the summary of the Wikipedia article. Behind the scenes the JavaScript looks up the article using the Wikipedia API and extracts the summary. This got me thinking that it could also be useful to include some links to canned searches at Google (last hour), Twitter and Facebook to provide context for the spike that the Wikipedia article is seeing. Perhaps it would be more interesting to see this information flow by somehow on the side…
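The app does the summary lookup in JavaScript, but the idea can be sketched in Python to keep things short. This assumes the MediaWiki API's extracts property (prop=extracts with exintro and explaintext) is used to get a plain text introduction, which may not be the exact call the app makes. The function names and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def summary_url(title):
    """Build an API URL asking for the plain text intro of an article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "redirects": 1,
        "format": "json",
        "titles": title,
    }
    return API + "?" + urlencode(params)


def extract_summary(response):
    """Pull the first page's extract out of a decoded API response."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract", "")
    return ""


def fetch_summary(title):
    """Fetch and return the summary for title (makes a network request)."""
    # A descriptive User-Agent is good manners when calling the API
    request = Request(summary_url(title),
                      headers={"User-Agent": "wikitrends-sketch/0.1"})
    with urlopen(request) as f:
        return extract_summary(json.load(f))
```

Keeping the URL building and response parsing apart from the network call makes it easy to try the parsing against a saved response.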

A nice side effect of this minimalist (hereby dubbed NoDB…take that NoSQL!) approach to developing the Wikitrends app is that I have uniquely named, URL addressable JSON files for the top 1000 accessed English Wikipedia articles every hour at http://inkdroid.org/wikitrends/data. Even better, the JSON files get archived at GitHub. Now don’t take this seriously, it’s late (or early) and I’m really just making a really lame joke. I’m sure having a database would be useful for trend analysis and whatnot. But for my immediate purposes it wasn’t really needed.
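As a small illustration of what URL addressable means here, the sketch below builds the address of the JSON file for a given hour. It assumes the files on the web mirror the year/month/day/hour.json layout that fetch.sh writes under its data directory. The function name hourly_json_url is made up.

```python
from datetime import datetime

BASE = "http://inkdroid.org/wikitrends/data"


def hourly_json_url(when, base=BASE):
    """Return the assumed URL of the top 1000 JSON file for an hour.

    when should already be expressed in UTC, since fetch.sh uses date -u.
    """
    return "{}/{}.json".format(base, when.strftime("%Y/%m/%d/%H"))


if __name__ == "__main__":
    print(hourly_json_url(datetime(2012, 2, 4, 7)))
```

With a scheme like this, a script that wants a day's worth of data only has to loop over 24 predictable URLs, with no query interface needed.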

So, Wikitrends is another Wikipedia stats curiosity app. At the very least I got a chance to use JavaScript a bit more seriously by working with Underscore and the very slick async library. Perhaps there are some ways you can think of to make Wikitrends more useful or interesting. If so please let me know.

And, since I haven’t said it enough before: thank you Wikimedia for taking such a pragmatic approach to making your data available. It is an inspiration.