diving into VIAF

Featured

Last week saw a big (well big for library data nerds) announcement from OCLC that they are making the data for the Virtual International Authority File (VIAF) available for download under the terms of the Open Data Commons Attribution (ODC-BY) license. If you’re not already familiar with VIAF here’s a brief description from OCLC Research:

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF’s goal is to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF seeks to include authoritative names from many libraries into a global service that is available via the Web. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities

More specifically, the VIAF service: links national and regional-level authority records, creating clusters of related records and expands the concept of universal bibliographic control by:

  • allowing national and regional variations in authorized form to coexist
  • supporting needs for variations in preferred language, script and spelling
  • playing a role in the emerging Semantic Web

If you went and looked at the OCLC Research page you’ll notice that last month the VIAF project moved to OCLC. This is evidence of a growing commitment on OCLC’s part to make VIAF part of the library information landscape. It currently includes data about people, places and organizations from 22 different national libraries and other organizations.

Already there has been some great writing about what the release of VIAF data means for the cultural heritage sector. In particular Thom Hickey’s Outgoing is a trove of information about the project, which provides a behind-the-scense look at the various services it offers.

Rather than paraphrase what others have said already I thought I would download some of the data and report on what it looks like. Specifically I’m interested in the RDF data (as opposed to the custom XML, and MARC variants) since I believe it to have the most explicit structure and relations. The shared semantics in the RDF vocabularies that are used also make it the most interesting from a Linked Data perspective.

Diving In

The primary data structure of interest in the data dumps that OCLC has made available is what they call the cluster. A cluster is essentially a hub-and-spoke model with a resource for the person, place or organization in the middle that is attached via the spokes to conceptual resources at the participating VIAF institutions. As an example here is an illustration of the VIAF cluster for the Canadian archivist Hugh Taylor

Here you can see a FOAF Person resource (yellow) in the middle that is linked to from SKOS Concepts (blue) for Bibliothèque nationale de France, The Libraries and Archives of Canada, Deutschen Nationalbibliothek, BIBSYS (Norway) and the Library of Congress. Each of the SKOS Concepts have their own preferred label, which you can see varies across institution. This high level view obscures quite a bit of data, which is probably best viewed in Turtle if you want to see it:

<http://viaf.org/viaf/14894854>
    rdaGr2:dateOfBirth "1920-01-22" ;
    rdaGr2:dateOfDeath "2005-09-11" ;
    a rdaEnt:Person, foaf:Person ;
    owl:sameAs <http://d-nb.info/gnd/109337093> ;
    foaf:name "Taylor, Hugh A.", "Taylor, Hugh A. (Hugh Alexander), 1920-", "Taylor, Hugh Alexander 1920-2005" .

<http://viaf.org/viaf/sourceID/BIBSYS%7Cx90575046#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/BIBSYS> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/BNF%7C12688277#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/BNF> ;
    skos:prefLabel "Taylor, Hugh Alexander 1920-2005" ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/DNB%7C109337093#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/DNB> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/LAC%7C0013G3497#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/LAC> ;
    skos:prefLabel "Taylor, Hugh A. (Hugh Alexander), 1920-" ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/LC%7Cn++82148845#skos:Concept>
    a skos:Concept ;
    skos:exactMatch <http://id.loc.gov/authorities/names/n82148845> ;
    skos:inScheme <http://viaf.org/authorityScheme/LC> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .

The Numbers

The RDF Cluster Dataset http://viaf.org/viaf/data/viaf-20120422-clusters.xml.gz is 2.1G gzip compressed RDF data. Rather than it being one complete RDF/XML file, each line has a complete RDF/XML document on it, which represents a single cluster. All in all there are 20,379,541 clusters in the file.

I quickly hacked together a rdflib filter that reads the uncompressed line-oriented RDF/XML and writes the RDF as ntriples:

import sys
 
import rdflib
 
for line in sys.stdin:
    g = rdflib.Graph()
    g.parse(data=line)
    print g.serialize(format='nt').encode('utf-8'),

This took 4 days to run on my (admittedly old) laptop. If you are interested in seeing the ntriples let me know and I can see about making it available somewhere. It is 2.8G gzip compressed. An ntriples dump might be a useful version of the RDF data for OCLC to make available, since it would be easier to load into triplestores, and otherwise muck around with (more on that below) than the line oriented RDF/XML. I don’t know much about the backend that drives VIAF (has anyone seen it written up?)…but I would understand if someone said it was too expensive to generate, and was intentionally left as an exercise for the downloader.

Given its line-oriented nature, ntriples is very handy for doing analysis from the Unix command line with cut, sort, uniq, etc. From the ntriples file I learned that the VIAF RDF dump is made up of 377,194,224 assertions or RDF triples. Here’s the breakdown on the types of resources present in the data:

Resource Type Number of Resources
skos:Concept 26,745,286
foaf:Document 20,379,541
foaf:Person 15,043,112
rda:Person 15,043,112
foaf:Organization 3,722,318
foaf:CorporateBody 3,722,318
dbpedia:Place 195,472

Here’s a breakdown of predicates (RDF properties) that are used:

RDF Property Number of Assertions
rdf:type 84,851,159
foaf:focus 45,510,716
foaf:name 44,729,247
rdfs:comment 41,253,178
owl:sameAs 32,741,138
skos:prefLabel 26,745,286
skos:inScheme 26,745,286
foaf:primaryTopic 20,379,541
void:inDataset 20,379,541
skos:altLabel 16,702,081
skos:exactMatch 8,487,197
rda:dateOfBirth 5,215,150
rda:dateOfDeath 1,364,355
owl:differentFrom 1,045,172
rdfs:seeAlso 1,045,172

I’m expecting these statistics to be useful in helping target some future work I want to do with the VIAF RDF dataset (to explore what an idiomatic JSON representation for the dataset would be, shhh). In addition to the RDF, OCLC also makes a dump of link data available. It is a smaller file (239M gzip compressed) of tab delimited data, which looks like:

...
http://viaf.org/viaf/10014828   SELIBR:219751
http://viaf.org/viaf/10014828   SUDOC:052584895
http://viaf.org/viaf/10014828   NKC:xx0015094
http://viaf.org/viaf/10014828   BIBSYS:x98003783
http://viaf.org/viaf/10014828   LC:24893
http://viaf.org/viaf/10014828   NUKAT:vtls000425208
http://viaf.org/viaf/10014828   BNE:XX917469
http://viaf.org/viaf/10014828   DNB:121888096
http://viaf.org/viaf/10014828   BNF:http://catalogue.bnf.fr/ark:/12148/cb13566121c
http://viaf.org/viaf/10014828   http://en.wikipedia.org/wiki/Liza_Marklund
...

There are 27,046,631 links in total. With a little more Unix commandline-fu I was able to get some stats on the number of links by institution:

Institution Number of Links
LC NACO (United States) 8,325,352
Deutschen Nationalbibliothek (Germany) 7,732,546
SUDOC (France) 2,031,452
BIBSYS (Norway) 1,822,681
Bibliothèque nationale de France 1,643,068
National Library of Australia 977,141
NUKAT Center (Poland) 894,981
Libraries and Archives of Canada 674,088
National Library of the Czech Republic 598,848
Biblioteca Nacional de España 519,511
National Library of Israel 327,455
Biblioteca Nacional de Portugal 321,064
English Wikipedia 301,345
Vatican Library 247,574
Getty Union List of Artist Names 202,711
National Library of Sweden 161,845
RERO (Switzerland) 119,366
Istituto Centrale per il Catalogo Unico (Italy) 45,208
Swiss National Library 33,866
National Széchényi Library (Hungary) 33,727
Bibliotheca Alexandrina (Egypt) 26,877
Flemish Public Libraries 4,819
Russian State Library 997
Extended VIAF Authority 109

The 301,345 links to Wikipedia are really great to see. It might be a fun project to see how many of these links are actually present in Wikipedia, and if they can be automatically added with a bot if they are missing. I think it’s useful to have the HTTP identifier in the link dump file, as is the case for the BNF identifiers. I’m not sure why the DNB, Sweden, and LC URLs aren’t expressed URLs as well.

One other parting observation (I’m sure I’ll blog more about this) is that it would be nice if more of the data that you see in the HTML presentation were available in the RDF dumps. Specifically, it would be useful to have the Wikipedia links expressed in the RDF data, as well as linked works (uniform titles).

Anyway, a big thanks to OCLC for making the VIAF dataset available! It really feels like a major sea change in the cultural heritage data ecosystem.

way, way back

Featured

For some experimental work I’ve been talking about with Nicholas Taylor (his idea, which he or I will write about later if it pans out) I’ve gotten interested in programmatic ways of seeing when a URL is available in a web archive. Of course there is the Internet Archive‘s collection; but what isn’t as widely known perhaps is that web archiving is going on around the world at a smaller scale, often using similar software, and often under the auspices of the International Internet Preservation Consortium.

Nicholas pointed me at some work around Memento, which provides a proxy-like API in front of some web archives. If you aren’t already familiar with it, Memento is some machinery, or a REST API for deterimining when a given URL is available in a Web Archive–which is pretty useful. Of course, like many standardization efforts it relies on people actually implementing it. For Web Architecture folks, the core idea in Memento is pretty simple; but I think its core simplicity may be obscured from software developers who need to fully digest the spec in order to say they “do” Memento.

Meanwhile a lot of web archives have used the Wayback Machine from the Internet Archive to provide a human interface to the archived web content. While looking at the memento-server code I was surprised to learn that the Wayback can also return structured data about what URLs have been archived. For example, you can see what content the Internet Archive has for the New York Times homepage by visiting:

http://wayback.archive.org/web/xmlquery?url=http://www.nytimes.com

which returns a chunk of XML like:

< ?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <numreturned>4425</numreturned>
    <type>urlquery</type>
    <enddate>20120503151837</enddate>
    <numresults>4425</numresults>
    <firstreturned>0</firstreturned>
    <url>nytimes.com/</url>
    <resultsrequested>40000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>68043717</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>IA-001766.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961112181513</capturedate>
    </result>
    <result>
      <compressedoffset>8107</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BK-000007.arc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>nytimes.com/</urlkey>
      <digest>GY3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.nytimes.com:80/</url>
      <capturedate>19961121230155</capturedate>
    </result>
    ...
  </results>
</wayback>

Sort of similarly you can see what the British Library’s Web Archive has for the BBC homepage by visiting:

http://www.webarchive.org.uk/wayback/archive/*/http://www.bbc.co.uk/

Where you will see:

< ?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19910806145620</startdate>
    <numreturned>201</numreturned>
    <type>urlquery</type>
    <enddate>20120503152750</enddate>
    <numresults>201</numresults>
    <firstreturned>0</firstreturned>
    <url>bbc.co.uk/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>75367408</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-196764-0.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:b155b8dd868c17748405b7a8d2ee3606efea1319ee237507055f258189c0f620c38d2c159fc4e02211c1ff6d265f45e17ae7eb18f94a5494ab024175fe6f79c3</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080410162445</capturedate>
    </result>
    <result>
      <compressedoffset>92484146</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>BL-7307314-46.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>bbc.co.uk/</urlkey>
      <digest>sha512:6e37c62b3aa7b60cccc50d430bc7429ecf0d2662bca5562b61ba0bc1027c824a2f7526c747bfca52db46dba5a2ae9c9d96d013e588b2ae5d78188ea4436c571f</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>http://www.bbc.co.uk/</url>
      <capturedate>20080527231330</capturedate>
    </result>
    ...
  </results>
</wayback>

It turns out British Library are using this structured data to access data from Hadoop where their web archives live on HDFS as WARC files–which is pretty slick. Actually WARCs on spinning disk is pretty awesome by itself, no matter how you are doing it.

Unfortunately I wasn’t able to make it to the International Internet Preservation Consortium meeting going on right now at the Library of Congress. I’m at home heating bottles, changing diapers, and dozing off in comfy chairs with a boppy around my waist. If I was there I think I would be asking:

  1. Is there a list of Wayback Machine endpoints that are on the Web? There are multiple ones at the California Digital Library, the Library of Congress, and elsewhere I bet.
  2. How many of them are configured to make this XML data available? Can it easily be turned on for those that don’t have it?
  3. Rather than requiring people to implement a new standard to improve interoperability, could we document the XML format that Wayback can already emit, and share the endpoints? This way web archives that don’t run Heretrix and Wayback could also share what content they have collected in the same community.

This isn’t to say that Memento isn’t a good idea (I think it is). I just think there might be some quick wins to be had by documenting and raising awareness about things that are already working away quietly behind the scenes. Perhaps the list of Wayback endpoints could be added to the Wikipedia page?

Ok, enough for now. I have a bottle to heat up :-)

Lessons of JSON

Featured

A recent (and short) IEEE Computing Conversations interview with Douglas Crockford about the development of JavaScript Object Notation (JSON) offers some profound, and sometimes counter-intuitive, insights into standards development on the Web.

Pave the Cowpaths

I don’t claim to have invented it, because it already existed in nature. I just saw it, recognized the value of it, gave it a name, and a description, and showed its benefits. I don’t claim to be the only person to have discovered it.

Crockford is likeably humble about the origins of JSON. Rather than claiming he invented JSON he instead says he discovered it–almost as if he was a naturalist on an expedition in some uncharted territory. Looking at the Web as an ecosystem with forms of life in it might seem more like a stretched metaphor or sci-fi plot; but I think it’s a useful and pragmatic stance. The Web is a complex space, and efforts to forcibly make it move in particular directions often fail, even when big organizations and trans-national corporations are behind them. Grafting, aligning and cross-fertilizing technologies, while respecting the communities that they grow from will probably feel more chaotic, but it’s likely to yield better results in the long run.

Necessity is the Mother of Invention

Crockford needed JSON when building an application where a client written in JavaScript needed to communicate with a server written in Java. He wanted something simple that would let him solve a real need he had right in front of him. He didn’t want the client to have to compose a query for the server, have the server perform the query against a database, and return something to the client that in turn needed to be queried again. He wanted something where the data serialization matched the data structures available to both programming language environments. He wanted something that made his life easier. Since Crockford was actually using JSON in his work it has usability at its core.

Less is More

Crockford tried very hard to strip unnecessary stuff from JSON so it stood a better chance of being language independent. When confronted with push back about JSON not being a “standard” Crockford registered json.org, put up a specification that documented the data format, and declared it as a standard. He didn’t expend a lot of energy in top-down mode, trying to convince industry and government agencies that JSON was the way to go. Software developers discovered it, and started using it in their applications because it made their life easier too. Some people complained that the X in AJAX stood for XML, and therefore JSON should not be used. But this dogmatism faded into the background over time as the benefits of JSON were recognized.

JSON is not versioned, and has no mechanism for revision. JSON cannot change. This probably sounds like heresy to many folks involved in standardization. It is more radical than the WHATWG’s decision to remove the version number from HTML5. It may only be possible because Crockford focused so much on keeping the JSON specification so small and tight. Crockford is realistic about JSON not solving all data representation problems, and speculates that we will see use cases that JSON does not help solve. But when this happens something new will be created instead of extending JSON. This relieves the burden of backwards compatibility that can often drag projects down into a quagmire of complexity. Software designed to work with JSON will always work with whatever valid JSON is found in the wild.

Anyhow

Don’t listen to me, go watch the interview!

on not linking

Featured

NPR Morning Edition recently ran an interview with Teju Cole about his most recent project called Small Fates. Cole is the recipient of the 2012 Hemingway Foundation/PEN Award for his novel Open City. Small Fates is a series of poetic snippets from Nigerian newspapers, which Cole has been posting on Twitter. It turns out Small Fates draws on a tradition of compressed news stories known as fait divers. The interview is a really nice description of the poetic use of this material to explore time and place. In some ways it reminds me a little of the cut-up technique that William S. Bouroughs popularized; albeit in a more lyrical, less dadaist form.

At about the 3 minute mark in the interview Cole mentions that he has recently been using content from historic New York newspapers using Chronicling America. For example:

Chronicling America is a software project I work on. Of course we were all really excited to hear Cole mention us on NPR. One thing that we were wondering is whether he could include shortened URLs to the newspaper page in Chronicling America in his tweets. Obviously this would be a clever publicity vehicle for the NEH funded National Digital Newspaper Program. It would also allow the Small Fates reader to follow the tweet to the source material, if they wanted more context.

Through the magic of Facebook, Twitter, good old email and Teju’s generosity I got in touch with him to see if he would be willing to include some shortened Chronicling America URLs in his tweets. His response indicated that he had clearly already thought about linking, but had decided not to. His reasons for not linking struck me as really interesting, and he agreed to let me quote them here:

I can’t include links directly in my tweets for three reasons.

The first is aesthetic: I like the way the tweets look as clean sentences. One wouldn’t wish to hyperlink a poem.

The second is artistic: I want people to stay here, not go off somewhere else and crosscheck the story. Why go through all the trouble of compression if they’re just going to go off and read more about it? What’s omitted from a story is, to me, an important part of a writer’s storytelling strategy.

And the third is practical: though I seldom use up all 140 characters, rarely do I have enough room left for a url string, even a shortened one.

I really loved this artistic (and pragmatic) rationale for not linking, and thought you might too.

Curating Curation

Featured

This post was composed over at Storify and exported here.

NoDB

Featured

Last week Liam Wyatt emailed me asking if I could add The National Museum of Australia to Linkypedia, which tracks external links from Wikipedia articles to specific websites. Specifically Liam was interested in seeing the list of articles that reference the National Museum, sorted by how much they are viewed at Wikipedia. This presented me with two problems:

  1. I turned Linkypedia off a few months ago, since the site hadn’t been seeing much traffic, and I have not yet figured out how to keep the site going on the paltry Linode VPS I’m using for other things like this blog.
  2. I hadn’t incorporated Wikipedia page view statistics into Linkypedia, because I didn’t know they were available, and even if I had I didn’t have Liam’s idea of using them in this way.

#1 was easily rectified since I still had the database lying around and had just disabled the Linkypedia Apache vhost. I brought it up, and added www.nma.gov.au to the list of hosts to monitor. #2 was surprisingly easy to solve as well, with a somewhat hastily written Django management command that pulls down the hourly stats and rifles through them looking for links to sites that Linkypedia manages. Incidentally the Python requests library makes efficiently iterating through large amounts of gzip compressed text at a URL very simple indeed.

After the management command had run for a few days and the stats for 2012 had been added to the Linkypedia database, I was able to see that the top 10 most accessed Wikipedia articles that linked to the National Museum were: Mick Jagger, Cricket, Kangaroo, Victorian Era, Ned Kelly, James Cook, Thylacine, Indigenous Australians and Playing Card and Emu … if you are curious the full list is available with counts, as well as similar lists for the Museum of Modern Art, the Biodiversity Heritage Library, Wikileaks, Thomas and other sites which Linkypedia also monitors somewhat sporadically.

So this was interesting…but it’s not actually what I set out to write about today.

While I had seen aggregate reports of Wikipedia page view data, prior to Liam’s email I didn’t know that hourly dumps of page view statistics existed. I recently did a series of experiments with realtime Wikipedia data, so naturally I wondered what might be do-able with the page-view stats. The data is gzip compressed and space delimited, which made it a perfect fit for noodling around in the Unix shell with curl, cut, sort, etc. Before much time passed I had a bash script that could run from cron every hour and dump out the top 1000 accessed English Wikipedia pages as a JSON file:

#!/bin/bash
 
# fetch.sh is a shell script to run from cron that will download the latest 
# hour's page view statistics and write out the top 1000 to a JSON file
# you will probably want to run it sufficiently after the top of the hour
# so that the file is likely to be there, e.g.
#
# 30 * * * * cd /home/ed/Projects/wikitrends/; ./fetch.sh
 
# first, get some date info
 
year=`date -u +%Y`
month=`date -u +%m`
day=`date -u +%d`
hour=`date -u +%H`
 
# this is the expected URL for the pagecount dump file
 
url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0000.gz" 
 
# sometimes the filenames have a timestamp of 1 second instead of 0 
# so if 0000.gz isn't there try using 0001.gz instead
 
curl -f -s -I $url > /dev/null
retval=$?
if [ $retval -ne 0 ]; then
    url="http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/pagecounts-$year$month$day-${hour}0001.gz" 
fi
 
# create a directory and filename for the JSON
 
json_dir="data/$year/$month/$day"
json_file="$json_dir/$hour.json"
mkdir -p $json_dir
 
# fetch the data and write out the stats!
 
curl --silent $url | \
    gunzip -c | \
    egrep '^en ' | \
    perl -npe '@cols=split/ /; print "$cols[2] $cols[1]\n";' | \
    sort -rn | \
    head -n 1000 | \
    ./jsonify.py > $json_file

And a short time later after that I had some pretty simple HTML and JavaScript that could use the JSON to display the top 25 accessed Wikipedia articles from the last hour.

Which I put it up on GitHub as Wikitrends. It was kind of surprising at first to see how prevalent mainstream media (mostly television) figures into the statistics. Perhaps it was a function of me working on the code at night when “normal” people are typically watching TV and looking up stuff related to the shows they are looking at. I did notice some odd things pop up in the list occasionally and found myself googling to see if there was recent news on the topic.

To help provide some context I added flyovers so that when you hover over the article title you will see the summary of the Wikipedia article. Behind the scenes the JavaScript looks up the article using the Wikipedia API and extracts the summary. This got me thinking that it could also be useful to include some links to canned searches at Google (last hour), Twitter and Facebook to provide context for the spike that the Wikipedia article is seeing. Perhaps it would be more interesting to see this information flow by somehow on the side…

A nice side effect of this minimalist (hereby dubbed NoDB…take that NoSQL!) approach to developing the Wikitrends app is that I have uniquely named, URL addressable JSON files for the top 1000 accessed English Wikipedia articles every hour at http://inkdroid.org/wikitrends/data. Even better, the JSON files even get archived at GitHub. Now don’t take this seriously, it’s late (or early) and I’m really just making a really lame joke. I’m sure having a database would be useful for trend analysis and whatnot. But for my immediate purposes it wasn’t really needed.

So, Wikitrends is another Wikiepdia stats curiosity app. At the very least I got a chance to use JavaScript a bit more seriously by working with Underscore and the very slick async library. Perhaps there are some ways you can think of to make Wikitrends more useful or interesting. If so please let me know.

And, since I haven’t said it enough before: thank you Wikimedia for taking such a pragmatic approach to making your data available. It is an inspiration.

2011 Musics

Featured

2011 was the year of streaming music for me–specifically using Rdio. Being able to follow what friends and folks I admire are listening to, easily listen along, and then build my own online collection in the cloud was a revelation. Being able to easily do it from my desktop at home, or at work, or on my mobile device for $5/month was just astounding. The world ain’t perfect, but this is damn near close.

Anyhow, here’s some of my favorite music from 2011, in no particular order … a lot of which I probably wouldn’t have listened to if it wasn’t streamable on the Web. You might have to wait a few seconds while the YouTube clips load.

genealogy of a typo

Featured

I got a Kindle Touch today for Christmas–thanks Kesa! Admittedly I’m pretty late to this party. As I made ready to purchase my first ebook I hopped over to my GoodReads to-read list, to pick something out. I scanned the list quickly, and my eye came to rest on Stephen Ramsey‘s recent book Reading Machines. But I got hung up on something irrelevant: the subtitle was Toward and Algorithmic Criticism instead of Toward an Algorithmic Criticism, the latter of which is clearly correct based on the cover image.

Having recently looked at API services for book data I got curious about how the title appeared on other popular web properties, such as Amazon:

GoogleBooks:

Barnes & Noble:

and LibraryThing

I wasn’t terribly surprised not to find it on OpenLibrary. But it does seem interesting that the exact same typo is present on all these book websites as well, while the title appears correct on the publisher’s website:

and at OCLC:

It’s hard to tell for sure, but my guess is that Amazon, Barnes & Noble, and GoogleBooks got the error from Bowker Link (the Books in Print data service), and that LibraryThing then picked up the data from Amazon, and similarly GoodReads picked up the data from GoogleBooks. LibraryThing can pull data from a variety of sources, including Amazon; and I’m not entirely sure where GoodReads gets their data from, but it seems likely that it comes from the GoogleBooks API given other tie-ins with Google.

If you know more about the lineage of data in these services I would be interested to hear it. Specifically if you have a subscription to BowkerLink it would be great if you could check the title. It would be nice to live in a world where these sorts of data provenance issues were easier to read.

polling and pushing with the Times Newswire API

Featured

I’ve been continuing to play around with Node.js some. Not because it’s the only game in town, but mainly because of the renaissance that’s going in the JavaScript community, which is kind of fun and slightly addictive. Ok, I guess that makes me a fan-boy, whatever…

So the latest in my experiments is nytimestream, which is a visualization (ok, it’s just a list) of New York Times headlines using the Times Newswire API. When I saw Derek Willis recently put some work into a Ruby library for the API I got to thinking what it might be like to use Node.js and Socket.IO to provide a push stream of updates. It didn’t take too long. I actually highly doubt anyone is going to use nytimestream much. So you might be wondering why I bothered to create it at all. I guess it was kind more of an academic exercise than anything to reinforce some things that Node.js has been teaching me.

Normally if you wanted a web page to dynamically update based on events elsewhere you’d have some code running in the browser routinely poll a webservice for updates. In this scenario our clients (c1, c2 and c3) poll the Times Newswire directly:

But what happens if lots of people start using your application? Yup, you get lots of requests going to the web service…which may not be a good thing, particularly if you are limited to a certain number of requests per day.

So a logical next step is to create a proxy for the webservice, which will reduce hits on the Times Newswire API.

But still, the client code needs to poll for updates. This can result in the proxy web service needing to field lots of requests as the number of clients increases. You can poll less, but that will diminish the real time nature of your app. If you are interested in having the real time updates in your app in the first place this probably won’t seem like a great solution.

So what if you could have the proxy web service push updates to the clients when it discovers an update?

This is basically what an event-driven webservice application allows you to do (labelled NodeJS in the diagram above). Node’s Socket.IO provides a really nice abstraction around streaming updates in the browser. If you view source on nytimestream you’ll see a bit of code like this:

var socket = io.connect();
socket.on('connect', function() {
  socket.on('story', function(story) {
    addStory(story);
    removeOld();
    fadeList();
  });
});

story is a JavaScript object that comes directly from the proxy webservice as a chunk of JSON. I’ve got the app running on Heroku, which currently recommends Socket.IO be configured to only do long polling (xhr-polling). Socket.IO actually supports a bunch of other transports suitable for streaming, including web sockets. xhr-polling basically means the browser keeps a connection open to the server until an update comes down, after which it quickly reconnects to wait for the next update. This is still preferable to constant polling, especially for the NYTimes API which often sees 15 minutes or more go by without an update. Keeping connections open like this can be expensive in more typical web stacks where each connection translates into a thread or process. But this is what Node’s non-blocking IO programming environment fixes up for you.

Just because I could, I added a little easter egg view in nytimestream, which allows you to see new stories come across the wire as JSON when nytimestream discovers them. It’s similar to Twitter’s stream API in that you can call it with curl. It’s different in that, well, there’s hardly the same amount of updates. Try it out with:

curl http://nytimestream.herokuapp.com/stream/

The occasional newlines are there to prevent the connection from timing out.

dcat:distribution considered helpful

Featured

The other day I happened to notice that the folks at data.gov.uk have started using the Data Catalog Vocabulary in the RDFa they have embedded in their dataset webpages. As an example here is the RDF you can pull out of the HTML for the Anonymised MOT tests and results dataset. Of particular interest to me is that the dataset description now includes an explicit link to the actual data being described using the dcat:distribution property.

     <http://data.gov.uk/id/dataset/anonymised_mot_test> dcat:distribution
         <http://www.dft.gov.uk/data/download/10007/DOC>,
         <http://www.dft.gov.uk/data/download/10008/ZIP>,
         <http://www.dft.gov.uk/data/download/10009/GZ>,
         <http://www.dft.gov.uk/data/download/10010/GZ>,
         <http://www.dft.gov.uk/data/download/10011/GZ>,
         <http://www.dft.gov.uk/data/download/10012/GZ>,
         <http://www.dft.gov.uk/data/download/10013/GZ>,
         <http://www.dft.gov.uk/data/download/10014/GZ>,
         <http://www.dft.gov.uk/data/download/10015/GZ>,
         <http://www.dft.gov.uk/data/download/10016/GZ>,
         <http://www.dft.gov.uk/data/download/10017/GZ>,
         <http://www.dft.gov.uk/data/download/10018/GZ>,
         <http://www.dft.gov.uk/data/download/10019/GZ>,
         <http://www.dft.gov.uk/data/download/10020/GZ>,
         <http://www.dft.gov.uk/data/download/10021/GZ>,
         <http://www.dft.gov.uk/data/download/10022/GZ> .

Chris Gutteridge happened to see a Twitter message of mine about this, and asked what consumes this data, and why I thought it was important. So here’s a brief illustration. I reran a little python program I have that crawls all of the data.gov.uk datasets, extracting the RDF using rdflib’s RDFa support (thanks Ivan). Now there are 92,550 triples (up from 35,478 triples almost a year ago).

So what can you do with the this metadata about datasets? I am a software developer working in the area where digital preservation meets the web. So I’m interested in not only getting the metadata for these datasets, but also the datasets themselves. It’s important to enable 3rd party, automated access to datasets for a variety of reasons; but the biggest one for me can be summarized with the common-sensical: Lots of Copies Keep Stuff Safe.

It’s kind of a no-brainer, but copies are important for digital preservation, when the unfortunate happens. The subtlety is being able to know where the copies of a particular dataset are in the enterprise, in a distributed system like the Web, and the mechanics for relating them together. It’s also important for scholarly communication, so that researchers can cite datasets and follow citations of other research to the actual dataset it is based upon. And lastly aggregation services that collect datasets for dissemination on a particular platform, like data.gov.uk, need ways to predictably sweep domains for datasets that needs to be collected.

Consider this practical example: as someone interested in digital preservation I’d like to be able to know what format types are used within the data.gov.uk collection. Since they have used the dcat:distribution property to point at the referenced dataset, I was able to write a small Python program to crawl the datasets and log the media type and HTTP status code along the way, to generate some results like:

media type datasets
text/html 5898
application/octet-stream 1266
application/vnd.ms-excel 874
text/plain 234
text/csv 220
application/pdf 167
text/xml 81
text/comma-separated-values 51
application/x-zip-compressed 36
application/vnd.ms-powerpoint 33
application/zip 31
application/x-msexcel 28
application/excel 21
application/xml 18
text/x-comma-separated-values 14
application/x-gzip 13
application/x-bittorrent 12
application/octet_stream 12
application/msword 10
application/force-download 10
application/x-vnd.oasis.opendocument.presentation 9
application/x-octet-stream 9
application/vnd.excel 9
application/x-unknown-content-type 6
application/xhtml+xml 6
application/vnd.msexcel 5
application/vnd.google-earth.kml+xml kml 5
application/octetstream 4
application/csv 3
vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/octet-string 2
image/jpeg 1
image/gif 1
application/x-mspowerpoint 1
application/vnd.google-earth.kml+xml 1
application/powerpoint 1
application/msexcel 1

Granted some of these aren’t too interesting. The predominance of text/html is largely an artifact of using dcat:distribution to link to the splash page for the dataset, not to the dataset itself. This is allowed by the dcat vocabulary … but dcat’s approach kind of assumes that the domain of the assertion is suitably typed as a dcat:Download, dcat:Feed or dcat:WebService. I personally think that dcat has some issues that make it a bit more difficult to use than I’d like. But it’s extremely useful that data.gov.uk are kicking the tires on the vocabulary, so that kinks like this can be worked out.

The application/octet-stream media-type (and its variants) are also kind of useless for these purposes, since it basically says the dataset is made of bits. It would be more helpful if the servers in these cases could send something more specific. But it ought to be possible to use something like JHOVE or DROID to do some post-hoc analysis of the bitstream to figure out just what this data is, if it is valid etc.

The nice thing about using the Web to publish these datasets and their descriptions is that this sort of format analysis application could be decoupled from the data.gov.uk web publishing software itself. data.gov.uk becomes a clearinghouse for information and whereabouts of datasets, but a format verification service can be built as an orthogonal application. I think it basically fits the RESTful style of Curation Microservices being promoted by the California Digital Library:

Micro-services are an approach to digital curation based on devolving curation function into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally as important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex function needed for effective curation emerges from the strategic combination of individual services.

One last thing before you are returned to your regular scheduled programming. You may have noticed that the URI for the dataset being described in the RDF is different from the URL for the HTML view for the resource. For example:

http://data.gov.uk/id/dataset/anonymised_mot_test

instead of:

http://data.gov.uk/dataset/anonymised_mot_test

This is understandable given some of the dictums about Linked Data and trying to separate the Information Resource from the Non-Information Resource. But it would be nice if the URL resolved via a 303 redirect to the HTML as the Cool URIs for the Semantic Web document prescribes. If this is going to be the identifier for the dataset it’s important that it resolves so that people and automated agents can follow their nose to the dataset. I think this highlights some of the difficulties that people typically face when deploying Linked Data.

on preserving bookmarks

Featured

While it’s not exactly clear what the future of Delicious is, the recent news about Yahoo closing the doors or selling the building prompted me to look around at other social bookmarking tools, and to revisit some old stomping grounds.

Dan Chudnov has been running Unalog since 2003 (roughly when Delicious started). In fact I can remember Dan and Joshua Schacter having some conversations about the idea of social bookmarking as both of the services co-evolved. So my first experience with social bookmarking was on Unalog, but a year or so later I ended up switching to Delicious in 2004 for reasons I can’t quite remember. I think I liked some of the tools that had sprouted up around Delicious, and felt a bit guilty for abandoning Unalog.

Anyhow, I wanted to take the exported Delicious bookmarks and see if I could get them into Unalog. So I set up a dev Unalog environment, created a friendly fork of Dan’s code, and added the ability to POST a chunk of JSON:

    curl --user user:pass \
         --header "Content-type: application/json" \
         --data '{"url": "http://example.com", "title": "Example"}' \
         http://unalog.com/entry/new

Here’s a fuller example of the JSON that you can supply:

    {
      "url": "http://zombo.com",
      "title": "ZOMBO",
      "comment": "found this awesome website today",
      "tags": "website awesome example",
      "content": "You can do anything at Zombo.com. The only limit is yourself. Etc...",
      "is_private": true
    }

The nice thing about Unalog is that if you supply it (content), Unalog will index the text of the resource you’ve bookmarked. This allows you to do a fulltext search over your bookmarked materials.

So yeah, to make a long story a bit shorter, I created a script that reads in the bookmarks from a Delicious bookmark export (an HTML file) and pushes them up to a Unalog instance. Since the script GETs the bookmark URL to send Unalog the content to index you also get a log which contains the HTTP Status Code that provides the fodder for a linkrot report like:

200 OK 4546
404 Not Found 367
403 Forbidden 141
DNS failure 81
Connection refused 28
500 Internal Server Error 19
503 Service Unavailable 10
401 Unauthorized 9
410 Gone 5
302 Found 5
400 Bad Request 4
502 Bad Gateway 2
412 Precondition Failed 1
300 Multiple Choices 1
201 Created 1

That was 5,220 bookmarks total collected over 5 years–which initially seemed low, until I did the math and figured I did 3 bookmarks a day on average. If we lump all the non-200 OK responses, that amounts to 13% linkrot. At first blush this seems significantly different compared to the research done by Spinelli from 2003 (thanks Mike) which reported 28% linkrot. I would’ve expected the Spinelli results to be better than my haphazard bookmark collection since he was sampling academic publications on the Web. But he was also sampling links from the 1995-1999 period, while I had links from 2005-2010. I know this is mere conjecture, but maybe we’re learning to do things better on the web w/ Cool URIs. I’d like to think so at least. Maybe a comparison with some work done by folks at HP and Microsoft would provide some insight.

At the very least this was a good reminder of how important this activity of pouring data from one system into another is to digital preservation. What Greg Janée calls relay-supporting preservation.

Most of all I want to echo the comments of former Yahoo employee Stephen Hood who wrote recently about the value of this unique collection of bookmarks to the web community. If for some reason Yahoo were to close the doors on Delicious it would be great if they could donate the public bookmarks to the Web somehow, either via a public institution like the Smithsonian or the Library of Congress (full disclosure I work at the Library of Congress in a Digital Preservation group), or to an organization dedicated to the preservation of the Web like the International Internet Preservation Consortium of which LC, other National Libraries, and the Internet Archive are members.

Wikipedia 10

Featured

Wikipedia is turning ten years old on January 15th, and celebratory gatherings are going around the globe, including one in my area (Washington DC) on January 22 at the National Archives.

Like you, I’ve been an accidental user of Wikipedia when searching for a topic to learn more about. Over time I have started actively searching on Wikipedia for things, linking to Wikipedia articles (in email and HTML), and even saying thank you with a small donation. I’ve only attended one of the local DC chapter events before, but am definitely planning on attending the DC event to meet other people who value Wikipedia as a resource.

Perhaps also like you (since you are reading this blog) I also work in a cultural heritage organization, well a library to be precise. I wasn’t able to attend the recent Galleries, Libraries, Archives, Museums & Wikimedia conference at the British Museum last November. But I have been listening to the audio that they kindly provided recently with great interest. If you are interested in the role that cultural heritage organizations can play on Wikipedia, and the Web in general definitely check it out if you’ve got a long commute (or time to kill) and a audio device of some kind. There are lots of museums, galleries, archives and libraries in the DC area, so I’m hoping that this event on the 22nd will be an opportunity for folks to get together across institutional lines to talk about how they are engaging with the Wikipedia community. Who knows maybe it could be a precursor to a similar to GLAM-WIKI here in DC?

I’m planning on doing a lightning talk about this side/experimental project I’ve been working on called linkypedia. The basic idea is to give web publishers (and specifically cultural heritage organizations like libraries, museums, archives, etc) an idea of how their content is being used as primary resource material on Wikipedia. The goal is to validate the work that these institutions have done to make this content available, and for them to do more…and also to engage with the Wikipedia community. Version 1 of the (opensource) software is running at here on my minimal Linode VPS. But I’m also working on version 2, which will hopefully scale a bit better, and provide a more global (not just English Wikipedia) and real time picture of how your stuff is being used on Wikipedia. Part of the challenge is figuring out how to pay for it, given the volume of external links in the major language Wikipedias. I’m hoping a tip-jar and thoughtful use of Amazon EC2 will be enough.

If you are interested in learning more about the event on the 22nd check out Katie Filbert’s recent post to the Sunlight Labs Discussion List, and the (ahem) wiki page to sign up! Thanks Mark for letting me know about the birthday celebration last week in IRC. Sometimes with all the discussion lists, tweets, and blogs things like this slip by without me noticing them. So a direct prod in IRC helps :-)

wikipedia external links: a redis database

Featured

As part of my continued meandering linkypedia v2 experiments I created a Redis database of high level statistics about host names and top-level-domain names in external links from Wikipedia articles. Tom Morris happened to mention he has been loading the external links as well (thanks Alf), so I thought I’d make the redis database dump available to anyone that is interested in looking at it. If you want to give it a whirl try this out:

% wget http://inkdroid.org/data/wikipedia-extlinks.rdb
% sudo aptitude install redis-server
% sudo mv wikipedia-extlinks.rdb /var/lib/redis/dump.rdb
% sudo chown redis:redis /var/lib/redis/dump.rdb
% sudo /etc/init.d/redis-server restart
% sudo pip install redis # or easy_install (version in apt is kinda old)
% python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> r = redis.Redis()
>>> r.zrevrange("hosts", 0, 25, True)
[('toolserver.org', 2360809.0), ('www.ncbi.nlm.nih.gov', 508702.0), ('dx.doi.org', 410293.0), ('commons.wikimedia.org', 408986.0), ('www.imdb.com', 398877.0), ('www.nsesoftware.nl', 390636.0), ('maps.google.com', 346997.0), ('books.google.com', 323111.0), ('news.bbc.co.uk', 214738.0), ('tools.wikimedia.de', 181215.0), ('edwardbetts.com', 168102.0), ('dispatch.opac.d-nb.de', 166322.0), ('web.archive.org', 165665.0), ('www.insee.fr', 160797.0), ('www.iucnredlist.org', 155620.0), ('stable.toolserver.org', 155335.0), ('www.openstreetmap.org', 154127.0), ('d-nb.info', 141504.0), ('ssd.jpl.nasa.gov', 137200.0), ('www.youtube.com', 133827.0), ('www.google.com', 131011.0), ('www.census.gov', 124182.0), ('www.allmusic.com', 117602.0), ('maps.yandex.ru', 114978.0), ('news.google.com', 102111.0), ('amigo.geneontology.org', 95972.0)]
>>> r.zrevrange("hosts:edu", 0, 25, True)
[('nedwww.ipac.caltech.edu', 28642.0), ('adsabs.harvard.edu', 25699.0), ('animaldiversity.ummz.umich.edu', 21747.0), ('www.perseus.tufts.edu', 20438.0), ('genome.ucsc.edu', 20290.0), ('cfa-www.harvard.edu', 14234.0), ('penelope.uchicago.edu', 9806.0), ('www.bucknell.edu', 8627.0), ('www.law.cornell.edu', 7530.0), ('biopl-a-181.plantbio.cornell.edu', 5747.0), ('ucjeps.berkeley.edu', 5452.0), ('plato.stanford.edu', 5243.0), ('www.fiu.edu', 5004.0), ('www.volcano.si.edu', 4507.0), ('calphotos.berkeley.edu', 4446.0), ('www.usc.edu', 4345.0), ('ftp.met.fsu.edu', 3941.0), ('web.mit.edu', 3548.0), ('www.lpi.usra.edu', 3497.0), ('insects.tamu.edu', 3479.0), ('www.cfa.harvard.edu', 3447.0), ('www.columbia.edu', 3260.0), ('www.yale.edu', 3122.0), ('www.fordham.edu', 2963.0), ('www.people.fas.harvard.edu', 2908.0), ('genealogy.math.ndsu.nodak.edu', 2726.0)]
>>> r.zrevrange("tlds", 0, 25, True)
[('com', 11368104.0), ('org', 7785866.0), ('de', 1857158.0), ('gov', 1767137.0), ('uk', 1489505.0), ('fr', 1173624.0), ('ru', 897413.0), ('net', 868337.0), ('edu', 793838.0), ('jp', 733995.0), ('nl', 707177.0), ('pl', 590058.0), ('it', 486441.0), ('ca', 408163.0), ('au', 387764.0), ('info', 296508.0), ('br', 276599.0), ('es', 242767.0), ('ch', 224692.0), ('us', 179223.0), ('at', 163397.0), ('be', 132395.0), ('cz', 92683.0), ('eu', 91671.0), ('ar', 89856.0), ('mil', 87788.0)]
>>> r.zscore("hosts", "www.bbc.co.uk")
56245.0

Basically there are a few sorted sets in there:

  • “hosts”: all the hosts sorted by the number of externallinks
  • “hosts:%s”: where %s is a top level domain (“com”, “uk”, etc)
  • “tlds”: all the tlds sorted by the number of externallinks
  • “wikipedia”: the wikipedia langauges sorted by total number of externallinks

I’m not exactly sure how portable redis databases are but I was able to move it between a couple Ubuntu machines and Ryan successfully looked at it on a Gentoo box he had available. You’ll need roughly 300MB of RAM available. I must say I was impressed with redis and in particular sorted sets for this stats collection task. Thanks to Chris Adams for pointing me in the direction of redis in the first place.

wikixdc

Featured

Wikipedia’s 10th Birthday Party at the National Archives in Washington DC on Saturday was a lot of fun. Far and away, the most astonishing moment for me came early in the opening remarks by David Ferriero, the Archivist of the United States, when he stated (in no uncertain terms) that he was a big fan of Wikipedia, and that it was often his first go-to for information. Not only that, but when discussion about a bid for a DC WikiMania (the Wikipedia Annual Conference) came up later in the morning, Ferriero suggested that the National Archives would be willing to host it if it came to pass. I’m not sure if anything actually came of this later in the day–a WikiMania in DC would be incredible. It was just amazing to hear the Archivist of the United States be supportive of Wikipedia as a reference source…especially as stories of schools, colleges and universities rejecting Wikipedia as a source are still common. Ferriero’s point was even more poignant with several high schoolers in attendance. Now we all can say:

If Wikipedia is good enough for the Archivist of the United States, maybe it should be good enough for you.

Another highlight for me was meeting Phoebe Ayers, who is a reference librarian at UC Davis, member of the Wikimedia Foundation Board of Trustees, and author of How Wikipedia Works. I strong armed Phoebe into signing my copy (I bought this copy on Amazon after it was de-accessioned from Cuyahoga County Public Library in Parma, Ohio ). Phoebe has some exciting ideas for creating collaborations between libraries and Wikipedia, which I think fit quite well into the Galleries, Libraries, Archives and Museuems (GLAM) effort within Wikipedia. I think she is still working on how to organize the effort.

Later in the day we heard how the National Archives is thinking of following the lead of the British Museum and establishing a Wikipedian in Residence. Liam Wyatt, the first Wikipedian in Residence, put a human face on Wikipedia for the British Museum, and familiarized museum staff with editing Wikipedia, through activities like the Hoxne Challenge. Having a Wikipedia in Residence at the National Archives (and who knows maybe the Smithsonian and the Library of Congress) would be extremely useful I think.

In a similar vein, Sage Ross spoke at length about the Wikipedia Ambassador Program. The Ambassador Program is a formal way for folks to represent Wikipedia in academic settings (universities, high schools, etc). Ambassadors can get training in how to engage with Wikipedia (editing, etc) and can help professors and teachers who want to integrate Wikipedia into their curriculum, and scholarly activities.

I got to meet Peter Benjamin Meyer of the Bureau of Labor Statistics, who has some interesting ideas for aggregating statistical information from federal statistical sources, and writing some bots that will update article info-boxes for places in the United States. The impending release of the 2010 US Census Data has the Wikipedia community discussing the best way to update the information that was added by a bot for the 2000 census. It seemed like Peter might be able to piggy back some of his efforts on this work that is going on at Wikipedia for the 2010 Census.

Jyothis Edthoot an Oracle employee and Wikipedia Steward gave me a behind the scenes look at the tools he and others in Counter Vandalism Unit use to keep Wikipedia open for edits from anyone in the world. I also got to meet Harihar Shankar from Herbert van de Sompel’s team at Los Alamos National Lab, and to learn more about the latest developments with Memento, which he gave a lightning talk about. I also ran into Jeanne Kramer-Smyth of the World Bank, and got to hear about their efforts to provide meaningful access to their document collections to web crawlers using their metadata.

I did end up giving a lightning talk about Linkypedia (slides on the left). I was kind of rushed, and I wasn’t sure that this was exactly the right audience for the talk (being mainly Wikipedians instead of folks from the GLAM sector). But it helped me think through some of the challenges in expressing what Linkypedia is about, and who it is for. All in all it was a really fun day, with a lot of friendly folks interested in the Wikipedia community. There must’ve been at least 70 people there on a very cold Saturday–a promising sign of good things to come for collaborations between Wikipedia and the DC area.

OCLC’s mapFAST and CORS

Featured

Yesterday at Code4lib 2011 Karen Coombs gave a talk where (among other things) she demonstrated mapFAST that lets you find relevant subject headings for a given location, and then click on a subject heading and find relevant books on the topic. Go check out the archived video of her talk (note you’ll have to jump 39 minutes or so into the stream). Karen mentioned that the demo UI uses the mapFAST REST/JSON API. The service lets you construct a URL like this to get back subjects for any location you can identify with lat/lon coordinates:

http://experimental.worldcat.org/mapfast/services?geo={lat},{lon}";crs=wgs84&radius={radius-in-meters}&mq=&sortby=distance&max-results={num-results}"

For example:

ed@curry:~$ curl -i 'http://experimental.worldcat.org/mapfast/services?geo=39.01,-77.01;crs=wgs84&radius=100000&mq=&sortby=distance&max-results=1'
HTTP/1.1 200 OK
Date: Wed, 09 Feb 2011 14:07:39 GMT
Server: Apache/2.0.63 (Unix)
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/json

{
  "Status": {
    "code": 200,
    "request": "geocode"
  },
  "Placemark": [
    {
      "point": {
        "coordinates": "39.0064,-77.0303"
      },
      "description": "",
      "ExtendedData": [
        {
          "name": "NormalizedName",
          "value": "maryland silver spring woodside park"
        },
        {
          "name": "Feature",
          "value": "ppl"
        },
        {
          "name": "FCode",
          "value": "P"
        }
      ],
      "id": "fst01324433",
      "name": "Maryland -- Silver Spring -- Woodside Park"
    }
  ],
  "name": "FAST Authority Records"
}

Recently I have been reading Mark Pilgirm’s wonderful Dive into HTML5 book, so I got it into my head that it would be fun to try out some of the geo-location features in modern browsers to display subject headings that are relevant for wherever you are. A short time later I now have a simple HTML/JavaScript HTML5 application (dubbed subjects-here) that does just that. The application itself is really just a toy. Part of Karen’s talk was emphasizing the importance of using more than just text in Library Applications…and subjects-here kind of misses that key point.

What I wanted to highlight is the text in red in the HTTP response above:

Access-Control-Allow-Origin: *

The Access-Control-Allow-Origin HTTP header is a Cross-Origin Resource Sharing (CORS) header. If you’ve developed JavaScript applications before, you probably have run into situations where you wanted to load some JavaScript from a service elsewhere on the web. But you were prevented from doing this by Same Origin Policy, which prevents your JavaScript code from talking to a website that is different from the one it loaded from. So normally you hack around this by creating a proxy for that web service in your own application, which is a bit of work. Sometimes license agreements frown on you re-exposing their service, so you have to jump through a few more hoops to make sure it’s not an open proxy for the web service.

Enter CORS.

What the folks at OCLC did was add a Access-Control-Origin header to their JSON response. This basically means that my JavaScript served up at inkdroid.org is able to run in your browser and talk to the server at experimental.worldcat.org. OCLC has decided to allow this, to make their Web Service easier to use. So to create subjects-here I didn’t have to write a single bit of server side code, it’s just static HTML and JavaScript:

function main() {
    if (Modernizr.geolocation) {
        navigator.geolocation.getCurrentPosition(lookup_subjects);
    }
    else {
        display_error();
    }
}
 
function lookup_subjects(position) {
    lat = parseFloat(position.coords.latitude);
    lon = parseFloat(position.coords.longitude);
    url = "http://experimental.worldcat.org/mapfast/services?geo=" + lat + "," + lon + ";crs=wgs84&radius=100000&mq=&sortby=distance&max-results=15";
    $.getJSON(url, display_subjects);
}
 
function display_subjects(data) {
    // putting results into the DOM left as exercise to the reader
}

Nice and simple right? The full code is on GitHub, which seemed a bit superfluous since there is no server-side piece (it’s all in the browser). So the big wins are:

  • OCLC gets to see who is actually using their web service, not who is proxying it.
  • I don’t have to write some proxy code.

The slight drawbacks are:

  • My application has a runtime dependency on experimental.worldcat.org, but it kinda did already when I was proxying it.
  • Most modern browsers support CORS headers, but not all of them. So you would need to evaluate whether that matters to you.

I guess this is just a long way of saying USE CORS!! and help make the web a better place (pun intended).

Update: and also, it is a good example where something like GeoJSON and OpenSearch Geo could’ve been used to help spread common patterns for data on the Web. Thanks to Sean Gillies for pointing that out.

Update: and Chris is absolutely right, JSONP is another pattern in the Web Developer community that is a bit of a hack, but is an excellent fallback for older browsers.

release early, release often

Featured

The National Digital Newspaper Program (NDNP) went live with a new JavaScript viewer today (as well as a lot of other stylistic improvements) in the Beta area portion of Chronicling America.

Being able to smoothly zoom in on images, and go into fullscreen mode is really reminiscent (for me) of the visceral experience of using a microfilm reader. We joked about adding whirring sound effects when moving between pages. But you’ll be glad to know we showed restraint :-) It’s all kind of deeply ironic given the Web’s roots in the Memex.

Hats off to Dan Krech. Risa Ohara and Chris Adams for really digging into things like dynamically rendering tiles from our JPEG2000 access copies using Python (more maybe on that later).

I hacked together a brief video demonstration (above) of looking up the day the American Civil War ended (April 9, 1865) in the New York Daily Tribune, to show off the viewer. One thing I forgot to do was go into headless mode (F11 w/ Firefox, Chrome, etc), which amplifies the effect somewhat.

Aside from the improvements on the site, this is a real milestone for the project and (I believe) the Library of Congress generally, since it is a ‘beta’ preview of what we would like to replace the existing site with. Given the nature of what they do, libraries are typically fairly conservative and slow moving organizations. So knowing how to use a beta/experimental area has proven to be a challenge. Hopefully a little space for experimentation will pay off. I don’t think we could’ve gotten this far without the help of our fearless leader in all things tech, David Brunton.

If you have ideas, feel free to leave feedback using the little widget on the lower-right of pages at Chronicling America, or using our new mailing list that is devoted to the Open Source software project that makes Chronicling America available. Open Source too, imagine that!

on “good” repositories

Featured

Chris Rusbridge kicked off an interesting thread on JISC-REPOSITORIES with the tag line What makes a good repository? Implicit in this question, and perhaps the discussion list, is that he is asking about “digital” repositories, and not the brick n’ mortar libraries, archives, etc that are arguably also repositories.

The question of what a repository is, is pretty much verboten in the group I work in. This is kind of amusing since I work in a group whose latest name (in a string of names, and no-names) is the Repository Development Center. Well, maybe saying it’s “verboten” is putting it a bit too strongly. It’s not as if the “repository” is equivalent to He Who Shall Not Be Named or anything. It’s just that the word means so many things, to so many different people, and encompasses so much of what we do, that it’s hardly worth talking about. At our best (IMHO) we focus on what staff and researchers want to do with digital materials, and building out services that help them do that. Getting wrapped around the axle about what set of technologies we are using, and whether they model data in a particular way, is putting the cart before the horse.

As Dan penned one April 1st: “if you seek a pleasant repository, look about you”. I guess this largely depends on where you are sitting. But seriously, if there’s one thing that the Trustworthy Repositories Audit & Certification: Criteria and Checklist, the Ten Principles for Digital Preservation Repositories, and the Blue Ribbon Task Force on Sustainable Digital Preservation and Access make abundantly clear (after you’ve clawed out your eyes) it’s that the fiscal and social dimension of repositories are a whole lot more important in the long run than the technical bits of how a repository is assembled in the now. I’m a software developer, and by nature I reach for technical solutions to problems, but in my heart of hearts I know it’s true.

Back to Chris’ question. Perhaps the “digital” is a red-herring. What if we consider his question in light of traditional libraries? This got me thinking: could Ranganthan and his Five Laws of Library Science serve as a touchstone? Granted, bringing Ranganathan into library discussions is a bit of a cliché. But asking ethical questions like the “goodness” of something is a great excuse to dip into the canon. So put on your repository colored glasses, which magically substitute Repository Object for Book, and …

Repository Objects Are For Use

We can build repositories that function as dark archives. But it kind of rots your soul to do it. It rots your soul because no matter what awesome technologies you are using to enable digital preservation in the repository, the repository needs to be used by people. If it isn’t used, the stuff rots. And the digital stuff rots a whole lot faster than the physical materials. Your repository should have a raison d’être. It should be anchored in a community of people that want to use the materials that it houses. If it doesn’t the repository is likely to suck not be good.

Every Reader His/Her Repository Object

Depending on their raison d’être (see above) repositories are used by a wide variety of people: researchers, administrators, systems developers, curators, etc. It does a disservice to these people if the repository doesn’t support their use cases. A researcher probably doesn’t care when fixity checks were last performed, and an administrator generating a report on fixity checks doesn’t care about how an repository object was linked to and tagged in Twitter. Does your repository allow these different views, for different users to co-exist for the same object? Does it allow new classes of users to evolve?

Every Repository Object Its Reader

Are the objects in your repository discoverable? Are there multiple access pathways to them? For example, can someone do a search in Google and wind up looking at an item in your repository? Can someone link to it from a Wikipedia article? Can someone do a search within your repository to find an object of interest? Can they browse a controlled vocabulary or subject guide to find it? Are repository objects easily identified and found by automated agents like web crawlers and software components that need to audit them? Is it easy to extend, enhance and refine your description of what the repository object is as new users are discovered?

Save the Time of the Reader

Is your repository collection meaningfully on the Web? If it isn’t, it should be, because that’s where a lot of people are doing research today…in their web browser. If it can’t be open access on the web, that’s OK … but the collection and its contents should be discoverable so that someone can arrange an onsite visit. For example, can a genealogist do a search for a person’s name in a search engine and end up in your repository? Or do they have to know to come to your application to type in a search there? Once they are in your repository can they easily limit their search along familiar dimensions such as who, what, why, when, and where? Is it easy for someone to bookmark a search, or an item for later use. Do you allow your repository objects to be reused in other contexts like Facebook, Twitter, Flickr, etc which put the content where people are, instead of expecting them to come to you?

The Repository is a Growing Organism

This is my favorite. Can you keep adding numbers and types of objects, and scale your architecture linearly? Or are you constrained in how large the repository can grow? Is this constraint technical, social and/or financial? Can your repository change as new types or numbers of users (both human and machine) come into existence? When the limits of a particular software stack are reached, is it possible to throw it away and build another without losing the repository objects you have? How well does your repository fit into the web ecosystem? As the web changes do you anticipate your repository will change along with it? How can you retire functionality and objects; to let them naturally die, with respect, and make way for the new?

So …

I guess there are more questions here than answers. I hadn’t thought of framing repository questions in terms of Ranganathan’s laws before, but I imagine it has occurred to other people before. They still seem to be quite good principles to riff on, even in the digital repository realm–at least for a blog post. If you happen to run across similar treatment elsewhere I would appreciate hearing about them.

xhtml, wayback

Featured

The Internet Archive gave the Wayback Machine a facelift back in January. It actually looks really nice, but I noticed something kinda odd. I was looking for old archived versions of the lcsh.info site. Things work fine for the latest archived copies:

But during part of lcsh.info’s brief lifetime the site was serving up XHTML with the application/xhtml+xml media type. Now Wayback rightly (I think) remembers the media type, and serves it up that way:

ed@curry:~$ curl -I http://replay.waybackmachine.org/20081216020433/http://lcsh.info/
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Guessed-Charset: UTF-8
X-Archive-Orig-Connection: close
X-Archive-Orig-Content-Length: 6497
X-Archive-Orig-Content-Type: application/xhtml+xml; charset=UTF-8
X-Archive-Orig-Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.4.6 PHP/5.2.4-2ubuntu5.4 with Suhosin-Patch mod_wsgi/1.3 Python/2.5.2
X-Archive-Orig-Date: Tue, 16 Dec 2008 02:04:31 GMT
Content-Type: application/xhtml+xml;charset=utf-8
X-Varnish: 1458812435 1458503935
Via: 1.1 varnish
Date: Wed, 09 Mar 2011 23:09:47 GMT
X-Varnish: 903390921
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS

But to add navigation controls and branding, Wayback also splices in its own HTML into the display, which unfortunately is not valid XML. And since the media type and doctype trigger standards mode in browsers, the pages render in Firefox like this:

And in Chrome like this:

Now I don’t quite know what the solution should be here. Perhaps the HTML that is spliced in should be valid XML. Or maybe Wayback should just serve up the HTML as text/html. Or maybe this is a good use case for frames (gasp). But I imagine it will similarly afflict any other XHTML that was served up as application/xhtml+xml when Heretrix crawled it.

Sigh. I sure am glad that HTML5 is arriving on the scene and XHTML is riding off into the sunset. Although it’s kind of the Long Goodbye given Internet Archive has archived it.

Update: just a couple hours later I got an email that a fix for this was deployed. And sure enough now it works. I quickly eyeballed the response and didn’t see what the change was. Thanks very much Internet Archive!

geeks bearing gifts

Featured

I recently received some correspondence about the EZID identifier service from the California Digital Library. EZID is a relatively new service that aims to help cultural heritage institutions manage their identifiers. Or as the EZID website says:

EZID (easy-eye-dee) is a service that makes it simple for digital object producers (researchers and others) to obtain and manage long-term identifiers for their digital content. EZID has these core functions:

Create a persistent identifier: DOI or ARK

  • Add object location
  • Add metadata
  • Update object location
  • Update object metadata

I have some serious concerns about a group of cultural institutions relying on a single service like EZID for managing their identifier namespaces. It seems too much like a single point of failure. I wonder, are there any plans to make the software available, and to allow multiple EZID servers to operate as peers?

URLs are a globally deployed identifier scheme that depend upon HTTP and DNS. These technologies have software implementations in many different computer languages, for diverse operating systems. Bugs and vulnerabilities associated with these software libraries are routinely discovered and fixed, often because the software itself is available as open source, and there are “many eyes” looking at the source code. Best of all, you can put a URL into your web browser (which are now ubiquitous), and view a document that is about the identified resource.

In my humble opinion, cultural heritage institutions should make every effort to work with the grain of the Web, and taking URLs seriously is a big part of that. I’d like to see more guidance for cultural heritage institutions on effective use of URLs, what Tim Berners-Lee has called Cool URIs, and what the Microformats and blogging community call permalinks. When systems are being designed or evaluated for purchase, we need to think about the URL namespaces that we are creating, and how we can migrate them forwards. Ironically, identifier schemes that don’t fit into the DNS and HTTP landscape have their own set of risks; namely that organizations become dependent on niche software and services. Sometimes it’s prudent (and cost effective) to seek safety in numbers.

I did not put this discussion here to try to shame CDL in any way. I think the EZID service is well intentioned, clearly done in good spirit, and already quite useful. But in the long run I’m not sure it pays for institutions to go it alone like this. As another crank (I mean this with all due respect) Ted Nelson put it:

Beware Geeks Bearing Gifts.

On the surface the EZID service seems like a very useful gift. But it comes with a whole set of attendant assumptions. Instead of investing time & energy getting your institution to use a service like EZID, I think most cultural heritage institutions would be better off thinking about how they manage their URL namespaces, and making resource metadata available at those URLs.

snac hacks

Featured

A few months ago Brian Tingle posted some exciting news that the Social Networks and Archival Context (SNAC) project was releasing the data that sits behind their initial prototype:

As a part of our work on the Social Networks and Archival Context Project, the SNAC team is please to release more early results of our ongoing research.

A property graph of correspondedWith and associatedWith relationships between corporate, personal, and family identities is made available under the Open Data Commons Attribution License in the form of a graphML file. The graph expresses 245,367 relationships between 124,152 named entities.

The graphML file, as well as the scripts to create and load a graph database from EAC or graphML, are available on google code [5]

We are still researching how to map from the property graph model to RDF, but this graph processing stack will likely power the interactive visualization of the historical social networks we are developing.

The SNAC project have aggregated archival finding aid data for manuscript collections at the Library of Congress, Northwest Digital Archives, Online Archive of California and Virginia Heritage. They then used authority control data from NACO/LCNAF, Getty Union List of Artist Names Online (ULAN) and VIAF to knit these archival finding aids using the Encoded Archival Context – Corporate bodies, Persons, and Families (EAC-CPF).

I wrote about SNAC here about 9 months ago, and how much potential there is in the idea of visualizing archival collections across institutions, along the axis of identity. I had also privately encouraged Brian to look into releasing some portion of the data that is driving their prototype. So when Brian delivered I felt some obligation to look at the data and try to do something with it. Since Brian indicated that the project was interested in an RDF serialization, and Mark had pointed me at Aaron Rubenstein’s arch vocabulary, I decided to take a stab at converting the GraphML data to some Arch flavored RDF.

So I forked Brian’s mercurial repository, and wrote a script that parses the GraphML XML that Brian provided, and writes RDF (using arch:correspondedWith, arch:primaryProvenanceOf, arch:appearsWith) to a local triple store using rdflib. Since RDF has URLs cooked in pretty deep, part of this conversion involved reverse-engineering the SNAC URLs in the prototype, which wasn’t terribly clean, but it seemed good enough for demonstration purposes.

Once I had those triples (877,595 of them) I learned from Cory Harper that the SNAC folks had matched up the archival identities with entries in the Virtual International Authority File. The VIAF URLs aren’t present in their GraphML data (GraphML is not as expressive as RDF), but they are available in the prototype HTML, which I had URLs for. So, again, in the name of demonstration and not purity, I wrote a little scraper that would use the reverse-engineered SNAC URL to pull down the VIAF id. I tried to be respectful and not do this scraping in parallel, and to sleep a bit between requests. A few days of running and I had 40,237 owl:sameAs assertions that linked the SNAC URLs with the VIAF URLs.

With the VIAF URLs in hand I thought it would be useful to have a graph of only the VIAF related resources. It seemed like a VIAF centered graph of archival information could demonstrate something we’ve been talking about some in the Library Linked Data W3C Incubator Group: that Linked Data actually provides a technology that lets the archival and bibliographic description communities cross-pollinate and share. This is the real insight of the SNAC effort, that these communities have a lot in common, in that they both deal with people, places, organizations, etc. So I wrote another little script that created a named graph within the larger triple store, and used the owl:sameAs assertions to do some brute force inferencing, to generate triples relating VIAF resources with Arch.

I realize that Turtle isn’t probably the most compelling example of the result, but in the absence of an app (maybe more on that forthcoming) that uses it, it’ll have to do for now. So here are the assertions for Vannevar Bush, for the Linked Data fetishists out there:

@prefix foaf <http://xmlns.com/foaf/0.1/> .
@prefix arch <http://purl.org/archival/vocab/arch#> .

<http://viaf.org/viaf/15572358/#foaf:Person>
    a foaf:Person ;
    foaf:name "Bush, Vannevar, 1890-1974." ;
    arch:appearsWith <http://viaf.org/viaf/21341544/#foaf:Person>,
        <http://viaf.org/viaf/30867998/#foaf:Person>,
        <http://viaf.org/viaf/5076979/#foaf:Person>,
        <http://viaf.org/viaf/6653121/#foaf:Person>,
        <http://viaf.org/viaf/79397853/#foaf:Person> ;
    arch:correspondedWith <http://viaf.org/viaf/13632081/#foaf:Person>,
        <http://viaf.org/viaf/16555764/#foaf:Person>,
        <http://viaf.org/viaf/18495018/#foaf:Person>,
        <http://viaf.org/viaf/20482758/#foaf:Person>,
        <http://viaf.org/viaf/20994992/#foaf:Person>,
        <http://viaf.org/viaf/32065073/#foaf:Person>,
        <http://viaf.org/viaf/41170431/#foaf:Person>,
        <http://viaf.org/viaf/44376228/#foaf:Person>,
        <http://viaf.org/viaf/46092803/#foaf:Person>,
        <http://viaf.org/viaf/49966637/#foaf:Person>,
        <http://viaf.org/viaf/51816245/#foaf:Person>,
        <http://viaf.org/viaf/52483290/#foaf:Person>,
        <http://viaf.org/viaf/54269107/#foaf:Person>,
        <http://viaf.org/viaf/54947702/#foaf:Person>,
        <http://viaf.org/viaf/56705976/#foaf:Person>,
        <http://viaf.org/viaf/63110426/#foaf:Person>,
        <http://viaf.org/viaf/64014369/#foaf:Person>,
        <http://viaf.org/viaf/64087560/#foaf:Person>,
        <http://viaf.org/viaf/6724310/#foaf:Person>,
        <http://viaf.org/viaf/71767943/#foaf:Person>,
        <http://viaf.org/viaf/75645270/#foaf:Person>,
        <http://viaf.org/viaf/76361317/#foaf:Person>,
        <http://viaf.org/viaf/77126996/#foaf:Person>,
        <http://viaf.org/viaf/77903683/#foaf:Person>,
        <http://viaf.org/viaf/8664807/#foaf:Person>,
        <http://viaf.org/viaf/92419478/#foaf:Person> ;
    arch:primaryProvenanceOf <http://hdl.loc.gov/loc.mss/eadmss.ms001043>,
        <http://hdl.loc.gov/loc.mss/eadmss.ms007098>,
        <http://hdl.loc.gov/loc.mss/eadmss.ms010024>,
        <http://hdl.loc.gov/loc.mss/eadmss.ms998004>,
        <http://hdl.loc.gov/loc.mss/eadmss.ms998007>,
        <http://hdl.loc.gov/loc.mss/eadmss.ms998009>,
        <http://nwda-db.wsulibs.wsu.edu/findaid/ark:/80444/xv42415>,
        <http://www.oac.cdlib.org/findaid/ark:/13030/kt5b69p6zq>,
        <http://www.oac.cdlib.org/findaid/ark:/13030/kt8w1014rz> ;
    owl:sameAs <http://socialarchive.iath.virginia.edu/xtf/view?docId=Bush+Vannevar+1890-1974-cr.xml> .

I’ve made a full dump of the data I created available if you are interested in taking a look. The nice thing is that the URIs are already published on the web, so I didn’t need to mint any identifiers myself to publish this Linked Data. Although I kind of played fast and loose with the SNAC URIs for people since they don’t do the httpRange-14 dance. It’s interesting that it doesn’t seem to have immediately broken anything. It would be nice if the SNAC Prototype URIs were a bit cleaner I guess. Perhaps they could use some kind of identifier instead of baking the heading into the URL?

So maybe I’ll have some time to build a simple app on top of this data. But hopefully I’ve at least communicated how useful it could be for the cultural heritage community to share web identifiers for people, and use them in their data. RDF also proved to be a nice malleable data model for expressing the relationships, and serializing them so that others could download them. Here’s to the emerging (hopefully) Giant Global GLAM Graph!

DOIs as Linked Data

Featured

Last week Ross Singer alerted me to some pretty big news for folks interested in Library Linked Data: CrossRef has made the metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are heavily used in the publishing space to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So practically what this means is that all the places in the scholarly publishing ecosystem where DOIs are present (caveat below), it’s now possible to use the Web to retrieve metadata associated with that electronic document. Say you’ve got a DOI in the database backing your institutional repository:

doi:10.1038/171737a0

you can use the DOI to construct a URL:

http://dx.doi.org/10.1038/171737a0

and then do an HTTP GET (what your Web browser is doing all the time as you wander around the Web) to ask for metadata about that document:

curl –location –header “Accept: text/turtle” http://dx.doi.org/10.1038/171737a0

At which point you will get back some Turtle flavored RDF that looks like:

<http://dx.doi.org/10.1038/171737a0>
    a <http://purl.org/ontology/bibo/Article> ;
    <http://purl.org/dc/terms/title> "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" ;
    <http://purl.org/dc/terms/creator> <http://id.crossref.org/contributor/f-h-c-crick-367n8iqsynab1>, <http://id.crossref.org/contributor/j-d-watson-367n8iqsynab1> ;
    <http://prismstandard.org/namespaces/basic/2.1/doi> "10.1038/171737a0" ;
    <http://prismstandard.org/namespaces/basic/2.1/endingPage> "738" ;
    <http://prismstandard.org/namespaces/basic/2.1/startingPage> "737" ;
    <http://prismstandard.org/namespaces/basic/2.1/volume> "171" ;
    <http://purl.org/dc/terms/date> "1953-04-25Z"^^<http://www.w3.org/2001/XMLSchema#date> ;
    <http://purl.org/dc/terms/identifier> "10.1038/171737a0" ;
    <http://purl.org/dc/terms/isPartOf> <http://id.crossref.org/issn/0028-0836> ;
    <http://purl.org/dc/terms/publisher> "Nature Publishing Group" ;
    <http://purl.org/ontology/bibo/doi> "10.1038/171737a0" ;
    <http://purl.org/ontology/bibo/pageEnd> "738" ;
    <http://purl.org/ontology/bibo/pageStart> "737" ;
    <http://purl.org/ontology/bibo/volume> "171" ;
    <http://www.w3.org/2002/07/owl#sameAs> <doi:10.1038/171737a0>, <info:doi/10.1038/171737a0> .


<http://id.crossref.org/contributor/f-h-c-crick-367n8iqsynab1>
    a <http://xmlns.com/foaf/0.1/Person> ;
    <http://xmlns.com/foaf/0.1/familyName> "CRICK" ;
    <http://xmlns.com/foaf/0.1/givenName> "F. H. C." ;
    <http://xmlns.com/foaf/0.1/name> "F. H. C. CRICK" .

<http://id.crossref.org/contributor/j-d-watson-367n8iqsynab1>
    a <http://xmlns.com/foaf/0.1/Person> ;
    <http://xmlns.com/foaf/0.1/familyName> "WATSON" ;
    <http://xmlns.com/foaf/0.1/givenName> "J. D." ;
    <http://xmlns.com/foaf/0.1/name> "J. D. WATSON" .

Well without all the funky colors…I put them there to help illustrate how the RDF includes some useful information, such as:

  • the document is an Article
  • it has the title “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid”
  • the article was published on April 25th, 1953
  • the article was published in the journal Nature
  • the article was written by two people: J. D. Watson and F. H. C. Crick
  • it can be found in volume 171, on pages 737-738

It’s also interesting that both the Bibliographic Ontology and the Publishing Requirements for Industry Standard Metadata (PRISM) vocabularies being used. RDF lets you mix in different vocabularies like this. Some people might see this description as partly redundant, but it allows a data publisher to play the field a bit in its descriptions, while still committing to a particular URL for the resource.

Anyhow, the whole point of Linked Data is that you (or your software) can follow your nose by noticing links to related resources of interest in the data. If you are familiar with Turtle and RDF (a more visual diagram is below) you’ll see that the article “Molecular Structure of Nucleic Acids” is “part of” another resource:

http://id.crossref.org/issn/0028-0836

If we follow our nose to this URL we get another bit of RDF:

<http://id.crossref.org/issn/0028-0836>
    <http://purl.org/dc/terms/publisher> <http://periodicals.dataincubator.org/organization/nature-publishing-group> ;
    <http://purl.org/dc/terms/sameAs> <http://periodicals.dataincubator.org/issn/0028-0836>, <urn:issn:0028-0836> ;
    <http://purl.org/dc/terms/title> "Nature" ;
    a "http://purl.org/ontology/bibo/Journal" .

Which tells us that the article is part of the journal Nature, which is the “same as” link to a resource in Linked Periodicals Data at the Data Incubator. When we resolve that URL we eventually get some more RDF:

<http://periodicals.dataincubator.org/journal/nature>
    dc:identifier <info:pmid/0410462>, <info:pmid/0410463> ;
    dc:subject "BIOLOGY", "Biologie", "CIENCIA", "NATURAL HISTORY", "Natuurwetenschappen", "Physique", "SCIENCE", "Science", "Sciences" ;
    dct:publisher <http://periodicals.dataincubator.org/organization/nature-publishing-group> ;
    dct:subject <http://id.loc.gov/authorities/sh85014203>, <http://id.loc.gov/authorities/sh00007934>, <http://id.loc.gov/authorities/sh85015263>, <http://id.loc.gov/authorities/sh85090222>, <http://id.loc.gov/authorities/sh85118553> ;
    dct:title "Nature" ;
    bibo:eissn "1476-4687" ;
    bibo:issn "0028-0836", "0090-0028" ;
    bibo:shortTitle "Nat New Biol", "Nature", "Nature New Biol." ;
    a bibo:Journal ;
    owl:sameAs <http://periodicals.dataincubator.org/eissn/1476-4687>, <http://periodicals.dataincubator.org/issn/0028-0836>, <http://periodicals.dataincubator.org/issn/0090-0028> ;
    foaf:isPrimaryTopicOf <http://locatorplus.gov/cgi-bin/Pwebrecon.cgi?DB=local&v1=1&ti=1,1&Search_Arg=0410462&Search_Code=0359&CNT=20&SID=1>, <http://locatorplus.gov/cgi-bin/Pwebrecon.cgi?DB=local&v1=1&ti=1,1&Search_Arg=0410463&Search_Code=0359&CNT=20&SID=1>, <http://www.ncbi.nlm.nih.gov/sites/entrez?Db=nlmcatalog&doptcmdl=Expanded&cmd=search&Term=0410462%5BNlmId%5D>, <http://www.ncbi.nlm.nih.gov/sites/entrez?Db=nlmcatalog&doptcmdl=Expanded&cmd=search&Term=0410463%5BNlmId%5D> .

Which (among other things) tells us that the journal Nature publishes content with the topic of “Biology” from the Library of Congress Subject Headings:

<http://id.loc.gov/authorities/sh85014203#concept>
    skos:prefLabel "Biology"@en ;
    dcterms:created "1986-02-11T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    dcterms:modified "1990-10-09T11:20:35-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    a skos:Concept ;
    owl:sameAs <info:lc/authorities/sh85014203> ;
    skos:broader <http://id.loc.gov/authorities/sh85076841#concept> ;
    skos:closeMatch <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119440835> ;
    skos:inScheme <http://id.loc.gov/authorities#conceptScheme>, <http://id.loc.gov/authorities#topicalTerms> ;
    skos:narrower <http://id.loc.gov/authorities/sh00003440#concept>, <http://id.loc.gov/authorities/sh2001012327#concept>, <http://id.loc.gov/authorities/sh2003008355#concept>, <http://id.loc.gov/authorities/sh2005001919#concept>, <http://id.loc.gov/authorities/sh2006001276#concept>, <http://id.loc.gov/authorities/sh2006002547#concept>, <http://id.loc.gov/authorities/sh2006005143#concept>, <http://id.loc.gov/authorities/sh2007007463#concept>, <http://id.loc.gov/authorities/sh2009008123#concept>;
    skos:related <http://id.loc.gov/authorities/sh85076810#concept>, <http://id.loc.gov/authorities/sh85090222#concept>, <http://id.loc.gov/authorities/sh90000612#concept> .

Here we can see the topic of Biology as it relates to other concepts in the Library of Congress Subject Headings, as well as a similar concept in Biologie générale from RAMEAU, which are subject headings from the Bibliothèque nationale de France.

<http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119440835>
    a skos:Concept ;
    skos:altLabel "Biologie générale"@fr ;
    skos:broader <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119716335> ;
    skos:closeMatch <http://d-nb.info/gnd/4006851-1>, <http://id.loc.gov/authorities/sh85014203#concept> ;
    skos:inScheme <http://stitch.cs.vu.nl/vocabularies/rameau/autorites_matieres>, <http://stitch.cs.vu.nl/vocabularies/rameau/noms_communs> ;
    skos:narrower <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11931061x>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11931064z>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119310659>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11933905d>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119348479>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11940847j>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119422283>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119440599>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11944082t>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11946836b>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11953082s>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11958000t>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119586710>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11962028z>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119658909>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11978175d>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11978521k>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11978651s>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11979066d>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11979284w>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11985187w>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119867466>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11988172k>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119886708>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb120174264>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb12100722v>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb121238944>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb121441898>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb121519084>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb123305379>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb13162665q>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb13319250k>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb13622707v>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb144016698>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb150318912>, <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb150557648> ;
    skos:note "Domaine : 570"@fr ;
    skos:prefLabel "Biologie"@fr, "FRBNF119440833"@x-notation ;

So at this point maybe you’ll admit that it’s kind of cool to wander around in the data like this. But if you haven’t drunk the Kool-Aid recently (unlikely if you’ve read this far) you might be wondering: what’s the point? Who cares?

I think you should care about this example because it shows:

  1. how an existing organization can leverage its pre-existing identifiers on the Web to enable data publishing (Linked Data)
  2. how important it is for publishers to consider who they link to in their data, and how they do it
  3. how essential the RDF data model is for using the Web to join up these pools (or some may call them silos) of data

The raw Turtle RDF above may have made your eyes glaze over, so its worth restating that this new DOI service allows those with DOIs in their databases to use the machinery of the Web to aggregate and join up data from 4 different organizations: CrossRef, Data Incubator, Library of Congress, and the Bibliothèque nationale de France:

And it’s not just the traditional scholarly publishing community that will potentially benefit from this new Linked Data. As I discovered last August when routing around in the external links dumps from English Wikipedia there were 323,805 links from Wikipedia Articles to dx.doi.org–for example the article for Molecular Structure of Nucleic Acids has a citation that includes an external link to the DOI URL included above.

CrossRef’s new Linked Data service could allow someone to write a bot to crawl and verify the citations on Wikipedia. Or perhaps there could be a template on Wikipedia that would allow an editor to add a citation to an article by simply using the DOI, which would then fill in the other bits of article metadata needed for display. There are lots of possibilities.

As I commented over on the CrossTech blog (not approved yet), it would be handy if the service was able to parse and act on non-simple Accept headers during content negotiation, since it’s fairly common for RDF tools like jena, rdflib, arc, redland to send Accept headers with q-values in them. It might actually be nice to see support for some simple JSON views, that might be handy for people that get scared off RDF easily. But those are some minor quibbles in comparison to the outstanding work that CrossRef have done in getting this service going. Hopefully we’ll see more publishing organizations like DataCite helping build this data publishing community more as well.

Update: if this topic interests you, and you want to read more about it, definitely check out John Erickson‘s blog post DOIs, URIs and Cool Resolution.

a bit about scruffiness

Featured


Terry Winograd at CHI 2006 (by boltron)

I just finished reading Understanding Computers and Cognition by Terry Winograd and Fernando Flores and have to jot down some quick notes & quotes before I jump in and start reading it again … yeah, it’s that good.

Having gone on Rorty and Wittgenstein kicks recently, I was really happy to find this book while browsing the Neats vs Scruffies Wikipedia article a few months ago. It seems to combine this somewhat odd interest I have in pragmatism and writing software. While it was first published in 1986, it’s still very relevant today, especially in light of what the more Semantic Web heavy Linked Data crowd are trying to do with ontologies and rules. Plus it’s written in clear and accessible language, which is perfect for the arm-chair compsci/philosopher-type … so it’s ideal for a dilettante like me.

While only 207 pages long, the breadth of the book is kind of astounding. The philosophy of Heidegger figures prominently…in particular his ideas about throwness, breakdowns and readiness to hand which emphasize the importance of concernful activity over rationalist, representations of knowledge.

Heidegger insists that it is meaningless to talk about the existence of objects and their properties in the absence of concernful activity, with its potential for breaking down.

The work of the biologist Humberto Maturana forms the second part of the theoretical foundation of the book. The authors draw upon Maturana’s ideas about structural coupling to emphasize the point that:

The most successful designs are not those that try to fully model the domain in which they operate, but those that are ‘in alignment’ with the fundamental structure of that domain, and that allow for modification and evolution to generate new structure coupling.

And the third leg in the chair is John Searle’s notion of speech acts which emphasizes the role of commitment and action, or the social role of language in dealing with meaning.

Words correspond to our intuition about “reality” because our purposes in using them are closely aligned with our physical existence in a world and our actions within it. But the coincidence is the result of our use of language within a tradition … our structure coupling within a consensual domain. Language and cognition are fundamentally social … our ability to think and to give meaning to language is rooted in our participation in a society and a tradition.


Fernando Flores (by Sebastián Piñera)

So the really wonderful thing that this book does here is take this theoretical framework (Heidegger, Maturana & Searle) and apply it to the design of computer software. As the preface makes clear, the authors largely wrote this book to dismantle popular (at the time) notions that computers would “think” like humans. While much of this seems anachronistic today, we still see similar thinking in some of the ways that the Semantic Web is described, where machines will understand the semantics of data, using ontologies that model the “real world”.

There is still a fair bit of talk about getting the ontologies just right so that they model the world properly, and then running rule driven inference engines over the instance data, to “learn” more things. But what is often missing is a firm idea of what actual tools will use this new data. How will these tools be used by people acting in a particular domain? Like The Modeler, practitioners in the Linked Data and Semantic Web community often jump to modeling a domain, and trying to get it to match “reality” before understanding what the field of activity we want to support is…what we are trying to have the computer help us do … what new conversations we want the computer to enable with other people.

In creating tools we are designing new conversations and connections. When a change is made, the most significant innovation is the modification of the conversation structure, not the mechanical means by which the conversation is carried out. In making such changes we alter the overall pattern of conversation, introducing new possibilities or better anticipating breakdowns in the previously existing ones … When we are aware of the real impact of design we can more consciously design conversation structures that work.

It’s important to note here that these are conversations between people, who are acting in some domain, and using computers as tools. It’s the social activity that grounds the computer software, and not some correspondence that the software shares with reality or truth. I guess this is a subtle point, and I’m not doing a terrific job of elucidating it here, but if your interest is piqued definitely pick up a copy of the book. Over the past 5 years I’ve been lucky to work with several people who intuitively understand how important the social setting and alignment are to successful software development–but it’s nice to have the theoretical tools as ballast when the weather gets rough.

Another really surprising part of the book (given that it was written in 1986) is the foreshadowing of the agile school of programming:

… the development of any computer-based system will have to proceed in a cycle from design to experience and back again. It is impossible to anticipate all of the relevant breakdowns and their domains. They emerge gradually in practice. System development methodologies need to take this as a fundamental condition of generating the relevant domains, and to facilitate it through techniques such as building prototypes early in the design process and applying them in situations as close as possible to those in which they will eventually be used.

Compare that with the notion of iterative development that’s now prevalent in software development circles. I guess it shouldn’t be that surprising since the roots of extend back quite a ways. But still, it was pretty eerie seeing how on target Winograd and Flores could be still, particularly in the field of computing which has changed so rapidly in the last 25 years.

update: Kendall Clark has an interesting post that addresses some of the concerns about semantic web technologies.
update: Ryan Shaw recommended some more reading material in this vein.

the dpla as a generative platform

Featured

Last week I had the opportunity to attend a meeting of the Digital Public Library of America (DPLA) in Amsterdam. Several people have asked me why an American project was meeting in Amsterdam. In large part it was an opportunity for the DPLA to reach out to, and learn from European projects such as Europeana, LOD2 and Wikimedia Germany–or as the agenda describes:

The purpose of the May 16 and 17 expert working group meeting, convened with generous support from the Open Society Foundations, is to begin to identify the characteristics of a technical infrastructure for the proposed DPLA. This infrastructure ideally will be interoperable with international efforts underway, support global learning, and act as a generative platform for undefined future uses. This workshop will examine interoperability of discovery, use, and deep research in existing global digital library infrastructure to ensure that the DPLA adopts best practices in these areas. It will also serve to share information and foster exchange among peers, to look for opportunities for closer collaboration or harmonization of existing efforts, and to surface topics for deeper inquiry as we examine the role linked data might play in a DPLA.

Prior to the meeting I read the DPLA Concept Note, watched the discussion list and wiki activity — but the DPLA still seemed somewhat hard to grasp to me. The thing I learned at the meeting in Amsterdam is that this nebulousness is by design–not by accident. The DPLA steering committee aren’t really pushing a particular solution that they have in mind. In fact, there doesn’t seem to be a clear consensus about what problem they are trying to solve. Instead the steering committee seem to be making a concerted effort to keep an open, beginners-mind about what a Digital Public Library of America might be. They are trying to create conversations around several broad topic areas or work-streams: content and scope, financial/business models, governance, legal issues, and technical aspects. The recent meeting in Amsterdam focused on the technical aspects work-stream–in particular, best practices for data interoperability on the Web. The thought being that perhaps the DPLA could exist in some kind of distributed relationship with existing digital library efforts in the United States–and possibly abroad. Keeping an open mind in situations like this takes quite a bit of effort. There is often an irresistable urge to jump to particular use cases, scenarios or technical solutions, for fear of seeming ill informed or rudderless. I think the DPLA should be commended for creating conversations at this formative stage, instead of solutions in search of a problem.

I hadn’t noticed the phrase “generative platform” in the meeting announcement above until I began this blog post…but in hindsight it seems very evocative of the potential of the DPLA. At their best, digital libraries currently put content on the Web, so that researchers can discover it via search engines like Google, Bing, Baidu, etc. Researchers discover a pool of digital library content while performing a query in a search engine. Once they’ve landed in the digital library webapp they can wander outwards to related resources, and perhaps do a more nuanced search within the scoped context of the collection. But in practice this doesn’t happen all that often. I imagine many institutions digitize content that actually never makes it onto the Web at all. And when it does make it onto the Web it is often deep-web content hiding behind a web form, un-discoverable by crawlers. Or worse, the content might be actively made invisible by using a robots.txt to prevent search engines from crawling it. Sadly this is often done for performance reasons, not out of any real desire to keep the content from being found–because all too often library webapps are not designed to support crawling.

I was invited to talk very briefly (10 minutes) about Linked Data at the Amsterdam meeting. I think most everyone recognizes that a successful DPLA would exist in a space where there has been years of digitization efforts in the US, with big projects like the HathiTrust and countless others going on. I wanted to talk about how the Web could be used to integrate these collections. Rather than digging into a particular Linked Data solution to the problem of synchronization, I thought I would try to highlight how libraries could learn to do web1.0 a bit better. In particular I wanted to showcase how Google Scholar abandoned OAI-PMH (a traditional library standard for integrating collections) in favor of using sitemaps and metadata embedded in HTML. I wanted to show how thoughtful use of sitemaps, a sensible robots.txt, and perhaps some Atom to publish updates, and deletes a bit more methodically can offer just the same functionality as OAI-PMH, but in a way that is aligned with the Web, and the services that are built on top of it. Digital library initiatives often go off and create their own specialized way of looking at the Web, and ignore broader trends. The nature of grant funding, and research papers often serve as an incentive for this behavior. I’ve heard rumors that there is even some NISO working group being formed to look into standardizing some sort of feed based approach to metadata harvesting. Personally I think it’s probably more important for us to use some of the standards and patterns that are already available instead of trying to define another one.

So you could say I pulled a bit of a bait and switch: instead of talking about Linked Data I really ended up talking about Search Engine Optimization. I didn’t mention RDF or SPARQL once. If anyone noticed they didn’t seem to mind too much.

I learned a lot of very useful information during the presentations–too much to really note here. But there was one conversation that really stands out after a week has passed.

Greg Crane of the Perseus Digital Library spoke about about Deep Research, and how students and researchers participate in the creation of online knowledge. At one point Greg had a slide that contained a map of the entire world, and spoke about how the scope of the DPLA can’t really be confined to the United States alone–since American society is largely made up of immigrant communities (some by choice, some not) the scope of the DPLA is in fact the entire world. I couldn’t help but think how completely audacious it was to say that the Digital Public Library of America would somehow encompass the world — similar to how brick and mortar library and museum collections can often mirror the imperialistic interests of the states that they belong to.


Original WWW graphic by Robert Cailliau

So I was relieved when Stefan Gradmann asked how Greg thought the DPLA would fit in with projects like Europeana, which are already aggregating content from Europe. I can’t exactly recall Greg’s response (update: Greg filled in some of the blanks via email), but this prompted Dan Brickley to point out that in fact it’s pretty hard to draw lines around Europe too … and more importantly the Web is a space that can unite these projects. At this point Josh Greenberg jumped in and suggested that perhaps some thoughtful linking between sites like a DPLA and Europeana could help bring them together. This all probably happened in the span of 3 or 4 minutes, but the exchange really crystallized for me that the cultural heritage community could do a whole lot better at deep linking with each other. Josh’s suggestion is particularly good, because researchers could see and act on contextual links. It wouldn’t be something hidden in a data layer that nobody ever looks at. But to do this sort of linking right we would need to share our data better with each other, and it would most likely need to be Linked Data — machine readable data with URLs at its core. I guess it’s a no-brainer that for it to succeed the DPLA needs to be aligned with the ultimate generative platform of our era: the World Wide Web. Name things with URLs, create typed links between them, and other people’s stuff.

Another thing that struck me was how Europeana really gets the linking part. Europeana is essentially a portal site, or directory of digital objects for more than 15 million items provided by hundreds of providers across Europe. You can find these objects in Europeana, but if you drill down far enough you eventually find yourself on the original site that made the object available. I agree with folks that think that perhaps the user experience of the site would be improved if the user never left Europeana to view the digital object in full. This would necessarily require harvesting a richer version of the digital object, which would be more difficult, but not impossible. There would also be an opportunity to serve as a second copy for the object, which is potentially very desirable to originating institutions for preservation purposes…lots of copies keeps stuff safe.

But even in this hypothetical scenario where the object is available in full on Europeana, I think it would still be important to link out to the originating institution that digitized the object. Linking makes the provenance of the item explicit, which will continue to be important to researchers on the Web. But perhaps more importantly it gives institutions a reason to participate in the project as a whole. Participants will see increased awareness and use of their web properties, as users wander over from Europeana. Perhaps they could even link back to Europeana, which ought to increase Europeana’s density in the graph of the web, which also should boost its relevancy ranking in search engines like Google.

Another good lesson of Europeana is that it’s not just about libraries, but also includes many archives, museums and galleries. One of my personal disappointments about the Amsterdam meeting was that Mathias Schindler of Wikimedia-Germany had to pull out at the last minute. I’ve never met him, but Mathias has had a lot to do with trying to bring the Wikipedia and Library communities together. Efforts to promote the use of Wikipedia as a platform in the Galleries, Libraries, Archives and Museums (GLAM) sector are intensifying. The pivotal role that Wikipedia has had in the Linked Data community in the form of dbpedia.org is also very significant. Earlier this year there was a meeting of various Wikipedia, dbpedia and Freebase folks at a Data Summit, where people talked about the potential for an inner hub for the various languages wikipedias to share inter-wiki links, and extracted structured metadata. I haven’t heard whether this is actually leading anywhere currently, but at the very least its a recognition that Wikipedia is itself turning into a key part of information infrastructure on the web.

So I’ve rambled on a bit at this point. Thanks for reading this far. My take-away from the Amsterdam meeting was that the DPLA needs to think about how it wants to align itself with the Web, and work with its grain … not against it. This is easier said than done. The DPLA needs to think about incentives that would give existing digital library projects practical reasons to want to be involved. This also is easier said than done. And hopefully these incentives won’t just involve getting grant money. Keeping an open mind, taking a REST here and there, and continuing to have these very useful conversations (and contests) should help shape the DPLA as a generative platform.

meeting notes and a manifesto

Featured

A few weeks ago I was in sunny (mostly) Palo Alto, California at a Linked Data Workshop hosted by Stanford University, and funded by Council on Library and Information Resources. It was an invite-only event, with very little footprint on the public Web, and an attendee list that was distributed via email with “Confidential PLEASE do not re-distribute” across the top. So it feels more than a little bit foolhardy to be writing about it here, even in a limited way. But I was going to the meeting as an employee of the Library of Congress, so I feel a bit responsible for making some kind of public notes/comments about what happened. I suspect this might impact future invitations to similar events, but I can live with that :-)

A week is a long time to talk about anything…and Linked Data is certainly no exception. The conversation was buoyed by the fact that it was a pretty small group of around 30 folks from a wide variety of institutions including: Stanford University, Bibliotheca Alexandrina, Bibliothèque nationale de France, Aalto University, Emory University, Google, National Institute of Informatics, University of Virginia, University of Michigan, Deutschen Nationalbibliothek, British Library, Det Kongelige Bibliotek, California Digital Library and Seme4. So the focus was definitely on cultural heritage institutions, and specifically what Linked Data can offer them. There was a good mix of people in attendance: some who were relatively new to Linked Data and RDF, to others who had been involved in the space before the term Linked Data, RDF or the Web existed…and there were people like me who were somewhere in between.

A few of us took collaborative notes in PiratePad, which sort of tapered off as the week progressed and more discussion happened. After some initial lighting-talk-style presentations from attendees on Linked Data projects they were involved in, we spent most of the rest of the week breaking up into 4 groups, to discuss various issues, and then joining back up with the larger group for discussion. If things go as planned you can expect to see a report from Stanford that covers the opportunities and challenges that Linked Data offers the cultural heritage sector, which were raised during these sessions. I think it’ll be a nice compliment to the report that the W3C Library Linked Data Incubator Group is preparing, which recently became available as a draft.

One thing that has stuck with me a few weeks later, is the continued need in the cultural heritage Linked Data sector for reconciliation services, that help people connect up their resources with appropriate resources that other folks have published. If you work for a large organization, there is often even a need for reconciliation services within the enterprise. For example the British Library reported that it has some 300 distinct data systems within the organization, that sometimes need to be connected together. Linking is the essential ingredient, the sine qua non of Linked Data. Linking is what makes Linked Data and the RDF data model different. It helps you express the work you may have done in joining up your data with other people’s data. It’s the 4th design pattern in Tim Berners-Lee’s Linked Data Design Issues:

Include links to other URIs. so that they can discover more things.

But, expressing the links is the easy part…creating them where they do not currently exist is harder.

Fortunately, Hugh Glaser was on hand to talk about the role that sameas.org plays in the Linked Data space, and how the RKBexplorer managed to reconcile authors names across institutional repositories. He also has described some work with the British Museuem linking up two different data silos about museum objects, to provide a unified web views for those objects. Hugh, if you are reading this, can you comment with a link to this work you did, and how it surfaces in British Museum website?

Similarly Stefano Mazzocchi talked about how Google’s tools like Freebase Suggest and their WebID app can make it easy to integrate Freebase’s identity system into your applications. If you are building a cataloging tool, take a serious look at what using something like Freebase Suggest (a jquery plugin) can offer your application. In addition, as part of the Google Refine data cleanup tool, Google has created an API for data reconciliation services, which other service providers could supply. Stefano indicated that Google was considering releasing the code behind this reconciliation service, and stressed that it is useful for the community to make more of these reconciliation services available, to help others link their data with other peoples data. It seems obvious I guess, but I was interested to hear that Google themselves are encouraging the use of Freebase IDs to join up data within their enterprise.

Almost a year ago Leigh Dodds created a similar API layer for data that is stored in the Talis Platform. Now that the British National Bibliography is being made available in a Talis Store, it might be possible to use Leigh’s code to put a reconciliation service on top of that data. Caveat being that not all the BNB is currently available there. By the way, hats off to the British Library for iteratively making that data available, and getting feedback early, instead of waiting for it all to be “done”…which of course they never will be, if they are successful at integrating Linked Data into their existing data work flows.

If you squint right, I think it’s also possible to look at the VIAF AutoSuggest service as a type reconciliation service. It would be useful to have a similar service over the Library of Congress Subject Headings at id.loc.gov. Having similar APIs for these services could be a handy thing as we begin to build new descriptive cataloging tools that take advantage of these pools of data. But I don’t think it’s necessarily essential, as the APIs could be orchestrated in a more ad hoc, web2.0 mashup style. I imagine I’m not alone in thinking we’re now at the stage when we can start building new cataloging tools that take advantage of these data services. Along those lines Rachel Frick had an excellent idea to try to enhance collection building applications like Omeka and Archives Space to take advantage of reconciliation services under the covers. Adding a bit of suggest-like functionality to these tools could smooth out the description of resources that libraries, museums and archives are putting online. I think the Omeka-LCSH plugin is a good example of steps in this direction.

One other thing that stuck with me from the workshop is that the new (dare I say buzzwordy) focus on Library Linked Data is somewhat frightening to library data professionals. There is a lot of new terminology, and issues to work out (as the Stanford report will hopefully highlight). Some of this scariness is bound up with the Resource Description and Access sea change that is underway. But crufty as they are, data systems built around MARC have served the library community well over the past 30 years. Some of the motivations for Linked Data are specifically for Linked Open Data, where the linking isn’t as important as the openness. The LOD-LAM summit captured some of this spirit in the 4 Star Classification Scheme for Linked Open Cultural Metadata, which focuses on licensing issues. There was a strong undercurrent at the Stanford meeting about licensing issues. The group recognized that explicit licensing is important, but it was intentionally kept out of scope of most of the discussion. Still I think you can expect to see some of the heavy hitters from this group exert some influence in this arena to help bring clarity to licensing issues around our data. I think that some of the ideas of opening up the data, and disrupting existing business workflows around the data can seem quite scary to those who have (against a lot of odds) gotten them working. I’m thinking of the various cooperative cataloging efforts that allow work to get done in libraries today.

Truth be told, I may have inspired some of the “fear” around Linked Data by suggesting that the Stanford group work on a manifesto to rally around, much like what the Agile Manifesto did for the Agile software development movement. I don’t think we had come to enough consensus to really get a manifesto together, but on the last day the sub-group I was in came up with a straw man (near the bottom of the piratepad notes) to throw darts at. Later on (on my own) I kind of wordsmithed them into a briefer list. I’ll conclude this blog post by including the “manifesto” here not as some official manifesto of the workshop (it certainly is not), but more as a personal manifesto, that I’d like to think has been guiding some of the work I have been involved in at the Library of Congress over the past few years:

Manifesto for Linked Libraries

We are uncovering better ways of publishing, sharing and using information by doing it and helping others do it. Through this work we have come to value:

  • Publishing data on the Web for discovery over preserving it in dark archives.
  • Continuous improvement of data over waiting to publish perfect data.
  • Semantically structured data over flat unstructured data.
  • Collaboration over working alone.
  • Web standards over domain-specific standards.
  • Use of open, commonly understood licenses over closed, local licenses.

That is, while there is value in the items on the right, we value the items on the left more.

The manifesto is also on a publicly editable Google Doc; so if you feel the need to add or comment please have a go. I was looking for an alternative to “Linked Libraries” since it was not inclusive of archives and museums … but I didn’t spend much time obsessing on it. One of the selfish motivations for publishing the manifesto here was to capture it a particular point in time where I was happy with it :-)

Spotify, Rdio and Albums of the Year

Featured

I’ve recently started listening to streamed music a bit more on Rdio right around when Spotify launched in the US. I noticed that some albums that I might want to listen to weren’t available for streaming on Rdio, so I got it into my head that I’d like to compare the coverage of Rdio and Spotify–but I wasn’t sure what albums to check. Earlier this week I remembered that since 2007 Alf Eaton has put together Albums of the Year (AOTY) lists, where he has compiled the top albums of the year from a variety of sources. I think Alf’s tastes tend a bit toward independent music, which suits me fine because that’s the kind of stuff I listen to. So …

The AOTY HTML is eminently readable (thanks Alf), so I wrote a scraper to pull down the albums and store it as a JSON file. With this in hand I was able to use the Rdio and Spotify web services to look up the albums, and record whether it was found, and whether it was streamable in the United States, which I saved off as another JSON file. So, of the 7,406 unique albums on AOTY, 60% of them were available in Spotify, and 46% in Rdio 32% of them were available on Spotify, and 46% on Rdio (the strikeout is because of a bug that Alf spotted in the Spotify lookup code). I put the data in a public Fusion Table if you want to look at the results. If you notice anomalies please let me know. And speaking of anomalies…

Caveat Lector!

I was kind of surprised that Teen Dream by Beach House (which was mentioned on 27 AOTY lists in 2010) wasn’t showing up as being streamable on Spotify. So I asked on Twitter and Google+ if people in the US saw it as streamable. The results were kind of surprising. People from Michigan, Illinois, Texas, New York and the District of Columbia confirmed what the Web Service told me, that the album was not streamable. But then there were people in Massachusetts and California who reported it as streamable. What’s more, premium membership didn’t seem to affect the availability: the Massachusetts subscriber had a free account, and the Californian had a premium account, and both could stream it. So take the numbers above with a boulder sized grain of salt. It’s not clear what’s going on.

The spotify search API does not require authentication, and they have nice results that include all territories where the content is available. Rdio’s search API does require authentication, which apparently is used to tie your account to a geographic region, which in turn effects what whether the API will say the album is streamable or not.

So anyway, it was interesting to play around with the APIs a bit. It didn’t seem like the service agreements for the various APIs prevented this sort of exploration. I like the fact that Rdio is web based (go Django), and doesn’t require a proprietary client to use. But it looks like the coverage in Spotify is better. I’m not sure I will make any changes. If anyone has any information about whether streamability of content can vary within the United States I would be interested to hear it. This rights stuff is hard. Given the complexity of managing the rights to this content I’m kind of amazed that Rdio and Spotify exist at all…and I’m very glad that they do.

Update: it turns out that the folks who saw Teen Dream as available had it in their personal collection (Spotify is smart like that), which is why Spotify said it was available to them. So, no crazy state-by-state rights issues need to be entertained :-)

DbHd

Featured

Via Ivan Herman I found out about this 1 year old, excellent presentation from Hans Rosling at the World Bank about the importance of sharing our data.

DbHd is certainly a problem. But it strikes me that, paradoxically, it’s the love and care that people put into their datasets, that makes them so valuable to share.

If nobody was hugging their data, then nobody would care about setting it free on the Web.

an ode to node

Featured

When I made my first edit to Wikipedia a few years ago I can remember watching the recent changes page to see my contribution pop up. I was shocked to see just how quickly my edit was swept up in the torrent of edits that are going on all the time. I think everyone who googles for topical information is familiar with the experience of having Wikipedia articles routinely appear near the top of their search results. In hindsight it should’ve been obvious, but the level of participation in the curation of content at Wikipedia struck me as significant…and somehow different. It was wonderful to see living evidence of so many people caring to collaboratively document our world.

The Obsession

I work as a software developer in the cultural heritage sector, and often find myself building editing environments for users to collaboratively create and edit content. These systems typically get used here and there; but they in no way compare to the sheer volume of edit activity that Wikipedia sees from around the world, every single day. I guess I’d read about crowdsourcing, but had never been provided with a window into it like this before. My wife encourages her 5th grade students to think think critically about Wikipedia as an information source. One way she has done this in the past was by having them author an article for their school, which didn’t have an article previously. I wanted to help her and her students see how they were part of a large community of Wikipedia editors; and to give them a tactile sense of the amount of people who are actively engaged in making Wikipedia better.

A few months later Georgi Kobilarov let me know about the many IRC channels where various bits of metadata about recent changes in Wikipedia are announced. Georgi told me about a bot that the BBC run to track of changes to Wikipedia, so that relevant article content can be pulled back to the BBC. I guess a light bulb turned on. Could I use these channels to show people how much Wikipedia is actively curated, without requiring them to reload the recent changes page, connect to some cryptic IRC channels, or dig around in some (wonderfully) detailed statistics. More importantly, could it be done in a playful way?

The Apps

Some more time passed and I came across some new tools (more about these below) that made it easy to throw together a Web visualization of the Wikipedia update stream. The tools proved to be so much fun that I ended up making two.

wikistream displays the edits to 38 language wikipedias as a river of flowing text. The content moves by so quickly that I had to add a pause button (the letter p) in order to test things like clicking on the update to see the change that was made. The little icons to the left indicate whether the edit was made by a registered Wikipedia user, an anonymous user, or a bot (there are lots of them). After getting some good feedback on the wikitech-l discussion list I added some knobs to limit updates to specific languages and types of user, or size of the edit. I also added a periodically updating background image based on uploads to the Wikimedia Commons.

The second visualization app is called wikipulse. Dario Taraborelli of the Wikimedia Foundation emailed me with the idea to use the same update data stream I used in wikistream to fuel a higher level view of the edit activity using the gauge widget in Google’s Chart API. To the left is one of these gauges which displays the edits per minute on 36 wikipedia properties. If you visit wikipulse you will also see individual gauges for each language wikipedia. It’s a bit overkill seeing all the gauges on the screen, but it’s also kind of fun to see them update automatically every second relative to each other, based on the live edit activity.

The Tools


For both of these apps I needed to log into the wikimedia IRC server, listen on ~30 different channels, push all the updates through some code that helped visualize the data in some way, and then get this data out to the browser. I had heard good things about node for high concurrency network programming from several people. I ran across a node library called socket.io that reported to make it easy to stream updates from the server to the client, in a browser independent way, using a variety of transport protocols. Instinctively it felt like the pub/sub model would also be handy for connecting up the IRC updates with the webapp. I had been wanting to play around with the pub/sub features in redis for some time, and since there is a nice redis library for node I decided to give it a try.

Like many web developers I am used to writing JavaScript for the browser. Tools like jQuery and underscore.js successfully raised the bar to the point that I’m able to write JavaScript and still look myself in the mirror in the morning. But I was still a bit skeptical about JavaScript running on the server side. The thing I didn’t count on was how well node’s event driven model, the library support (socket.io, redis, express), and the functional programming style fit the domain of making the Wikipedia update stream available on the Web.

For example here’s is the code to connect to the ~30 IRC chatrooms stored in the channels variable, and send all the messages to a function processMessage:

1
2
3
4
5
6
var client = new irc({server: 'irc.wikimedia.org', nick: config.ircNick});
 
client.connect(function () {
  client.join(channels);
  client.on('privmsg', processMessage);
});

The processMessage function then parses the IRC message into a JavaScript dictionary and publishes it to a ‘wikipedia’ channel in redis:

1
2
3
4
function processMessage (msg) {
  m = parse_msg(msg.params);
  redis.publish('wikipedia', m);
}

Then over in my wikistream web application I set up socket.io so that when a browser goes to my webapp it negotiates for the best way to get updates from the server. Once a connection is established the server subscribes to the wikipedia channel and sends any updates it receives out to the browser. When the browser disconnects, the connection to redis is closed.

1
2
3
4
5
6
7
8
9
10
11
12
var io = sio.listen(app);
 
io.sockets.on('connection', function(socket) {
  var updates = redis.createClient();
  updates.subscribe('wikipedia');
  updates.on("message", function (channel, message) {
    socket.send(message);
  });
  socket.on('disconnect', function() {
    updates.quit();
  });
});

Each update is represented as a JavaScript dictionary, which socket.io and node’s redis client transparently serialize and deserialize. In order to understand the socket.io protocol a bit more, I wrote a little python script that connects to wikistream.inkdroid.org, negotiates for the xhr-polling transport, and prints out the stream JSON to the console. It’s a demonstration of how a socket.io instance like wikistream can be used as an API for creating a firehose like service. Although I guess the example might’ve been a bit cleaner to negotiate for a websocket instead.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
{
  'anonymous': False,
  'comment': '/* Anatomy */  changed statement that orbit was the eye to saying that the orbit was the eye socket for accuracy',
  'delta': 7,
  'flag': '',
  'namespace': 'article',
  'newPage': False,
  'page': 'Optic nerve',
  'pageUrl': 'http://en.wikipedia.org/wiki/Optic_nerve',
  'robot': False,
  'unpatrolled': False,
  'url': 'http://en.wikipedia.org/w/index.php?diff=449570600&oldid=447889877',
  'user': 'Moearly',
  'userUrl': 'http://en.wikipedia.org/wiki/User:Moearly',
  'wikipedia': '#en.wikipedia',
  'wikipediaLong': 'English Wikipedia',
  'wikipediaShort': 'en',
  'wikipediaUrl': 'http://en.wikipedia.org'
}

This felt so easy, it really made me re-evaluate everything I thought I knew about JavaScript. Plus it all became worth it when Ward Cunningham (the creator of the Wiki idea) wrote on the wiki-research list:

I’ve written this app several times using technology from text-to-speech to quartz-composer. I have to tip my hat to Ed for doing a better job than I ever did and doing it in a way that he makes look effortless. Kudos to Ed for sharing both the page and the software that produces it. You made my morning.

Ward is a personal hero of mine, so making his morning pretty much made my professional career.

I guess this is all a long way of saying what many of you probably already know…the tooling around JavaScript (and especially node) has changed so much, that it really does offer a radically new programming environment, that is worth checking out, especially for network programming. The event driven model that is baked into node, and the fact that v8 runs blisteringly fast, make it possible to write apps that do a whole lot in one low memory process. This is handy when deploying an app to an EC2 mini instance or Heroku, which is where wikipulse is running…for free.

Of course it helped that my wife and kids got a kick out of wikistream and wikipulse. I suspect that they think I’m a bit obsessed with Wikipedia, but that’s ok … because I kinda am.

the digital repository marketplace

Featured

The University of Southern California recently announced its Digital Repository (USCDR) which is a joint venture between the Shoah Foundation Institute and the University of Southern California. The site is quite an impressive brochure that describes the various services that their digital preservation system provides. But a few things struck me as odd. I was definitely pleased to see a prominent description of access services centered on the Web:

The USCDR can provide global access to digital collections through an expertly managed, cloud-computing environment. With its own content distribution network (CDN), the repository can make a digital collection available around the world, securely, rapidly, and reliably. The USCDR’s CDN is an efficient, high-performance alternative to leading commercial content distribution networks. The USCDR’s network consists of a system of disk arrays that are strategically located around the world. Each site allows customers to upload materials and provides users with high-speed access to the collection. The network supports efficient content downloads and real-time, on-demand streaming. The repository can also arrange content delivery through commercial CDNs that specialize in video and rich media.

But from this description it seems clear that the USCDR is creating their own content delivery network, despite the fact that there is already a good marketplace for these services. I would have thought it would be more efficient for the USCDR to provide plugins for the various CDNs rather than go through the effort (and cost) of building out one themselves. Digital repositories are just a drop in the ocean of Web publishers that need fast and cheap delivery networks for their content. Does the USCDR really think they are going to be able to compete and innovate in this marketplace? I’d also be kind of curious to see what public websites there are right now that are built on top of the USCDR.

Secondly, in the section on Cataloging this segment jumped out at me:

The USC Digital Repository (USCDR) offers cost-effective cataloging services for large digital collections by applying a sophisticated system that tags groups of related items, making them easier to find and retrieve. The repository can convert archives of all types to indexed, searchable digital collections. The repository team then creates and manages searchable indices that are customized to reflect the particular nature of a collection.

The USCDR’s cataloging system employs patented software created by the USC Shoah Foundation Institute (SFI) that lets the customers define the basic elements of their collections, as well as the relationships among those elements. The repository’s control standards for metadata verify that users obtain consistent and accurate search results. The repository also supports the use of any standard thesaurus or classification system, as well as the use of customized systems for special collections.

I’m certainly not a patent expert, but doesn’t it seem ill advised to build a digital preservation system around a patented technology? Sure, most of our running systems use possibly thousands of patented technologies, but ordinarily we are insulated from them by standards like POSIX, HTTP, or TCP/IP that allow us to swap out various technologies for other ones. If the particular technique to cataloging built into the USCDR is protected by a patent for 20 years, won’t that limit the dissemination of the technique into other digital preservation systems, and ultimately undermine the ability of people to move their content in and out of digital preservation systems as they become available–what Greg Janée calls relay supporting archives. I guess without more details of the patented technology it’s hard to say, but I would be worried about it.

After working in this repository space for a few years I guess I’ve become pretty jaded about turnkey digital repository systems that say they do it all. Not that it’s impossible, but it always seems like a risky leap for an organization to take. I guess I’m also a software developer, which adds quite a bit of bias. But on the other hand it’s great to see a repository systems that are beginning to address the basic concerns raised by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which identified the need for building sustainable models for digital preservation. The California Digital Library is doing something similar with its UC3 Merritt system, which offers fee based curation services to the University of California (which USC is not part of).

Incidentally the service costs of USCDR and Merritt are quite difficult to compare. Merritt’s Excel Calculator says their cost is $1040 per TB per year (which is pretty straightforward, but doesn’t seem to account for the degree to which the data is accessed). The USCDR is listed as $70/TB per month for Disk-based File-Server Access, and $1000/TB for 20 years for Preservation Services. That would seem indicate the raw storage is a bit less than Merritt at $840.00 per TB per year. But what the preservation services are, and how the 20 year cost would be applied over a growing collection of content seems unclear to me. Perhaps I’m misinterpreting disk-based file-server access, which might actually refer to terabytes of data sent outside their USCDR CDN. In that case the $70/TB measures up quite nicely with a recent quote from Amazon S3 at $120.51 per terabyte transferred out per month. But again, does USCDR really think it can compete in the cloud storage space?

Based on the current pricing models, where there are no access driven costs, the USCDR and Merritt might find a lot of clients outside of the traditional digital repository ecosystem (I’m thinking online marketing or pornography) that have images they would like to serve at high volume for no cost other than the disk storage. That was my bad idea of a joke, if you couldn’t tell. But seriously I sometimes worry that digital repository systems are oriented around the functionality of a dark archive, where lots of data goes in, and not much data comes back out for access.

day of digital archives psa

Featured

Today is Day of Digital Archives day and I had this semi-thoughtful post written up about BagIt and how it’s a brain dead simple format to use to package up your files so that you’ll know if you still have them 5 minutes, 5 hours, 5 days, 5 years, maybe even 5 decades from now–if the notion of directories and files persists that long.

But I deleted that…you’re welcome…

I was also going to write about how in a fit of web performance art Mark Pilgrim recently deleted his online presence, including various extremely useful opensource tools, and several popular online books, only to see them re-materialize on the Web at new locations.

But I deleted most of that too…you’re welcome again!

Here’s a public service announcement instead. If you happen to use Franco Lazzarino’s Ruby BagIt Library to create bags that contains largish files (> 500MB), you might have accidentally created bad SHA1 manifests. I added a test, and fixed the bug with help from Mark Matienzo and Michael Klein, and sent a pull request. It hasn’t been applied yet, so here’s to hoping it will.

At $mpow we’ve been getting terabytes of data from this social media company that has been bagging their data using this Ruby library. Many of the files are multi-gigabytes gzip compressed. And many of the bags now have bad SHA1 manifests. The social media company wasn’t sure what the problem was, and told us just to ignore the SHA1 manifests. Which is easy enough to do.

It seems like no matter how simple the spec, it’s easy to create bugs. If you create bags, throw Bag-Software-Agent into your bag-info.txt…you never know who might find it useful.

thanks Wikipedia

Featured

Support Wikipedia
The field of software development, the Web and libraries is changing so fast that there is no way to know everything I need to know to do my job well. Wikipedia continues to be an essential resource for learning about technologies, algorithms, people, and history related to my work. It’s hard to imagine what the world would be like without it. Thanks for another year of awesome Wikipedia! The check is in the mail; well OK it’s actually coming from PayPal … you know the drill.

fascinating, hypnotic, inspirational, appalling, irrelevant

Featured

Thanks to everyone that noticed the Wikistream coverage in the NextWeb article and elsewhere. If you happen to have tweeted about Wikistream in the last 2 days you should see your avatar to the left. Click on it to make it bigger. I’m in there somewhere too :-)

Before Sunday the site hadn’t seen more than 180 unique visitors per day, and on Monday it saw almost 30,000. The site is kind of different since it streams all the Wikipedia updates to the browser as JSON, where it is then displayed. I had some nail-biting moments as I watched Node frequently streaming up to 300 concurrent connections. It was a wild ride for my little Linode VPS with 512MB of RAM, where WordPress and a Django website were also running…but it seemed to weather the storm OK. Mostly I think I could have used more RAM during peak usage when Node and Redis were wanting enough memory to cause the system to swap. Thanks to Gabe and Chris for helping me get the cache headers set right in Express.

I thought briefly about upgrading to a larger Linode instance, putting the app on EC2, or maybe asking Wikimedia if they wanted to host it. But Wikistream is really more a piece of performance art than it is a useful website. I’m expecting people that have looked at Wikistream once will have seen how much Wikipedia is actively edited, and not feel compelled to look at it again. After a few days I expect the usage to plummet and it can go back to running comfortably on my little Linode VPS to serve as a live prop in presentations about Wikipedia, crowd-sourcing, Web culture, etc.

One of my favorite mentions of Wikistream came from Nat Torkington’s Four Short Links on O’Reilly Radar. Nat described Wikistream as

fascinating and hypnotic and inspirational and appalling and irrelevant all at once

I took this as high-praise of course. I could only get the last two days out of Twitter’s search API, which misses the day that the NextWeb article appeared, followed by it getting picked up on Hacker News. But it was 226 tweets, and provided for a fun little data set to look at. I wrote a little script to look for URLs in the tweets, unshorten them and come up with a list of web pages that mentioned Wikistream in the past few days. One thing that was really interesting to me was the predominance of non-English websites. Here’s a list of some of them if you are interested.

OK, I’m finished with the narcissistic navel gazing for a bit. But seriously, thanks for the attention. I’ve never experienced anything like that before. Wow.