lots of copies keeps epubs safe

Over the weekend you probably saw the announcements going around about Google Books releasing over 1 million public domain ebooks on the web as epubs. This is great news: epub is a web-friendly, open format – and having all this content available as epub is important.

Now I might be greedy, but when I saw that 1 million epubs are available my mind immediately jumped to getting them, indexing them, and whatnot. Then I guiltily justified my greedy thoughts by pondering the conventional digital preservation wisdom that Lots of Copies Keeps Stuff Safe (LOCKSS). The books are in the public domain, so … why not?

Google Books has a really nice API, which lets you get back search results as Atom, with lots of links to things like thumbnails, annotations, item views, etc. You also get a nice amount of Dublin Core metadata. And you can limit your search to books published before 1923. For example here’s a search for pre-1923 books that mention “Stevenson” (disclaimer: I don’t think the 1923 limit is actually working):

curl 'http://books.google.com/books/feeds/volumes?tbs=cd_max:Jan%2001_2%201923&q=Stevenson' | xmllint --format -

which yields:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:dc="http://purl.org/dc/terms">
  <id>http://www.google.com/books/feeds/volumes</id>
  <updated>2010-08-30T20:37:27.000Z</updated>
  <title type="text">Search results for Stevenson</title>
  <!-- category, link and gbs:* elements omitted here and in each entry -->
  <author>
    <name>Google Books Search</name>
    <uri>http://www.google.com</uri>
  </author>
  <generator version="beta">Google Book Search data API</generator>
  <openSearch:totalResults>206</openSearch:totalResults>
  <openSearch:startIndex>1</openSearch:startIndex>
  <openSearch:itemsPerPage>10</openSearch:itemsPerPage>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/ENMWAAAAYAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">Kidnapped</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:date>1909</dc:date>
    <dc:format>308 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>ENMWAAAAYAAJ</dc:identifier>
    <dc:identifier>HARVARD:HN1JZ9</dc:identifier>
    <dc:title>Kidnapped</dc:title>
    <dc:title>being memoirs of the adventures of David Balfour in the year 1851 ...</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/WZ0vAAAAMAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">Treasure Island</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:creator>George Edmund Varian</dc:creator>
    <dc:date>1918</dc:date>
    <dc:description>CHAPTER I THE OLD SEA DOG AT THE &quot;ADMIRAL BENBOW&quot; SQUIRE Trelawney, Dr. Livesey, and the rest of these gentlemen having asked me to write down the whole ...</dc:description>
    <dc:format>306 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>WZ0vAAAAMAAJ</dc:identifier>
    <dc:identifier>NYPL:33433075793830</dc:identifier>
    <dc:subject>Fiction</dc:subject>
    <dc:title>Treasure Island</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/REUrAQAAIAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">Stevenson</title>
    <dc:creator>Adlai Ewing Stevenson</dc:creator>
    <dc:creator>Grace Darling</dc:creator>
    <dc:creator>David Darling</dc:creator>
    <dc:date>1977-10</dc:date>
    <dc:format>127 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>REUrAQAAIAAJ</dc:identifier>
    <dc:identifier>STANFORD:36105037014342</dc:identifier>
    <dc:publisher>McGraw-Hill/Contemporary</dc:publisher>
    <dc:subject>Biography &amp; Autobiography</dc:subject>
    <dc:title>Stevenson</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/3ibdGgAACAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">Stevenson</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:date>2007-01-17</dc:date>
    <dc:description>This scarce antiquarian book is included in our special Legacy Reprint Series.</dc:description>
    <dc:format>128 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>3ibdGgAACAAJ</dc:identifier>
    <dc:identifier>ISBN:1430495375</dc:identifier>
    <dc:identifier>ISBN:9781430495376</dc:identifier>
    <dc:publisher>Kessinger Pub Co</dc:publisher>
    <dc:subject>Poetry</dc:subject>
    <dc:title>Stevenson</dc:title>
    <dc:title>Day by Day</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/3QI-AAAAYAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">A child's garden of verses</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:date>1914</dc:date>
    <dc:description>IN winter I get up at night And dress by yellow candle-light. In summer, quite the other way, I have to go to bed by day. I have to go to bed and see The ...</dc:description>
    <dc:format>136 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>3QI-AAAAYAAJ</dc:identifier>
    <dc:identifier>CORNELL:31924052752262</dc:identifier>
    <dc:subject>Children's poetry, Scottish</dc:subject>
    <dc:title>A child's garden of verses</dc:title>
    <dc:title>by Robert Louis Stevenson; illustrated by Charles Robinson</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/Gmk-AAAAYAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">Travels with a donkey in the Cevennes</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:date>1916</dc:date>
    <dc:description>THE DONKEY, THE PACK, AND THE PACK - SADDLE IN a little place called Le Monastier, in a pleasant highland valley fifteen miles from Le Puy, I spent about a ...</dc:description>
    <dc:format>287 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>Gmk-AAAAYAAJ</dc:identifier>
    <dc:identifier>HARVARD:HWP541</dc:identifier>
    <dc:subject>Cévennes Mountains (France)</dc:subject>
    <dc:title>Travels with a donkey in the Cevennes</dc:title>
    <dc:title>An inland voyage</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/f3A-AAAAYAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">St. Ives</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:date>1906</dc:date>
    <dc:description>IVES CHAPTER IA TALE OF A LION RAMPANT IT was in the month of May,, that I was so unlucky as to fall at last into the hands of the enemy. ...</dc:description>
    <dc:format>528 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>f3A-AAAAYAAJ</dc:identifier>
    <dc:identifier>HARVARD:HWP61W</dc:identifier>
    <dc:title>St. Ives</dc:title>
    <dc:title>being the adventures of a French prisoner in England</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/4mb8LuKKwocC</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">Cruising with Robert Louis Stevenson</title>
    <dc:creator>Oliver S. Buckton</dc:creator>
    <dc:date>2007</dc:date>
    <dc:description>Cruising with Robert Louis Stevenson: Travel, Narrative, and the Colonial Body is the first book-length study about the influence of travel on Robert Louis ...</dc:description>
    <dc:format>344 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>4mb8LuKKwocC</dc:identifier>
    <dc:identifier>ISBN:0821417568</dc:identifier>
    <dc:identifier>ISBN:9780821417560</dc:identifier>
    <dc:publisher>Ohio Univ Pr</dc:publisher>
    <dc:subject>Literary Criticism</dc:subject>
    <dc:title>Cruising with Robert Louis Stevenson</dc:title>
    <dc:title>travel, narrative, and the colonial body</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/4yo9AAAAYAAJ</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">New Arabian nights</title>
    <dc:creator>Robert Louis Stevenson</dc:creator>
    <dc:date>1922</dc:date>
    <dc:description>THE SUICIDE CLUB STORY OF THE YOUNG MAN WITH THE CREAM TARTS DURING his residence in London, the accomplished Prince Florizel of Bohemia gained the ...</dc:description>
    <dc:format>386 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>4yo9AAAAYAAJ</dc:identifier>
    <dc:identifier>HARVARD:HWP51H</dc:identifier>
    <dc:subject>Fiction</dc:subject>
    <dc:title>New Arabian nights</dc:title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/z2Yf1FX02EkC</id>
    <updated>2010-08-30T20:37:27.000Z</updated>
    <title type="text">Robert Louis Stevenson</title>
    <dc:creator>Richard Ambrosini</dc:creator>
    <dc:creator>Richard Dury</dc:creator>
    <dc:date>2006</dc:date>
    <dc:description>As the editors point out in their Introduction, Stevenson reinvented the “personal essay” and the “walking tour essay,” in texts of ironic stylistic ...</dc:description>
    <dc:format>377 pages</dc:format>
    <dc:format>book</dc:format>
    <dc:identifier>z2Yf1FX02EkC</dc:identifier>
    <dc:identifier>ISBN:0299212246</dc:identifier>
    <dc:identifier>ISBN:9780299212247</dc:identifier>
    <dc:publisher>Univ of Wisconsin Pr</dc:publisher>
    <dc:subject>Literary Criticism</dc:subject>
    <dc:title>Robert Louis Stevenson</dc:title>
    <dc:title>writer of boundaries</dc:title>
  </entry>
</feed>
Now it would be nice if the Atom included <link> elements for the epubs themselves. Perhaps the feed could even use the recently released “acquisition” link relation defined by OPDS v1.0. For example, by including something like the following in each atom:entry element:
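<!-- hypothetical: the OPDS 1.0 acquisition link relation, with a guessed-at epub download URL -->
<link rel="http://opds-spec.org/acquisition"
      type="application/epub+zip"
      href="http://books.google.com/books/download/?id=ENMWAAAAYAAJ&amp;output=epub"/>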


Theoretically it should be possible to construct the appropriate link for an epub based on the data that is available in the Atom. But making the epub URLs explicitly available in a programmatic way would enable much more use of them. Unfortunately we would still be limited to dipping into the full dataset with a query, instead of being able to crawl the entire archive with something like a paged Atom feed. From a conversation over on get-theinfo it appears that this approach might not be as easy as it sounds. Also, it turns out that, magically, many of the books have already been uploaded to the Internet Archive: 902,188 of them in fact.
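Here's the sort of thing I mean, as a rough sketch: walk the Atom search results and derive a candidate epub URL for each volume. The download URL pattern here is only a guess, and would need to be checked against what Google actually serves:

import urllib
from xml.etree import ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
DC = '{http://purl.org/dc/terms}'

url = 'http://books.google.com/books/feeds/volumes?q=' + urllib.quote('Stevenson')
feed = ET.parse(urllib.urlopen(url))

for entry in feed.findall(ATOM + 'entry'):
    # the first dc:identifier is the Google Books volume id
    volume_id = entry.find(DC + 'identifier').text
    # guessed-at epub location, derived from the volume id
    epub_url = 'http://books.google.com/books/download/?id=%s&output=epub' % volume_id
    print volume_id, epub_url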

So maybe not that much work needs to be done. But presumably more public domain content will become available from Google Books, and it would be nice to be able to say there was at least one other copy of it elsewhere, for digital preservation purposes. It would be great to see Google step up and do some good by making their API usable for folks wanting to replicate the public domain content. Still, at least they haven’t done evil by locking it away completely. Dan Brickley had an interesting suggestion to possibly collaborate on this work.


simplicity and digital preservation, sorta

Over on the Digital Curation discussion list Erik Hetzner of the California Digital Library raised the topic of simplicity as it relates to digital preservation, and specifically to CDL’s notion of Curation Microservices. He referenced a recent bit of writing by Martin Odersky (the creator of Scala) with the title Simple or Complicated. In one of the responses Brian Tingle (also of CDL) suggested that simplicity for an end user and simplicity for the programmer are often inversely related. My friend Kevin Clarke prodded me in #code4lib into making my response to the discussion list into a blog post, so here it is (slightly edited).

For me, the Odersky piece is a really nice essay on why simplicity is often in the eye of the beholder. Often the key to simplicity is working with people who see things in roughly the same way: people who have similar needs that are met by particular approaches and tools. Basically, it takes a shared and healthy culture to make emergent complexity palatable.

Brian made the point about simplicity for programmers having an inversely proportional relationship to simplicity for end users, or in his own words:

I think that the simpler we make it for the programmers, usually the more complicated it becomes for the end users, and visa versa.

I think the only thing to keep in mind is that the distinction between programmers and end users isn’t always clear.

As a software developer I’m constantly using, or inheriting someone else’s code: be it a third party library that I have a dependency on, or a piece of software that somebody wrote once upon a time, who has moved on elsewhere. In both these cases I’m effectively an end-user of a program that somebody else designed and implemented. The interfaces and abstractions that this software developer has chosen are the things I (as an end user) need to be able to understand and work with. Ultimately, I think that it’s easier to keep software usable for end users (of whatever flavor) by keeping the software design itself simple.

Simplicity makes the software easier to refactor over time when the inevitable happens, and someone wants some new or altered behavior. Simplicity also should make it clear when a suggested change to a piece of software doesn’t fit the design of the software in question, and is best done elsewhere. One of the best rules of thumb I’ve encountered over the years to help get to this place is the Unix Philosophy:

Write programs that do one thing and do it well. Write programs to work together.

As has been noted elsewhere, composability is one of the guiding principles of the Microservices approach–and it’s why I’m a big fan (in principle). Another aspect to the Unix philosophy that Microservices seems to embody is:

Data dominates.

The software can (and will) come and go, but we are left with the data. That’s the reality of digital preservation. It could be argued that the programs themselves are data, which gets us into sci-fi virtualization scenarios. Maybe someday, but I personally don’t think we’re there yet.

Another approach I’ve found that works well to help ensure code simplicity has been unit testing. Admittedly it’s a bit of a religion, but at the end of the day, writing tests for your code encourages you to use the APIs, interfaces and abstractions that you are creating. So you notice sooner when things don’t make sense. And of course, they let you refactor with a safety net, when the inevitable changes rear their head.
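To pick a deliberately tiny, made-up example (not from any project in particular): even a test this small forces you to step outside your implementation and use the interface you are asking everyone else to live with.

import unittest
from urlparse import urlparse

def hostname(url):
    "Return the hostname for a URL, or None if there isn't one."
    return urlparse(url).netloc or None

class HostnameTests(unittest.TestCase):

    def test_hostname(self):
        self.assertEqual(hostname('http://example.org/some/page'), 'example.org')

    def test_not_a_url(self):
        self.assertEqual(hostname('clearly not a url'), None)

if __name__ == '__main__':
    unittest.main()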

And, another slightly more humorous way to help ensure simplicity:

Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.

Which leads me to a jedi mind trick my former colleague Keyser Söze Andy Boyko tried to teach me (I think): it’s useful to know when you don’t have to write any code at all. Sometimes existing code can be used in a new context. And sometimes the perceived problem can be recast, or examined from a new perspective that makes the problem go away. I’m not sure what all this has to do with digital preservation. The great thing about what CDL is doing with microservices is they are trying to focus on the what, and not the how of digital preservation. Whatever ends up happening with the implementation of Merritt itself, I think they are discovering what the useful patterns of digital preservation are, trying them out, and documenting them…and it’s incredibly important work that I don’t really see happening much elsewhere.


bad xml smells

I’m used to refactoring code smells, but sometimes you can catch a bad whiff in XML too.

Before:

<?xml version="1.0" encoding="UTF-8"?>
[A METS document of roughly seventy lines. Its element markup has not survived here; what it carried was an agent/creator name ("Library of Congress"), the newspaper's LCCN ("sn82015056"), and an embedded XHTML block with the same title and essay text that appear in the "After" example below.]
After:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:dcterms="http://purl.org/dc/terms/">
  <head>
    <title property="dcterms:title">The National Forum (Washington, DC), 1910-19??</title>
    <meta property="dcterms:creator" content="Library of Congress" />
    <meta property="dcterms:created" content="2007-01-10T09:00:00" />
    <meta property="dcterms:subject" content="http://chroniclingamerica.loc.gov/lccn/sn82015056#title" />
  </head>
  <body>
    <div property="dcterms:description">
      <p>The first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women's news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city's black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper's offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.</p>
    </div>
  </body>
</html>

I basically took a complicated METS wrapper around some XHTML, which was really just expressing metadata about the HTML, and refactored it as XHTML. Not that METS is a bad XML smell generally, but in this particular case it was overkill. If you look closely you’ll see I’m using RDFa, similar to what Facebook are doing with their Open Graph protocol. There’s less to get wrong; what’s there should look more familiar to web developers who aren’t versed in arcane library standards; and I can now read the metadata from the XHTML with an RDFa-aware parser like Python’s rdflib:

>>> import rdflib
>>> g = rdflib.Graph()
>>> g.parse('essays/1.html', format='rdfa')
>>> for triple in g: print triple
... 
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/creator'), rdflib.term.Literal(u'Library of Congress'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/title'), rdflib.term.Literal(u'The National Forum (Washington, DC), 1910-19??'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/description'), rdflib.term.Literal(u'\nThe first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women\'s news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city\'s black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper\'s offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.\n', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral')))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/subject'), rdflib.term.Literal(u'http://chroniclingamerica.loc.gov/lccn/sn82015056#title'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/created'), rdflib.term.Literal(u'2007-01-10T09:00:00'))


top hosts referenced in wikipedia (part 2)

Jodi Schneider pointed out to me in an email that my previous post about the top 100 hosts referenced in wikipedia may have been slightly off balance since it counted all pages on wikipedia (talk pages, files, etc), and was not limited to only links in articles. The indicator for her was the high ranking of www.google.com, which seemed odd to her in the article space.

So I downloaded the enwiki-latest-page.sql.gz dump, loaded it in, and then joined on it in my query to come up with a new list. Jodi was right: the new list is a lot more interesting.
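The query itself is simple enough. Here's a sketch of the kind of thing I mean (not the exact code I ran; the table and column names come from the stock MediaWiki schema, where page_namespace = 0 limits things to articles):

import MySQLdb
import MySQLdb.cursors
from urlparse import urlparse
from collections import defaultdict

db = MySQLdb.connect(db='enwiki', use_unicode=True, charset='utf8')
cursor = db.cursor(MySQLdb.cursors.SSCursor)   # stream rows instead of buffering them all
cursor.execute("""
    SELECT el.el_to
    FROM externallinks AS el
    JOIN page AS p ON p.page_id = el.el_from
    WHERE p.page_namespace = 0
    """)

counts = defaultdict(int)
for (url,) in cursor:
    counts[urlparse(url).netloc] += 1

for host, n in sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:100]:
    print n, host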

This removed a lot of the interwiki links between the English wikipedia and other language wikipedias (which would be interesting to look at in their own right). It also removed administrative links to things like www.dnsstuff.com. Also interesting is that it removed www.facebook.com from the list, which was probably linked to mostly from user profile pages. The neat thing is that it introduced new sites into the top 100, like the following:

adsabs.harvard.edu
bioguide.congress.gov
cfa-www.harvard.edu
eclipse.gsfc.nasa.gov
openjurist.org
select.nytimes.com
ssd.jpl.nasa.gov
worldcat.org
www1.arbitron.com
www.animenewsnetwork.com
www.cbc.ca
www.cricinfo.com
www.cricketarchive.com
www.discogs.com
www.expasy.org
www.fifa.com
www.gutenberg.org
www.history.navy.mil
www.hockeydb.com
www.imagesofengland.org.uk
www.independent.co.uk
www.jstor.org
www.leighrayment.com
www.mtv.com
www.nfl.com
www.nhm.ac.uk
www.nps.gov
www.racingpost.com
www.radio-locator.com
www.reuters.com
www.rollingstone.com
www.rsssf.com
www.soccerbase.com
www.usatoday.com
www.variety.com

We can see a lot more pop-culture media present: newspapers, magazines, sporting information. We can also see research-oriented websites like worldcat.org, ssd.jpl.nasa.gov and adsabs.harvard.edu make it into the top 100.

I work for the US federal government so I was interested to look at what .gov domains were in the top 100:

hostname                    links
www.ncbi.nlm.nih.gov       419816
www.pubmedcentral.nih.gov   62134
geonames.usgs.gov           57423
factfinder.census.gov       48530
www.census.gov              33018
www.nr.nps.gov              25962
www.fcc.gov                 25941
ssd.jpl.nasa.gov            20178
eclipse.gsfc.nasa.gov       20063
bioguide.congress.gov       18880
www.nlm.nih.gov             15115
www.nps.gov                 12196

Which points to the importance of federal biomedical, geospatial, scientific, demographic and biographical information to wikipedians. It would be interesting to take a look at higher education institutions at some point. Doing these one-off reports is giving me some ideas about what linkypedia could turn into. Thanks Jodi.


notes on retooling libraries

If you work in the digital preservation field and haven’t seen Dorothea Salo’s Retooling Libraries for the Data Challenge in the latest issue of Ariadne, definitely give it a read. Dorothea takes an unflinching look at the scope and characteristics of data assets currently being generated by scholarly research, and how equipped traditional digital library efforts are to deal with it. I haven’t seen so many of the issues I’ve had to deal with (largely unconsciously) as part of my daily work so neatly summarized before. Having them laid out in such a lucid, insightful and succinct way is really refreshing–and inspiring.

In the section on Taylorist Production Processes, Dorothea makes a really good point that libraries have tended to optimize workflows for particular types of materials (e.g. photographs, books, journal articles, maps). When materials require a great deal of specialized knowledge to deal with, and the tools that are used to manage and provide access to the content are similarly specialized, it’s not hard to understand why this balkanization has happened. On occasion I’ve heard folks (/me points finger at self) bemoan the launch of a new website as the creation of “yet another silo”. But the technical silos that we perceive are largely artifacts of the social silos we work in: archives, special collections, rare books, maps, etc. The collections we break up our libraries into…the projects that mirror those collections. We need to work better together before we can build common digital preservation tools. To paraphrase something David Brunton has said to me before: we need to think of our collections more as sets of things that can be rearranged at will, with ease and even impunity. In fact the architecture of the Web (and each website on it) is all about doing that.

Even though it can be tough (particularly in large organizations) I think we can in fact achieve some levels of common tooling (in areas like storage and auditing); but we must admit (to ourselves at least) that some levels of access will likely always be specialized in terms of technical infrastructure and user interface requirements:

Some, though not all, data can be shoehorned into a digital library not optimised for them, but only at the cost of the affordances surrounding them. Consider data over which a complex Web-based interaction environment has been built. The data can be removed from the environment for preservation, but only at the cost of loss of the specialised interactions that make the data valuable to begin with. If the dataset can be browsed via the Web interface, a static Web snapshot becomes possible, but it too will lack sophisticated interaction. If the digital library takes on the not inconsiderable job of recreating the entire environment, it is committing to rewriting interaction code over and over again indefinitely as computing environments change.

Dorothea’s statement about committing to rewriting interaction code over and over again is important. I’m a software developer, and a web developer to boot – so there’s nothing I like more than yanking the data out of one encrusted old app, and creating it afresh using the web-framework-du-jour. But in my heart of hearts I know that while this may work for large collections of homogeneous data, it doesn’t scale very well for a vast sea of heterogeneous data. However, all is not lost. As the JISC are fond of saying:

The coolest thing to do with your data will be thought of by someone else.

So why don’t we data archivers get out of the business of building the “interaction code”? Maybe our primary service should be to act as data wholesalers who collect the data and make it available in bulk to those who do want to build access layers on top of it. Let’s make our data easy for other people to use (with clear licensing) and reference (with web identifiers) so that they can annotate it, and we can pull back those annotations and views. In a way this is kind of hearkening back to the idea of Data Providers and Service Providers that was talked about a lot in the context of OAI-PMH. But in this case we’d be making the objects available as well as the metadata that describes them, similar to the use cases around OAI-ORE. I got a chance to chat with Kate Zwaard of the GPO at CurateCamp a few weeks ago, and learned how the new Federal Register is a presentation application for raw XML data being made available by the GPO. Part of the challenge is making these flows of data public, and giving credit where credit is due – not only to the creators of the shiny site you see, but to the folks behind the scenes who make it possible.

Another part of Dorothea’s essay that stuck out a bit for me was the advice to split ingest, storage and access systems.

Ingest, storage, and end-user interfaces should be as loosely coupled as possible. Ideally, the same storage pool should be available to as many ingest mechanisms as researchers and their technology staff can dream up, and the items within should be usable within as many reuse, remix, and re-evaluation environments as the Web can produce.

This is something we (myself and other folks at LC) did as part of the tooling to support the National Digital Newspaper Program. Our initial stab at the software architecture was to use Fedora to manage the full life cycle (from ingest, to storage, to access) of the newspaper content we receive from program awardees around the US. The only trouble was that we wanted the access system to support heavy use by researchers and also robots (Google, Yahoo, Microsoft, etc) building their own views on the content. Unfortunately the way we had put the pieces together we couldn’t support that. Increasingly we found ourselves working around Fedora as much as possible to squeeze a bit more performance out of the system.

So in the end we (and by we I mean David) decided to bite the bullet and split off the inventory systems keeping track of where received content lives (what storage systems, etc) from the access systems that delivered content on the Web. Ultimately this meant we could leverage industry proven web development tools to deliver the newspaper content…which was a huge win. Now that’s not saying that Fedora can’t be used to provide access to content. I think the problems we experienced may well have been the result of our use of Fedora, rather than Fedora itself. Having to do multiple, large XSLT transforms to source XML files to render a page is painful. While it’s a bit of a truism, a good software developer tries to pick the right tool for the job. Half the battle there is deciding on the right granularity for the job … the single job we were trying to solve with Fedora (preservation and access) was too big for us to do either right.

Having a system that’s decomposable, like the approach that CDL is taking with Microservices is essential for long-term thinking about software in the context of digital preservation. I guess you could say “there’s no-there-there” with Microservices, since there’s not really a system to download–but in a way that’s kind of the point.

I guess this is just a long way of saying, Thanks Dorothea! :-)


top hosts referenced in english wikipedia

I’ve recently been experimenting a bit to provide some tools to allow libraries, archives and museums to see how Wikipedians are using their content as primary source material. I didn’t actually anticipate the interest in having a specialized tool like linkypedia to monitor who is using your institution’s content on Wikipedia. So the demo site is having some scaling problems–not the least of which is the feeble VM that it is running on. That’s why I wanted to make the code available for other people to run where it made sense, at least until I have had some time to think through how to scale it better.

Anyhow, I wanted to get a handle on just how many external links there are in the full snapshot of English Wikipedia. A month or so ago Jakob Voss pointed me at the External Links SQL dump over at Wikipedia as a possible way to circumvent heavy use of Wikipedia’s API, by providing a baseline to update against. So I thought to myself that I could just suck this down, import it into MySQL, and run some analysis on that to see how many links there were and what sorts of host name concentrations showed up.

Sucking down the file didn’t take too long. But the MySQL import of the dump had run for about 24 hours (on my laptop) before I killed it. On a hunch I peeked into the 4.5G SQL file and noticed that the table had several indexes defined. So I went through some contortions with csplit to remove the indexes from the DDL, and lo and behold it loaded in something like 20 minutes. Then I wrote some Python to query the database, get each external link URL, extract the hostname from the URL, and write it out through a unix pipeline to count up the unique hostnames:

./hosts.py | sort -S 1g | uniq -c | sort -rn > enwiki-externallinks-hostnames.txt
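The hosts.py end of things doesn't need to be much more than this (a sketch, assuming the externallinks table has been loaded into a local MySQL database called enwiki):

#!/usr/bin/env python
import MySQLdb
import MySQLdb.cursors
from urlparse import urlparse

db = MySQLdb.connect(db='enwiki', use_unicode=True, charset='utf8')
cursor = db.cursor(MySQLdb.cursors.SSCursor)  # stream rows, don't buffer 30 million of them
cursor.execute('SELECT el_to FROM externallinks')

for (url,) in cursor:
    host = urlparse(url).netloc
    if host:
        print host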

The sort/uniq pipeline is a little unix trick my old boss Fred Lindberg taught me years ago, and it still works remarkably well: 30,127,734 URLs were sorted into 2,162,790 unique hostnames in another 20 minutes or so. If you are curious the full output is available here. The number 1 host was toolserver.org with 3,169,993 links. This wasn’t too surprising since it is a hostname heavily used by wikipedians as they go about their business. Next was www.google.com at 2,117,967 links, many of which appeared to be canned searches. This wasn’t terribly exciting either. So I removed toolserver.org and www.google.com (so as not to visually skew things too much), and charted the rest of the top 100:

Top 100-2 Hostnames in Wikipedia External Links

I figured that could be of some interest to somebody, sometime. I didn’t find similar current stats available anywhere on the web. But if you know of them please let me know. The high ranking of www.ncbi.nlm.nih.gov and dx.doi.org were pleasant surprises. I did a little superficial digging and found some fascinating bots like Citation Bot and ProteinBoxBot, which seem to trawl external article databases looking for appropriate Wikipedia pages to add links to. Kind of amazing.


version control and digital curation

For some time now I have been meaning to write about some of the issues related to version control in repositories as they relate to some projects going on at $work. Most repository systems have a requirement to maintain original data as submitted. But as we all know this content often changes over time–sometimes immediately. Change is in the very nature of digital preservation, as archaic formats are migrated to fresher, more usable ones and the wheels of movage keep turning. At the same time it’s essential that when content is used and cited, the specific version in time is what gets cited, or as Cliff Lynch is quoted as saying in the Attributes of a Trusted Repository:

It is very easy to replace an electronic dataset with an updated copy, and … the replacement can have wide-reaching effects. The processes of authorship … produce different versions which in an electronic environment can easily go into broad circulation; if each draft is not carefully labeled and dated it is difficult to tell which draft one is looking at or whether one has the “final” version of a work.

For example at $work we have a pilot project to process journal content submitted by publishers. We don't have the luxury of telling them exactly what content to submit (packaging formats, xml schemas, formats, etc), but on receipt we want to normalize the content for downstream applications by bagging it up, and then extracting some of the content into a standard location.

So a publisher makes an issue of a journal available for pickup via FTP:

adi_v2_i2
|-- adi_015.pdf
|-- adi_015.xml
|-- adi_018.pdf
|-- adi_018.xml
|-- adi_019.pdf
|-- adi_019.xml
|-- adi_v2_i2_ofc.pdf
|-- adi_v2_i2_toc.pdf
`-- adi_v2_i2_toc.xml

The first thing we do is bag up the content, to capture what was retrieved (manifest + checksums) and stash away some metadata about where it came from.

adi_v2_i2
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- adi_015.pdf
|   |-- adi_015.xml
|   |-- adi_018.pdf
|   |-- adi_018.xml
|   |-- adi_019.pdf
|   |-- adi_019.xml
|   |-- adi_v2_i2_ofc.pdf
|   |-- adi_v2_i2_toc.pdf
|   `-- adi_v2_i2_toc.xml
`-- manifest-md5.txt

Next we run some software that knows about the particularities of this publisher's content, and persist it into the bag in a predictable, normalized way:

adi_v2_i2
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- adi_015.pdf
|   |-- adi_015.xml
|   |-- adi_018.pdf
|   |-- adi_018.xml
|   |-- adi_019.pdf
|   |-- adi_019.xml
|   |-- adi_v2_i2_ofc.pdf
|   |-- adi_v2_i2_toc.pdf
|   |-- adi_v2_i2_toc.xml
|   `-- eco
|       `-- articles
|           |-- 1
|           |   |-- article.pdf
|           |   |-- meta.json
|           |   `-- ocr.xml
|           |-- 2
|           |   |-- article.pdf
|           |   |-- meta.json
|           |   `-- ocr.xml
|           `-- 3
|               |-- article.pdf
|               |-- meta.json
|               `-- ocr.xml
`-- manifest-md5.txt

The point of this post isn't the particular way these changes were layered into the filesystem--it could be done in a multitude of other ways (mets, oai-ore, mpeg21-didl, foxml, etc). The point is rather that the data has undergone two transformations very soon after it was obtained. And if we are successful, it will undergo many more as it is preserved over time. The advantage of layering content like this into the filesystem is that it makes some attempt to preserve the data by disentangling it from the code that operates on it, which will hopefully be swapped out at some point. At the same time we want to be able to get back to the original data, and to also possibly roll back changes which inadvertently corrupted or destroyed portions of the data. Shit happens.

So the way I see it we have three options:

  • Ignore the problem and hope it will go away.
  • Create full copies of the content, and link them together.
  • Version the content.

While tongue in cheek, option 1 is sometimes reality. Perhaps your repository content is such that it never changes. Or if it does, it happens in systems that are downstream from your repository. The overhead of managing versions isn't something that you need to worry about...yet. I've been told that DSpace doesn't currently handle versioning, and that they are looking to the upcoming Fedora integration to provide versioning options.

In option 2 version control is achieved by copying the files that make up a repository object, making modifications to the relevant files, and then linking the new copy to the older copy. We currently do this in our internal handling of the content in the National Digital Newspaper Program, where we work at what we call the batch level, which is essentially a directory of content shipped to the Library of Congress, containing a bunch of jp2, xml, pdf and tiff files. When we have accepted a batch, and later have discovered a problem, we then version the batch by creating a copy of the directory. When service copies of the batches are 50G or so, these full copies can add up. Sometimes just to make a few small modifications to some of the XML files we end up having what appears to be largely redundant copies of multi-gigabyte directories. Disk is cheap as they say, but it makes me feel a bit dirty.

Option 3 is to leverage some sort of version control system. One of the criteria I have for a version control system for data is that versioned content should not be dependent on a remote server. The reason for this is I want to be able to push the versioned content around to other systems, tape backup, etc and not have it become stale because some server configuration happened to change at some point. So in my mind, subversion and cvs are out. Revision control systems like rcs, git, mercurial and bazaar on the other hand are possibilities. Some of the nice things about using git or mercurial:

  • they are very active projects, so there are lots of eyes on the code fixing bugs and making enhancements
  • the content repositories are distributed, so they allow content to migrate into other contexts
  • developers who come to inherit content will be familiar with using them
  • you get audit logs for free
  • they include functionality to publish content on the web, and to copy or clone repository content

But there are trade offs with using git, mercurial, etc since these are quite complex pieces of software. Changes in them sometimes trigger backwards incompatible changes in repository formats--which require upgrading the repository. Fortunately there are tools for these upgrades, but one must stay on top of them. These systems also have a tendency to see periods of active use, followed by an exodus to a newer shinier version control system. But there are usually tools to convert your repositories from the old to the new.

To anyone familiar with digital preservation this situation should seem pretty familiar: format migration, or more generally data migration.

The California Digital Library have recognized the need for version control, but have taken a slightly different approach. CDL created a specification for a simplified revision control system called Reverse Directory Deltas (ReDD), and built an implementation into their Storage Service. The advantage that ReDD has is that the repository format is quite a bit more transparent, and less code dependent, than repository formats like git's or mercurial's. It is also a specification that CDL manages, and can change at will, rather than being at the mercy of an external opensource project. As an exercise about a year ago I wanted to see how easy it was to use ReDD by implementing it. I ended up calling the tool dflat, since ReDD fits in with CDL's filesystem convention, the D-flat specification.

In the pilot project I mentioned above I started out wanting to use ReDD. I didn't really want to use all of D-flat, because it did a bit more than I needed, and thought I could refactor the ReDD bits of my dflat code into a ReDD library, and then use it to version the journal content. But in looking at the code a year later I found myself wondering about the trade offs again. Mostly I wondered about the bugs that might be lurking in the code, which nobody else was really using. And I thought about new developers that might work on the project, and who wouldn't know what the heck ReDD was and why I wasn't using something more common.

So I decided to be lazy, and use mercurial instead.

I figured this was a pilot project, so what better way to get a handle on the issues of using distributed revision control for managing data? The project was using the Python programming language extensively, and Mercurial is written in Python -- so it seemed like a logical choice. I also began to think of the problem of upgrading software and repository formats as another facet of the format migration problem...which we wanted to have a story for anyway.

I wanted to start a conversation about some of these issues at Curate Camp, so I wrote a pretty simplistic tool that compares git, mercurial and dflat/redd in terms of the time it takes to initialize a repository and commit changes to an arbitrary set of content in a directory called 'data'. For example I ran it against a gigabyte of mp3 content, and it generates the following output:

ed@curry:~/Projects$ git clone git://github.com/edsu/versioning-metrics.git
ed@curry:~/Projects/versioning-metrics$ cp ~/Music data
ed@curry:~/Projects/versioning-metrics$ ./compare.py
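To make the mercurial option a bit more concrete, here's a minimal sketch of how one of the journal bags described above could be put under version control after each transformation (the paths are invented, and this isn't the compare.py code):

import subprocess

def hg(*args):
    subprocess.check_call(('hg',) + args)

bag = '/data/journals/adi_v2_i2'      # hypothetical location of the bag

hg('init', bag)                       # once, right after the bag is created
# ... normalize the publisher content into data/eco ...
hg('addremove', '--repository', bag)  # pick up added and removed files
hg('commit', '--repository', bag, '-m', 'normalized articles into data/eco')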


archival context on the web

On Mark’s advice I attended the Moving Forward With Authority Society of American Archivists pre-conference, which focused on the role of authority control in archival descriptions. There were lots of presentations/discussions over the day, but by and large most of them revolved around the recent release of the Encoded Archival Context: Corporate Bodies, Persons and Families (EAC-CPF) XML schema. Work on the EAC-CPF began in 2001 with the Toronto Tenets, which articulated the need for encoding not only the contents of archival finding aids (w/ Encoded Archival Description), but also the context (people, families, organizations) that the finding aid referenced:

Archival context information consists of information describing the circumstances under which records (defined broadly here to include personal papers and records of organizations) have been created and used. This context includes the identification and characteristics of the persons, organizations, and families who have been the creators, users, or subjects of records, as well as the relationships amongst them.
Toronto Tenets

I’ll admit I’m jaded. So much XML has gone under the bridge that it’s hard for me to get terribly excited (anymore) about yet-another-xml-schema. Yes, structured data is good. Yes, encouraging people to make their data available similarly is important. But for me to get hooked I need a story for how this structured data is going to live, and be used, on the Web. This is the main reason I found Daniel Pitti’s talk about the Social Networks of Archival Context (SNAC) to be so exciting.

SNAC is an NEH (and possibly IMLS soon) funded project of the University of Virginia, California Digital Library, and the Berkeley School for Information. The project’s general goal is to:

… unlock descriptions of people from descriptions of their records and link them together in exciting new ways.

Where “descriptions of their records” are EAD XML documents, and “linking them together” means exposing the entities buried in finding aids (people, organizations, etc), assigning identifiers to them, and linking them together on the web. I guess I might be reading what I want a bit into the goal based on the presentation, and my interest in Linked Data. If you are interested, more accurate information about the project can be found in the NEH Proposal that this quote came from.

Even though SNAC was only very recently funded, Daniel was already able to demonstrate a prototype application that he and Brian Tingle worked on. If I’m remembering this right, Daniel basically got a hold of the EAD finding aids from the Library of Congress, extracted contextual information (people, families, corporate names) from relevant EAD elements, and serialized these facts as EAC-CPF documents. Brian then imported the documents using an extension to the eXtensible Text Framework, which allowed XTF to be EAC-CPF aware.

The end result is a web application that lets you view distinct web pages for individuals mentioned in the archival material. For example here’s one for John von Neumann

All those names in the list on the right are themselves hyperlinks which take you to that person’s page. If you were able to scroll down (the prototype hasn’t been formally launched yet) you could see links to corporate names like Los Alamos Scientific Laboratory, The Institute for Advanced Study, US Atomic Energy Commission, etc. You would also see a Resources section that lists related finding aids and books, such as the John von Neumann Papers finding aid at the Library of Congress.

I don’t think I was the only one in the audience to immediately see the utility of this. In fact it is territory well trodden by OCLC and the other libraries involved in the VIAF project which essentially creates web pages for authority records for people like John Von Neumann who have written books. It’s also similar to what People Australia, BibApp and VIVO are doing to establish richly linked public pages for people. As Daniel pointed out: archives, libraries and museums do a lot of things differently; but ultimately they all have a deep and abiding interest in the intellectual output, and artifacts created by people. So maybe this is an area where we can see more collaboration across the cultural divides between cultural heritage institutions. The activity of putting EAD documents, and their transformed HTML cousins on the web is important. But for them to be more useful they need to be contextualized in the web itself using applications like this SNAC prototype.

I immediately found myself wondering if the URL say for this SNAC view of John von Neumann could be considered an identifier for the EAC-CPF record. And what if the HTML contained the structured EAC-CPF data using xml+xslt, a microformat, rdfa or a link rel pointing at an external XML document? Notice I said an instead of the identifier. If something like EAC-CPF is going to catch on lots of archives would need to start generating (and publishing) them. Inevitably there would be duplication: e.g. multiple institutions with their own notion of John von Neumann. I think this would be a good problem to have, and that having web resolvable identifiers for these records would allow them to be knitted together. It would also allow hubs of EAC-CPF to bubble up, rather than requiring some single institution to serve as the master database (as in the case of VIAF).

A few things that would be nice to see for EAC-CPF would be:

  • Instructions on how to link EAD documents to EAC-CPF documents.
  • Recommendations on how to make EAC-CPF data available on the Web.
  • A wikipage or some low-cost page for letting people share where they are publishing EAC-CPF.

In addition I think it would be cool to see:

Anyhow I just wanted to take a moment to say how exciting it is to see the stuff hiding in finding aids making it out onto the Web, with URLs for resources like People, Corporations and Families. I hope to see more as the SNAC project integrates more with existing name authority files (work that Ray Larson at Berkeley is going to be doing), and imports finding aids from more institutions with different EAD encoding practices.


federal register embraces the web and opensource

Tom Lee of the Sunlight Foundation blogged yesterday about the new Federal Register website. The facelift was also announced a few days earlier by the Archivist of the United States, David Ferriero. If you aren’t familiar with it already, the Federal Register is basically the daily newspaper of the United States Federal Government, which details all the rules and regulations of the federal agencies. It is compiled by the Office of the Federal Register located in the National Archives, and printed by the Government Printing Office. As the video describing the new site points out, the Federal Register began publication in 1936 in the depths of the Great Depression as a way to communicate in one place all that the agencies were doing to try to jump start the economy. So it seems like a fitting time to be rethinking the role of the Federal Register.

I’m no usability expert, but just a few minutes browsing the new site and comparing it to the old one make it clear what a leap forward this is. Hopefully the legal status of the new site will be ironed out shortly.

Most of all it’s great to see that the Federal Register is now a single web application. The service it provides to the American public is important enough to deserve its own dedicated web presence. As the developers point out in their video describing the effort, they wanted to make the Federal Register a “first class citizen of the web”…and I think they are certainly helping do that. This might seem obvious, but often there is a temptation to jam publications from the print world (like the Federal Register) into dumbed down monolithic repositories that treat all “objects” the same. Proponents of this approach tend to characterize one off websites like Federal Register 2.0 as “yet another silo”. But I think it’s important to remember that the web was really created to break down the silo walls, and that every well designed web site is actually the antithesis of a silo. In fact, monolithic repository systems that treat all publications as static documents to be uniformly managed are more like silos than these ‘one off’ dedicated web applications.

As a software developer working in the federal government there were a few things about the Federal Register 2.0 that I found really exciting:

  • Fruitful collaboration between federal employees and citizen activist/geeks initiated by a software development contest.
  • Extensive use of opensource technologies like Ruby, Ruby on Rails, MySQL, Sphinx, nginx, Varnish, Passenger, Apache2, Ubuntu Linux, Chef. Opensource technologies encourage collaboration by allowing citizen activists/technologists to participate without having to drop a princely sum.
  • Release of the source code for the website itself, using decentralized revision control (git) so that people can easily contribute changes, and see how the site was put together.
  • Extensive use of syndicated feeds to communicate how content is being added to the site, ical feeds to keep on top of events going on in your area, and detailed XML for each entry.
  • The robots.txt file for the site makes the content fully crawlable by web indexers, except for search related portions of the website. Excluding dynamic search results is often important for performance reasons, but much of the article content can be discovered via links, see below about permalinks. They also have made a sitemap available for crawlers to efficiently discover URLs for the content.
  • Deployment of the web application to the cloud using Amazon’s EC2 and S3 services. Cloud computing allows computing resources to scale to meet demand. In effect this means that government IT shops don’t have to make big up front investments in infrastructure to make new services available. I guess the jury is still out, but I think this will eventually prove to greatly lower the barrier to innovation in the egov sector. It also lets the more progressive developers in government leap frog ancient technologies and bureaucracies to get things done in a timely manner.
  • And last, but certainly not least … now every entry in the Federal Register has a URL! Permalinks for the Federal Register are incredibly important for citability reasons. I predict that we’ll quickly see more and more people referencing specific parts of the Federal Register in social media sites like Facebook, Twitter and out on the open web in blogs, and in collaborative applications like Wikipedia.

I would like to see more bulk access to XML data made available for re-purposing on other websites–although I guess one could walk from the syndicated feeds to the detailed XML. Also, the search functionality is so rich it would be useful to have an OpenSearch description that documents it, and perhaps provides some hooks for getting back JSON and/or XML representations. Perhaps they could even follow the lead of the London Gazette and try to make some of the structured metadata available in the HTML using RDFa. It also looks like content is only available from 2008 on, so it might be interesting to see how easy it would be to make more of the historic content available.

But the great thing about what these folks have done is now I can fork the project on github, see how easy it is to add the changes, and let the developers know about my updates to see if they are worth merging back into the production website. This is an incredible leap forward for egov efforts–so hats off to everyone who helped make this happen.


linking things and common sense

Tom Scott’s recent Linking Things post got me jotting down what I’ve been thinking lately about URIs, Linked Data and the Web. First go read Tom’s post if you haven’t already. He does a really nice job of setting the stage for why people care about using distinct URIs (web identifiers) for identifying web documents (aka information resources) and real world things (aka non-information resources). Tom’s opinions are grounded in the experience of really putting these ideas into practice at the BBC. His key point, which he attributes to Michael Smethurst, is that:

Some people will tell you that the whole non-information resource thing isn’t necessary – we have a web of documents and we just don’t need to worry about URIs for non-information resources; others will claim that everything is a thing and so every URL is, in effect, a non-information resource.

Michael, however, recently made a very good point (as usual): all the interesting assertions are about real world things not documents. The only metadata, the only assertions people talk about when it comes to documents are relatively boring: author, publication date, copyright details etc.

If this is the case then perhaps we should focus on using RDF to describe real world things, and not the documents about those things.

I think this is an important observation, but I don’t really agree with the conclusion. I would conclude instead that the distinction between real world and document URIs is a non-issue. We should be able to tell if the thing being described is a document or a real world thing based on the vocabulary terms that are being used.

For example, if I assert:

<http://en.wikipedia.org/wiki/William_Shakespeare> a foaf:Person ; foaf:name "William Shakespeare" .

Isn’t it reasonable to assume http://en.wikipedia.org/wiki/William_Shakespeare identifies a person whose name is William Shakespeare? I don’t have to try to resolve the URL and see if I get a 303 or 200 response code do I?
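For the record, this is roughly what that check involves (a little sketch with httplib: nothing more than a HEAD request, without following redirects, and a look at the status code):

import httplib
from urlparse import urlparse

def http_status(url):
    "Do a HEAD request without following redirects and return the status code."
    parts = urlparse(url)
    conn = httplib.HTTPConnection(parts.netloc)
    conn.request('HEAD', parts.path or '/')
    return conn.getresponse().status

# per httpRange-14, a 303 hints at a real world thing, a 200 at a document
print http_status('http://en.wikipedia.org/wiki/William_Shakespeare')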

And if I also assert,

<http://en.wikipedia.org/wiki/William_Shakespeare> dcterms:modified "2010-06-28T17:02:41-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

can’t I assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that was modified on 2010-06-28T17:02:41? Does it really make sense to think that the person William Shakespeare was modified then? Not really…

Similarly if I said,

<http://en.wikipedia.org/wiki/William_Shakespeare> cc:license <http://creativecommons.org/licenses/by-sa/3.0/> .

Isn’t it reasonable to assume that http://en.wikipedia.org/wiki/William_Shakespeare identifies a document that is licensed with the Attribution-ShareAlike 3.0 Unported license? It doesn’t really make sense to say that the person William Shakespeare is licensed with Attribution-ShareAlike 3.0 Unported does it? Not really…

Why does the Linked Data community lean on using identifiers to do this common sense work? Well, largely because people argued about it for three years and this is the resolution the W3C came to. In general I like the REST approach of saying a URL identifies a Resource, and that when you resolve one you get back a Representation (a document of some kind, html, rdf/xml, whatever). Why does it have to be more complicated than that?

If it’s not clear if an assertion is about a document or a thing, why isn’t that a problem with the vocabulary in use being underspecified and vague? I believe this is essentially the point that Xiaoshu Wang made three years ago in his paper URI Identity and Web Architecture Revisited.

To get back to Tom’s point, I agree that the really interesting assertions in Linked Data are about things, and their relations, or as Richard Rorty said a bit more expansively:

There is nothing to be known about anything except an initially large, and forever expandable, web of relations to other things. Everything that can serve as a term of relation can be dissolved into another set of relations, and so on for ever. There are, so to speak, relations all the way down, all the way up, and all the way out in every direction: you never reach something which is not just one more nexus of relations.

Philosophy and Social Hope, pp 53-54.

But assertions about a document, albeit a bit more on the dry side, are also useful and important, such as: who created the web document, when they created it, a license associated with the document, its relation to previous versions, etc. As a software developer working in a library I’m actually really interested in this sort of administrivia. In fact the Open Archives Initiative Object Reuse and Exchange vocabulary and the Memento effort are largely about relating web documents together in meaningful and useful ways: to be able to harvest compound objects out of the web, and to navigate between versions of web documents. Heck, the Dublin Core vocabulary started out as an effort to describe networked resources (essentially documents), and the gist of the Dublin Core Metadata Terms retains much of this flavor. So I think RDF is also important for describing documents on the web, or (more accurately) representations.

So, in short:

  1. URLs identify resources.
  2. A resource can be anything.
  3. When you resolve a URL you get a representation of that resource.
  4. If a representation is some sort of flavor of RDF, the semantics of an RDF vocabulary should make it clear what is being described.
  5. If it’s not clear, maybe the vocabulary sucks.

I think this is basically the point that Harry Halpin and Pat Hayes were making in their paper In Defense of Ambiguity. A URL has a dual role: it identifies resources, and it allows us to access representations of resources. This ambiguity is the source of its great utility, expressiveness and power. It’s why we see URLs on the sides of buses and buildings. It’s why a QR Code slapped on some real world thing has a URL embedded in it.

In an ideal world (where people agreed with Xiaoshu, Harry and Pat) I don’t think this would mean we would have to redo all the Linked Data that we have already. I think it just means that publishers who want the granularity of distinguishing between real world things and documents at the identifier level can have it. It would also mean that the Linked Data space can accommodate RESTafarians, and other mere mortals who don’t want to ponder whether their resources are information resources or not. And, of course, it would mean we could use a URL like http://en.wikipedia.org/wiki/William_Shakespeare to identify William Shakespeare in our RDF data …

Wouldn’t that be nice?