Archive for the ‘libraries’ Category

MIME types and library metadata

Wednesday, April 23rd, 2008

While thinking about library metadata and RESTful web services I got to wondering how many application/*+xml MIME types have actually been registered. It turns out that 120 out of the 633 other application/* MIME types.

Does it seem like a generally useful thing to be able to identify metadata representations with MIME types? Rebecca Guenther registered application/marc back in 1997. Maybe we could have application/marc+xml, application/mods+xml, application/dc+xml?

MIME types for established library metadata formats would be useful to use in applications like AtomPub implementations, or say OAI-ORE resource maps that want to identify the format of a particular resource. In general it would be useful to have in RESTful environments where content-negotiation for resources is encouraged.

If you are curious, here is a current (as of Apr 23, 2008) list of registered MIME types that are in the application/*+xml space.

application/atom+xml
application/atomcat+xml
application/atomsvc+xml
application/auth-policy+xml
application/beep+xml
application/ccxml+xml
application/cellml+xml
application/cnrp+xml
application/conference-info+xml
application/cpl+xml
application/csta+xml
application/CSTAdata+xml
application/davmount+xml
application/dialog-info+xml
application/epp+xml
application/im-iscomposing+xml
application/kpml-request+xml
application/kpml-response+xml
application/mbms-associated-procedure-description+xml
application/mbms-deregister+xml
application/mbms-envelope+xml
application/mbms-msk-response+xml
application/mbms-msk+xml
application/mbms-protection-description+xml
application/mbms-reception-report+xml
application/mbms-register-response+xml
application/mbms-register+xml
application/mbms-user-service-description+xml
application/media_control+xml
application/mediaservercontrol+xml
application/oebps-package+xml
application/pidf+xml
application/pls+xml
application/poc-settings+xml
application/rdf+xml
application/reginfo+xml
application/resource-lists+xml
application/rlmi+xml
application/rls-services+xml
application/samlassertion+xml
application/samlmetadata+xml
application/sbml+xml
application/shf+xml
application/simple-filter+xml
application/smil+xml
application/soap+xml
application/sparql-results+xml
application/spirits-event+xml
application/srgs+xml
application/ssml+xml
application/vnd.3gpp.bsf+xml
application/vnd.3gpp2.bcmcsinfo+xml
application/vnd.adobe.xdp+xml
application/vnd.apple.installer+xml
application/vnd.avistar+xml
application/vnd.chemdraw+xml
application/vnd.criticaltools.wbs+xml
application/vnd.ctct.ws+xml
application/vnd.eszigno3+xml
application/vnd.google-earth.kml+xml
application/vnd.HandHeld-Entertainment+xml
application/vnd.informedcontrol.rms+xml
application/vnd.irepository.package+xml
application/vnd.liberty-request+xml
application/vnd.llamagraphics.life-balance.exchange+xml
application/vnd.marlin.drm.actiontoken+xml
application/vnd.marlin.drm.conftoken+xml
application/vnd.mozilla.xul+xml
application/vnd.ms-playready.initiator+xml
application/vnd.nokia.conml+xml
application/vnd.nokia.iptv.config+xml
application/vnd.nokia.landmark+xml
application/vnd.nokia.landmarkcollection+xml
application/vnd.nokia.n-gage.ac+xml
application/vnd.nokia.pcd+xml
application/vnd.oma.bcast.associated-procedure-parameter+xml
application/vnd.oma.bcast.drm-trigger+xml
application/vnd.oma.bcast.imd+xml
application/vnd.oma.bcast.notification+xml
application/vnd.oma.bcast.sgdd+xml
application/vnd.oma.bcast.smartcard-trigger+xml
application/vnd.oma.bcast.sprov+xml
application/vnd.oma.dd2+xml
application/vnd.oma.drm.risd+xml
application/vnd.oma.group-usage-list+xml
application/vnd.oma.poc.detailed-progress-report+xml
application/vnd.oma.poc.final-report+xml
application/vnd.oma.poc.groups+xml
application/vnd.oma.poc.invocation-descriptor+xml
application/vnd.oma.poc.optimized-progress-report+xml
application/vnd.oma.xcap-directory+xml
application/vnd.omads-email+xml
application/vnd.omads-file+xml
application/vnd.omads-folder+xml
application/vnd.otps.ct-kip+xml
application/vnd.poc.group-advertisement+xml
application/vnd.pwg-xhtml-print+xml
application/vnd.recordare.musicxml+xml
application/vnd.solent.sdkm+xml
application/vnd.sun.wadl+xml
application/vnd.syncml.dm+xml
application/vnd.syncml+xml
application/vnd.uoml+xml
application/vnd.wv.csp+xml
application/vnd.wv.ssp+xml
application/vnd.zzazz.deck+xml
application/voicexml+xml
application/watcherinfo+xml
application/wsdl+xml
application/wspolicy+xml
application/xcap-att+xml
application/xcap-caps+xml
application/xcap-el+xml
application/xcap-error+xml
application/xcap-ns+xml
application/xenc+xml
application/xhtml+xml
application/xmpp+xml
application/xop+xml
application/xv+xml

literals and resources

Wednesday, March 26th, 2008

There’s a fascinating modeling discussion going on over on the DC-RDA list about whether RDA properties should reference literals or resources in descriptions. For example when describing an author you could use a literal:

Twain, Mark, 1835-1910

or a resource:

http://lccn.loc.gov/n79021164

There are some shades of gray in between (using blank nodes, auto-generated URIs, typed literals) but that’s the basic gist of it. The discussion basically concerns what the DC-RDA Application Profile should allow. There seems to be two competing interests:

  1. perceived ease of migrating legacy data (MARC -> RDA)
  2. perceived benefits to explicitly modeling the relationships found in bibliographic data

More information can also be found in the blogs of Karen Coyle and Jon Phipps.

My personal opinion is that RDA should take the high road on this one and really drive home the value proposition for using resources wherever possible, modeling relationships in bibliographic data, and leveraging hundreds of years of work maintaining controlled vocabularies. This will have the positive side effect of pushing library controlled vocabularies (LCSH, name authority, language and geographic codes, etc.) into the open on the web. More importantly I think it will highlight what libraries (at their best) do best, for the larger semantic web and computing world. I think it’s worth limping along a bit longer with MARC and waiting for RDA to actually “do the right thing”.

How to do this effectively is another matter, and is really what the discussion is about. It’s really nice to see people talking openly about these issues.

(PS, using an author isn’t a particularly good example because I don’t see it in the current list of RDA properties…)

(PSS, no that lccn url doesn’t currently resolve (it does for bibliographic records, but not authority) or return rdf (hopefully someday))

oai-ore post baltimore thoughts

Thursday, March 13th, 2008

The recent OAI-ORE meeting was just up the road in Baltimore, so it was easy for a bunch of us from the Library of Congress to attend. I work on a team at LC that is specifically looking at the role that repositories play at the library; I’ve implemented OAI-PMH data providers and harvesters, and in the past couple of years I’ve gotten increasingly interested in semantic web technologies — so OAI-ORE is of particular interest to me. I’ve commented a bit about OAI-ORE on here before, but I figure it can’t hurt to follow in my coworker’s footsteps and summarize my thoughts after the meeting.

(BTW, above is an image of some constellations I nabbed off of wikipedia. I included it here because the repeated analogy (during the meeting) of OAI-ORE resource maps as constellations was really compelling — and quite poetic.)

The Vocabulary

It seems to me that the real innovation of the OAI-ORE effort is that it provides a lightweight RDF vocabulary for talking about aggregated resources on the web. Unfortunately I think that this kernel gets a little bit lost in the 6 specification documents that were released en masse a few months ago.

The ORE vocabulary essentially consists of three new resource types: ore:ResourceMap, ore:Aggregation, ore:AggregatedResource ; and 5 new properties to use with those types: ore:describes, ore:isDescribedBy, ore:aggregates, ore:isAggregatedBy, ore:analogousTo. In addition, the Vocabulary document
provides guidance on how to use a few terms from the DublinCore vocabulary: dc:creator, dc:rights, dcterms:modified, dcterms:created.

The vocabulary is small, so if I were them I would publish the vocabulary elements using hash URIs, instead of slash URIs. The reason for this is that you don’t have to jigger the web server to do a httpRange-14 style 303 correctly:

  • http://www.openarchives.org/ore/0.2/terms#Aggregation
  • http://www.openarchives.org/ore/0.2/terms#AggregatedResource
  • http://www.openarchives.org/ore/0.2/terms#ResourceMap
  • http://www.openarchives.org/ore/0.2/terms#describes
  • http://www.openarchives.org/ore/0.2/terms#isDescribedBy
  • http://www.openarchives.org/ore/0.2/terms#aggregates
  • http://www.openarchives.org/ore/0.2/terms#isAggregatedBy
  • http://www.openarchives.org/ore/0.2/terms#analogousTo

Also, I think ore:AggregatedResource is currently missing from the rdf/xml vocabulary, so it should be added. Also ore:isDescribedBy seems to be commented out.

There is a lot of redundancy between the Abstract Data Model and the Vocabulary documents–so I would recommend collapsing them down into a single, succinct document. This is in keeping with the DRY principle and will have the added benefit of making it easier for newbies to hit the ground running (not having to wade through multiple docs and mentally reconcile them). I could understand having a separate Abstract Data Model document if it were totally divorced from the web and semantic web technologies like RDF, but it’s not.

The Graph

The OAI-ORE effort seemed to be mostly driven by a desire to take harvesting agents the last mile to the actual repository resources themselves–enabling digital library objects (in addition to their metadata) to be harvested from repositories (using HTTP) ; and to be referenced from other contexts (say objects in other repositories). This desire was born out of real, hard won experience with harvesting metadata records, and marked a shift from metadata-centric harvesting to resource-centric harvesting.

In addition OAI-ORE marks a departure from predictable and mind-numbing arguments about SIP formats (METS, DIDL, FOXML, IEEE LOM, XFDU, etc). Yet as soon as we have our shiny new OAI-ORE vocabulary we have to learn yet-another-packaging-format, this time one built on top of Atom.

First, let me just say I’m a big fan of RFC 4287, in particular how it is used in the RESTful Atom Publishing Protocol (RFC 5023). I also think it makes sense to have an Atom serialization for OAI-ORE resource maps — assuming there is a GRDDL transform for turning it into RDF. But the workshop in Baltimore seemed to stress that the Atom serialization was the only way to do OAI-ORE, and didn’t emphasize that there are in fact lots of ways of representing RDF graphs on the web. For example GRDDL allows you to associate arbitrary XML with an XSLT transform to extract a RDF graph. And you could encode your RDF graph directly with RDFa, N3, Turtle, ntriples, or RDF/XML.

Perhaps there is a feeling that stressing the RDF graph too much will alienate some people who are more familiar with XML technologies. Or perhaps all these graph serialization choices could be perceived as being too overwhelming. But I think the opposite extreme of making it look like you can only use an overloaded Atom document as a means to publishing ORE resource maps is misguided, and will ultimately slow adoption. Why not encourage people to publish GRDDL transforms for METS, DIDL or mark up their “splash pages” with RDFa? This would bring the true value of the OAI-ORE work home–it’s not about yet-another-packaging format, it’s about what the various packaging formats have in common on the web.

Release Early, Release Often

In hindsight I think it would’ve been helpful for the OAI-ORE group to privately build consensus about the core OAI-ORE vocabulary (if necessary), then release that into the world wild web for discussion. Then once the kinks were worked out, and there was general understanding, moving on to issues such as discovery and serialization. As it stands the various documents were all dumped at the same time, and seem somewhat fragmented, and in places redundant. Clearly a lot of conversations have gone on that aren’t happening on the public discussion list.

I expressed interest in being part of the OAI-ORE and was politely turned down. I’m actually kind of glad really because I also don’t want to be part of some cabal of digital library practitioners. Maybe I should’ve titled this post “Sour Grapes” :-) Seriously though, the digital library needs good practical solutions and communities of users that encourage widespread adoption and tool support. We don’t need research-ware. Having secret discussions and occasional public events that feel more like lectures than meetings isn’t a good way to encourage adoption.

Anyhow, I hope that this isn’t all seen as being too harsh. Everyone’s a critic eh? All in all there is a lot in OAI-ORE to be proud of. The effort to integrate Web Architecture into Digital Library practices is most welcome indeed. Keep up the good work y’all.

oai-ore and the shadow web

Friday, February 22nd, 2008

The OAI-ORE meeting is coming up, and in general I’ve been really impressed with the alpha specs that have come out. It’s not clear that there’s an established vocabulary for talking about aggregated resources on the web, so the Data Model and Vocabulary documents were of particular interest to me.

One thing I didn’t quite understand, and which I think may have some significance for implementors, is some language in the Discovery document on the subject of URI conflation:

The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable “splash page”, either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a “splash page” for an object:

If I’m understanding right this would prohibit using technologies like microformats, eRDF, RDFa and GRDDL in a “splash page” to represent the resource map. It seems odd to me that you can represent a resource map in Atom, but not in HTML.

To illustrate what this might look like I took a splash page off of arXiv (hope that was ok!) and marked it up with oai-ore RDFa.

Take a look. So all I did is modify the existing XHTML at arxiv.org, and I’ve been able to represent an ORE Resource Map. This seems like a relatively simple, and powerful way for existing repositories to make their aggregated resources available.

RDFa just entered Last Call, but there are already multiple implementations. Try out the GetN3 bookmarklet on the splash page, and you should see some triples come back. I ran them through the validator at w3c and got the following graph (kinda too big to include here inline).

This kind of issue seem to be at the heart of what Ian Davis refers to when he asks “Is the Semantic Web Destined to be a Shadow?“. Andy Powell and Pete Johnston have also been strong voices for integrating digital library repositories and the web–and they are also involved with the oai-ore effort. It feels like some of the oia-ore language could be loosened a bit to allow machine readable and human readable information to commingle a bit more.

calais and ocr newspaper data

Wednesday, February 13th, 2008

Like you I’ve been reading about the new Reuters Calais Web Service. The basic gist is you can send the service text and get back machine readable data about recognized entities (personal names, state/province names, city names, etc). The response format is kind of interesting because it’s RDF that uses a bunch of homespun vocabularies.

At work Dan, Brian and I have been working on ways to map document centric XML formats to intellectual models represented as OWL. At our last meeting one of our colleagues passed out the Calais documentation, and suggested we might want to take a look at it in the context of this work. It’s a very different approach in that Calais is doing natural language processing and we instead are looking for patterns in the structure of XML. But the end result is the same–an RDF graph. We essentially have large amounts of XML metadata for newspapers, but we also have large amounts of OCR for the newspaper pages themselves. Perfect fodder for nlp and calais…

To aid in the process I wrote a helper utility (calais.py) that bundles up the Calais web service into a function call that returns a rdf graph, courtesy of Dan’s rdflib:

  import calais
  graph = calais_graph(content)

This is dependent on you getting a calais license key and stashing it away in ~/.calais. I wrote a couple sample scripts that use calais.py to do stuff like output all the personal names found in the text. For example here’s the people script. note, the angly brackets are missing from the sparql prefixes intentionally, since they don’t render properly (yet) in wordpress.

  from calais import calais_graph
  from sys import argv
 
  filename = argv[1]
  content = file(filename).read()
  g = calais_graph(content)
 
  sparql = """
          PREFIX rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
          PREFIX ct: http://s.opencalais.com/1/type/em/e/
          PREFIX cp: http://s.opencalais.com/1/pred/
          SELECT ?name
          WHERE {
            ?subject rdf:type ct:People .
            ?subject cp:name ?name .
          }
          """
 
  for row in g.query(sparql):
      print row[0]

Notice the content is sent to calais, the graph comes back, and then a SPARQL query is executed on it? Here’s what we get when I run this OCR data through (take a look at the linked OCR to see just how irregular this data is).

  ed@curry:~/bzr/calais$ ./people data/ndnp\:774348
  McKmley
  Edwin W. Joy
  A. Musto
  JOHN D. SPRECKELS
  George Dlxoh
  Le Roy
  Bryan
  Charles P. Braslan
  Siegerfs Angostura Bitters
  James Stafford
  Herbert Putnam
  H. G. Pond
  Charles F. Joy
  Santa Rosa
  Allen S. Qlmsted
  Pptter Palmer

Clearly there are some errors, but you could imagine ranked list of these as they occurred across a million pages, where the anomalies would fall off on the long tail somewhere. It could be really useful in faceted browse applications. And here’s the output of cities.

  ed@curry:~/bzr/calais$ ./cities data/ndnp:774348
  Valencia
  San Jose
  Seattle
  Newport
  Santa Clara
  St. Louis
  New York
  Haifa
  Venice
  Rochester
  Fremont
  San Francisco
  San Francisco
  Chicago
  Oakland
  Los Angeles
  Fresno
  Watsonville
  Philadelphia
  Washington
  CHICAGO

Not too shabby. If you want to try this out, install rdflib, and you can grab calais.py and the sample scripts and OCR samples from my bzr repo:

  bzr branch http://inkdroid.org/bzr/calais

If you do dive into calais.py you’ll notice that currently the REST interface is returning the RDF escaped in an XML envelope of some kind. I think this is a bug, but calais.py extracts and unescapes the RDF.

lcsh, thesauri and skos

Wednesday, January 23rd, 2008

Simon Spero has an interesting post on why LCSH cannot be considered a thesaurus. At $work I’ve been working on mapping LCSH/MARC to SKOS, so Simon’s efforts in both collecting and analyzing LCSH authority data have been extremely valuable. In particular Simon and Leonard Willpower’s involvement with SKOS alerted me relatively early on to some of the problems that lie in store when thinking of LCSH in terms of a thesaurus.

The problem stems from very specific (standardized) notions of what thesauri are. Z39-19-2005 defines broader relationships in thesauri as being transitive. So if a has the broader term b, and b has the broader term c, then you can infer a has the broader term c.

Now consider the broader relationships (BT for those of you w/ the red books handy, or care to browse authorities.loc.gov from the comfort of your chair) from the heading “Non-alcoholic cocktails”:

If broader relationships are to be considered transitive one is obliged to treat Alcoholic beverages as a broader term for Non-alcoholic cocktails. But clearly it’s nonsense to consider a non-alcoholic cocktail a specialization of an alcoholic beverage. As Simon pointed out the problem was recognized by Mary Dykstra soon after LCSH adopted terminology from the thesauri world (BT, NT, RT) in 1986. Her article, LC Subject Headings Disguised as a Thesaurus describes the many difficulties of treating LCSH as a thesaurus. In the example above from LCSH the broader (BT) relationship is used for both hierarchical (IS-A) relationships, as well as part/whole (HAS-A) relationships. According to thesauri folks this is a no-no.

LCSH aside, the semantics of broader/narrower have been an issue for SKOS for a fair amount of time. Guus Schreiber proposed a resolution, which was just accepted at yesterday’s SWD telecon. SKOS is trying to straddle several different worlds, enabling the representation of a range of knowledge organization systems from thesauri and taxonomies to subject heading lists, folksonomy and other controlled vocabularies. To remain flexible in this way, while still appealing to the thesaurus world a compromise was reached where the skos:broader and skos:narrower semantic relations were declared to be sub-properties of two new properties: skos:broaderTransitive and skos:narrowerTransitive (respectively). Since transitivity is not inherited, SKOS can still be used by people who want to represent loose broader relationships (LCSH, and others). At the same time SKOS will allow vocabulary owners to infer transitive broader/narrower relationships across concepts. Incidentally the SKOS Reference was just approved yesterday as a W3C Working Draft, which is its first step along the way to hopefully becoming a Recommendation.

My pottering about with LCSH and SKOS has also illustrated the value in making links between concepts explicit. Modeling LCSH as a graph data structure (SKOS), where each concept has a unique identifier has been a simple and yet powerful step in working with the data. For example to generate the image above, I simply wrote a script that transformed the subgraph related to “Non-alcoholic cocktails” to a graphviz dot file:


digraph G {
  rankdir = "BT"
  "Non-alcoholic cocktails" -> "Cocktails";
  "Alcoholic beverages" -> "Beverages";
  "Non-alcoholic beverages" -> "Beverages";
  "Cocktails" -> "Alcoholic beverages";
  "Non-alcoholic cocktails" -> "Non-alcoholic beverages";
  "Non-alcoholic beer" -> "Non-alcoholic beverages";
}

And then ran that through the graphviz dot utility:


% dot -T png non-alcoholic-cocktails.dot > non-alcoholic-cocktails.png

to generate the PNG file you see. It’s my hope that making a richly linked graph like LCSH/SKOS available will enable not only enhanced use of the vocabulary, but also aid in creative, collaborative refactoring of the graph. I know that these issues are not new to LC, however tools that enable refactoring along the lines of what Margherita Sini proposed for the cocktail problem above will only be possible in a world where the graph can easily be manipulated and, downstream applications (library catalogs, etc) can easily adapt to the changing concept scheme.

tripleshot

Friday, January 11th, 2008

Recently there was a bit of interesting news around a MARBI Discussion Paper 2008-DP04 regarding semweb technologies at LC.

Related to this work are RDF/OWL representations and models for MODS and MARC, which we are also developing. Several representations of MODS in RDF/OWL, such as the one from the SIMILE project, have been made available as part of various projects and we have found they useful for our analysis and to inform our design process. We want to bring them together into one easily downloaded and maintained RDF/OWL file for use in community experimentation with RDF applications. Our time line is to have the MODS RDF ready for community comment by June.

WoGroFuBiCo cloud

Friday, January 11th, 2008

access accessible addition al american analysis application applications appropriate archives areas association authority available based benefit benefits bibliographic broad broader catalog catalogers cataloging catalogs cataloguing chain change changes classification code collaboration collections committee communities community congress consequences consider considered content continue control controlled cooperative cost costs create created creating creation current data databases dc description descriptive desired develop developed development different digital discovery distribution dublin ed education effort encourage enhance environment et evidence exchange exist findings focus format formats frameworks frbr future greater group headings hidden identifiers identify ifla impact include including increase increasingly information institution institutions international knowledge language lc lcs lcsh libraries limited lis maintaining make management marc materials metadata model national need needs networks new number oclc online organization organizations outcomes outside participants particular pcc possible potential practice practices primary principles process processes production program programs provide public publishers quo range rare rda recommendations records reference relationships report require requirements research resource resources responsibility results role rules search serve service services share shared sharing sources special specific standards states status subject supply support systems technology terms time today tools types unique united university use used users using value variety various vendors vocabularies washington ways web working works

same stats as before, but the top 200 this time, and as a cloud. It’s crying out for some kind of stemming to collapse some terms together I suppose…but it’s also 3:17AM.

WoGroFuBiCo wc

Thursday, January 10th, 2008

word count
library 263
bibliographic 236
data 170
libraries 144
lc 127
control 109
information 98
cataloging 91
records 88
subject 82
materials 81
standards 81
use 80
congress 79
work 76
record 73
community 67
users 61
working 59
group 58
access 57
recommendations 56
resources 53
authority 52
metadata 47
future 46
new 40
environment 37
development 37
web 36
collections 35
systems 35
available 35
creation 35
services 34
headings 32
national 31
findings 30
research 30
unique 29
sharing 29
oclc 28
model 28
catalog 28
international 27
develop 27
value 27
lcsh 26
pcc 26
user 26
need 26
report 25
make 25
practices 25
rda 25
used 25
time 24
needs 24
rare 24
including 24
provide 23
discovery 23
communities 23
special 23
frbr 23
current 22
resource 22
rules 22
digital 21
cooperative 21
program 21
participants 21
management 21
service 20
dc 20
programs 20
online 20
costs 20
washington 20
standard 19
support 19
knowledge 19
different 19
appropriate 19
effort 18
applications 18
marc 18
shared 18
exchange 18
process 18
changes 17
lcs 17
increase 16
public 16
search 16
creating 16
broader 16
catalogs 16
controlled 16

I converted the pdf to text file called ‘lc’ with xpdf and then wrote a little python:

#!/usr/bin/env python
 
from urllib import urlopen
from re import sub
 
stop_words = urlopen('http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words').read().split()
text = file('lc').read()
 
counts = {}
for word in text.split():
    word = word.lower()
    word = sub(r'\W', '', word)
    word = sub(r'\d+', '', word)
    if word == ''  or word in stop_words: continue
    counts[word] = counts.get(word,0) + 1
 
words = counts.keys()
words.sort(lambda a,b: cmp(counts[b], counts[a]))
for word in words[0:100]:
    print "%20s %i" % (word, counts[word])

Does me writing code to read the report count as reading the report? …

permalinks reloaded

Monday, December 17th, 2007

The recently announced Zotero / InternetArchive partnership is exciting on a bunch of levels. The one that immediately struck me was the use of the Internet Archive URI. As you may have noticed before all the content in Internet Archive Wayback Machine can be referenced with a URL that looks something like:

  • http://web.archive.org/web/{yyyymmddhhmmss}/{url}

Where url is the document URL you want to look up in the archive at the given time. So for example:

is a URL for what http://google.com looked like on December 02, 1998 at 23:04:10. Perhaps this is documented somewhere prominent or is common knowledge, but it looks like you can play with the timestamp, and archive.org will adjust as needed, redirecting you to the closest snapshot it can find:

and even:

which redirects to the most recent content for a given URL. It’s just a good old 302 at work:

ed@curry:~$ curl -I http://web.archive.org/web/199812/http://www.google.com/
HTTP/1.1 302 Found
Date: Mon, 17 Dec 2007 21:11:12 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.2 mod_ssl/2.0.54 OpenSSL/0.9.7g mod_perl/2.0.1 Perl/v5.8.7
Location: http://web.archive.org/web/19981202230410/www.google.com/
Content-Type: text/html; charset=iso-8859-1

So anyhow, pretty cool use of URIs and HTTP right? The addition of zotero to the mix will mean that scholars can cite the web as it appeared at a particular point in time:

… as scholars begin to use not only traditional primary sources that have been digitized but also “born digital” materials on the web (blogs, online essays, documents transcribed into HTML), the possibility arises for Zotero users to leverage the resources of IA to ensure a more reliable form of scholarly communication. One of the Internet Archive’s great strengths is that it has not only archived the web but also given each page a permanent URI that includes a time and date stamp in addition to the URL.

Currently when a scholar using Zotero wishes to save a web page for their research they simply store a local copy. For some, perhaps many, purposes this is fine. But for web documents that a scholar believes will be important to share, cite, or collaboratively annotate (e.g., among a group of coauthors of an article or book) we will provide a second option in the Zotero web save function to grab a permanent copy and URI from IA’s web archive. A scholar who shares this item in their library can then be sure that all others who choose to use it will be referring to the exact same document.

This is pretty fundamental to scholarship on the web. Of course when generating a time anchored permalink with zotero one can well expect that archive.org will on occasion not have a snapshot of said content, resulting in a 404. It would be great if archive.org could leverage these requests for snapshots as requests to go out and archive the page. One could imagine a blocking and nonblocking request: the former which would spawn a request to fetch a particular URI, stash content away, and return the permalink; and the latter which would just quickly return the best match its already got (which may be a 404).

Anyhow, it’s really good to see these two outfits working together. Nice work!

ps. dear lazyweb is there a documented archive.org api available?