canonical question

Posted on May 15th, 2009 in web | 4 Comments »

As the last post indicated I’m part of a team at loc.gov working on an application that serves up page views like this for historic newspapers–almost a million of them in fact. For each page view there is another URL for a view of the OCR text gleaned from that image, such as this. Yeah, kind of yuckster at the moment, but we’re working on that.

Perhaps it’s obvious, but the goal of making the OCR html view available is to let search engine crawlers come and index it. Then when someone searches Google for a name, say Dr. Herbert D. Burnham, they’ll land on page 3 of the 08/25/1901 issue of the New York Tribune. And this can happen without the searcher needing to know anything about the Chronicling America project beforehand. Classic SEO.

The OCR view is quite confusing at the moment, so we wanted to tell Google to link to the page zoom view in its search results instead. We reached for Google’s (and now other major search engines’) rel="canonical", since it seemed like a perfect fit.

… we now support a format that allows you to publicly specify your preferred version of a URL. If your site has identical or vastly similar content that’s accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version.
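To make the mechanism concrete, here is a minimal Python sketch of what a consumer of rel="canonical" has to do: find the link element in the OCR view and read its href. The markup in ocr_html is a hypothetical stand-in for our OCR page, not a copy of it, and the names CanonicalFinder and find_canonical are made up for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attrs = dict(attrs)
        # rel holds space-separated tokens; compare them case-insensitively.
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attrs.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical OCR view markup that points at the page zoom view.
ocr_html = """<html><head>
<link rel="canonical" href="http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/" />
</head><body>OCR text ...</body></html>"""

print(find_canonical(ocr_html))
```

This only shows how the assertion is read off the page. What a search engine then does with it is up to the search engine.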

From our logs we can see that Google has indeed come and fetched both the page viewer and the ocr view for this particular page, and also the text/plain and application/xml views.

66.249.71.166 - - [05/May/2009:23:31:51 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ HTTP/1.1" 200 15566 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.227 - - [06/May/2009:02:02:51 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15.pdf HTTP/1.1" 200 3119248 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.165 - - [06/May/2009:02:03:46 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/ HTTP/1.1" 200 47075 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.172 - - [06/May/2009:04:34:02 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr.txt HTTP/1.1" 200 40300 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
66.249.71.202 - - [06/May/2009:04:36:07 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr.xml HTTP/1.1" 200 1447056 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"
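If you want to pull the same information out of your own logs, something like the following Python sketch would do. It assumes the log format shown above (Apache combined format with one extra quoted field at the end, which the pattern simply ignores); LOG_LINE and googlebot_fetches are names invented for this example.

```python
import re

# Combined log format; anything after the user agent field is ignored.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_fetches(lines):
    """Yield (time, path, status) for each request whose agent mentions Googlebot."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            yield match.group("time"), match.group("path"), int(match.group("status"))


sample = [
    '66.249.71.166 - - [05/May/2009:23:31:51 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ HTTP/1.1" 200 15566 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"',
    '66.249.71.165 - - [06/May/2009:02:03:46 -0400] "GET /lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/ HTTP/1.1" 200 47075 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "*/*"',
]

for time, path, status in googlebot_fetches(sample):
    print(time, status, path)
```

Matching on the user agent string is a convenience, not proof: anyone can claim to be Googlebot in that header.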

But it doesn’t look like the ocr html view is being indexed, at least based on the results for this query I just ran. We can see the .txt file is showing up, which was harvested just after the OCR html view … so it really ought to be in the search results.

A bit of text in a recent www2009 paper Sitemaps: Above and Beyond the Crawl of Duty by Uri Schonfeld and Narayanan Shivakumar made me think …

Amazon.com also suffers from URL canonicalization issues, multiple URLs reference identical or similar content. For example, our (Google’s) Discovery crawl crawls both

  • http://…/B0000FEFEFW?showViewpoints=1
  • http://…/B0000FEFEFW?filterBy=addFiveStar

The two URLs return identical content and offer little value since these pages off two “different” views on an empty customer review list. Simple crawlers cannot detect these type of duplicate URLs without downloading all duplicate URLs first, processing their content, and wasting resources in the process.

So. Could it be that google crawls and indexes http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/, where it discovers http://chroniclingamerica.loc.gov/lccn/sn83030214/1901-08-25/ed-1/seq-15/ocr/ which it crawls and sees the canonical URI, which it knows it has already indexed, so it doesn’t waste any resources re-indexing?

It seems like a somewhat non-obvious (to me) side effect of asserting a canonical relationship with another URI is that Google will not index the document at the alternate URI. I guess I’m just learning to only use canonical when a site has “identical or vastly similar content that’s accessible through multiple URLs” … Does this seem about right to you?

(thanks Erik Wilde for the pointer to the Schonfeld and Shivakumar paper)

rest, the semantic web and my feeble brain

Posted on May 14th, 2009 in semweb, web | 15 Comments »

Imagine you were minting close to a million URIs for historic newspaper pages such as:

http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/

for pages like:

The web page allows you to zoom in quite close and see lots of detail in the page:

Now let’s say I want to describe this Newspaper Page in RDF. I need to decide what subject URI to hang the description off of. Should I consider this Newspaper Page resource an information resource, or a real world resource? The answer to this question determines whether or not I can hang my description of the page off the above URI, for example:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/>
  dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .

Or if I need to mint a new URI for the page as a real world thing:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page>
  dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .
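One practical property of the hash URI is worth spelling out: a client drops the fragment before it makes the HTTP request, so the #page URI stays a distinct identifier while still leading to a retrievable document. A small Python sketch (the helper name document_for is made up for this example):

```python
from urllib.parse import urldefrag


def document_for(uri):
    """Split a hash URI into the document URI a client would request and the fragment."""
    document_uri, fragment = urldefrag(uri)
    return document_uri, fragment


page_uri = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page"
print(document_for(page_uri))
```

So the two identifiers differ only by the fragment, and the server never sees which of the two the client started from.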

AWWW 1 provides some guidance:

By design a URI identifies one resource. We do not limit the scope of what might be a resource. The term “resource” is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as “resources”. The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as “information resources.”

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.

Can all of the essential characteristics of this newspaper page be sent down the wire as a message to a client? The text of the page is pretty legible after zooming in and you can see pictures, headlines, etc. You can’t feel the texture of the page itself, but you can’t in the microfilm that the page images were generated from. So I’m inclined to say yes.

Cool URIs for the Semantic Web also has some advice:

It is important to understand that using URIs, it is possible to identify both a thing (which may exist outside of the Web) and a Web document describing the thing. For example the person Alice is described on her homepage. Bob may not like the look of the homepage, but fancy the person Alice. So two URIs are needed, one for Alice, one for the homepage or a RDF document describing Alice. The question is where to draw the line between the case where either is possible and the case where only descriptions are available.

According to W3C guidelines ([AWWW], section 2.2.), we have a Web document (there called information resource) if all its essential characteristics can be conveyed in a message. Examples are a Web page, an image or a product catalog.

In HTTP, because a 200 response code should be sent when a Web document has been accessed, but a different setup is needed when publishing URIs that are meant to identify entities which are not Web documents.

This makes me think that I will need distinct identifiers for the abstract notion of the Newspaper Page, and the HTML document itself, if it is important to describe them separately. Say for example if I wanted to say the publisher of the web page was the Library of Congress, but the publisher of the Newspaper Page was Charles M. Shortridge. If I don’t have distinct identifiers I will have to say:

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/>
  dc:publisher <http://loc.gov>,
  <http://www.joincalifornia.com/candidate/12338>
  .

Pondering this Information Resource Sniff-Test got me re-reading Xiaoshu Wang’s paper URI Identity and Web Architecture Revisited. And I’ve come away more convinced that maybe he’s right: that the real issue lies in my vocabulary usage (dc:publisher in this example), and not with whether my URI identifies an Information Resource or not. So maybe new vocabulary is needed in order to describe the representation?

<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/>
  web:repPublisher <http://loc.gov> ;
  dcterms:publisher <http://www.joincalifornia.com/candidate/12338>
  .

But there isn’t a community of practice behind Xiaoshu’s position, at least not one like the Linked Data community. Unless perhaps his position is closer to that of the REST community, which is going strong at the moment, especially in AtomPub circles. Members of the linked-data/semweb community would most likely say that there need to be either hash or 303’ing URIs for the Newspaper Page, distinct from the URIs for the document describing the Newspaper Page. As a latecomer to the httpRange-14 debate I don’t think I ever internalized how REST and the Semantic Web are slightly out of tune w/ each other regarding resources on the web.

So. Should I have two different URIs: one for the real-world Newspaper Page, and one for the HTML document that describes that page? Is the Newspaper Page an Information Resource? Am I muddling up something here? Am I thinking too much? Should I just let sleeping dogs lie? Your opinion, advice, therapy would be greatly appreciated.

VocabularySoup (1)

Posted on March 26th, 2009 in metadata, ruby, semweb, web | No Comments »

It’s been great to see RDFa being picked up by web2.0 publishers like Digg and MySpace. You can use the RDFa Distiller to extract the RDFa from a given web page u by constructing a URI like:

http://www.w3.org/2007/08/pyRdfa/extract?format=turtle&uri=u

Which translates kind of nicely into a command line utility to add to your ~/bin:

#!/bin/sh
curl "http://www.w3.org/2007/08/pyRdfa/extract?format=turtle&uri=$1"
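One caveat with the script above: it pastes $1 into the query string unescaped, so a page URL that itself contains & or ? would be split into extra parameters. Here is a hedged Python sketch that builds the same distiller URL with the page address percent-encoded. The function name distiller_url is made up for this example, and the sketch only constructs the URL; it doesn't fetch anything.

```python
from urllib.parse import urlencode

# Distiller endpoint as given in the post.
DISTILLER = "http://www.w3.org/2007/08/pyRdfa/extract"


def distiller_url(page_url, fmt="turtle"):
    """Build the distiller request URL with page_url safely percent-encoded."""
    return DISTILLER + "?" + urlencode({"format": fmt, "uri": page_url})


print(distiller_url("http://www.myspace.com/yolatengo"))
```

Passing the result to curl (or urllib.request.urlopen) gives the same Turtle as the shell version for simple URLs.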

So with that little shell script in hand I can now look at the RDFa in something like Yo La Tengo’s page on MySpace:

ed@rorty:~$ rdfa http://www.myspace.com/yolatengo

@prefix myspace: <http://x.myspacecdn.com/modules/sitesearch/static/rdf/profileschema.rdf#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.myspace.com/YO LA TENGO> a myspace:MusicProfile ;
     myspace:profileType "Music" .

<http://www.myspace.com/yolatengo> xhv:stylesheet
         <http://x.myspacecdn.com/modules/common/static/css/global_j03fjftp.css>,
         <http://x.myspacecdn.com/modules/common/static/css/header/profileheader008.css>,
         <http://x.myspacecdn.com/modules/common/static/css/myspace_jvtnwmp4.css>,
         <http://x.myspacecdn.com/modules/common/static/css/profile_adl4r-y8.css>,
         <http://x.myspacecdn.com/modules/profiles/static/css/musicv2_wo4zzzd-.css> ;
     myspace:addToFriends <http://friends.myspace.com/index.cfm?fuseaction=invite.addfriend_verify&friendID=91362837> ;
     myspace:friendCount "33993" ;
     myspace:headline "\"<b>YO LA TENGO IS MURDERING THE CLASSICS</b>\""^^rdf:XMLLiteral ;
     myspace:photo <http://viewmorepics.myspace.com/index.cfm?fuseaction=user.viewAlbums&friendID=91362837> ;
     myspace:sendMessage <http://messaging.myspace.com/index.cfm?fuseaction=mail.message&friendID=91362837&MyToken=62964687-f06b-4b8b-8227-ba97f133a029> ;
     myspace:viewPictures <http://viewmorepics.myspace.com/index.cfm?fuseaction=user.viewAlbums&friendID=91362837> .

Today I learned that “the world’s largest community for sharing presentations” SlideShare is now using RDFa as well. For example here is the metadata SlideShare makes available for Tom Scott’s recent presentation at CERN for the 20th birthday of the web:

ed@rorty:~$ rdfa http://www.slideshare.net/derivadow/www20-what-does-the-history-of-the-web-tell-us-about-its-future

@prefix dc: <http://purl.org/dc/terms/> .
@prefix hx: <http://purl.org/NET/hinclude> .
@prefix media: <http://search.yahoo.com/searchmonkey/media/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.slideshare.net/derivadow/www20-what-does-the-history-of-the-web-tell-us-about-its-future> dc:creator "Tom Scott"@en ;
     dc:description "Following my invitation to speak at the WWW@20 celebrations - this is my attempt to squash the interesting bits into a s"@en ;
     media:height "355"@en ;
     media:presentation <http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=www20departmentatalpresentation-090325122157-phpapp02&stripped_title=www20-what-does-the-history-of-the-web-tell-us-about-its-future> ;
     media:thumbnail <http://cdn.slidesharecdn.com/www20departmentatalpresentation-090325122157-phpapp02-thumbnail?1238020296> ;
     media:title "www@20 what does the history of the web tell us about its future?"@en ;
     media:width "425"@en ;
     xhv:alternate <http://www.slideshare.net/rss/latest> ;
     xhv:icon <http://www.slideshare.net/favicon.ico> ;
     xhv:stylesheet <http://public.slidesharecdn.com/v3/styles/slideview.css?1238021672> .

I guess it’s nerdy but I find it really interesting to look at the vocabulary usage. You can see SlideShare is using Yahoo’s media vocabulary as well as DublinCore; and MySpace has opted to create their own vocabulary. The really wonderful thing about RDF is that it allows you to reuse parts of someone else’s vocabulary, in addition to creating your own, or doing both. As a technology RDF encourages this, as do documents like How to Publish Linked Data on the Web and the Semantic Web FAQ.

A common perception of the 14-year Dublin Core effort is that it has largely been about coming to consensus about a set of vocabulary terms to use when describing web resources. I think it’s important to recognize that the Dublin Core community has also been a role model for how to create and share your vocabulary on the web so it can be assembled, discovered, understood, used, and remixed. More recently the Microformats community has done something similar, but by targeting web developers (who are actually coding up the HTML) rather than library/infosci professionals. The real message of the Dublin Core and Microformats efforts isn’t that there ought to be one vocabulary to describe information resources, but that we can use the web to collaboratively build and deploy the vocabularies we need.

As we see more and more metadata making it online as RDFa, LinkedData and Microformats the community really needs tool support for visualizing vocabulary use. These tools will help data publishers choose which vocabularies to use in their descriptions. They will also help consumers and harvesters of the web work out which vocabularies are important to understand (a.k.a. write code for). How can we make this easier?

I guess the simplest visualization is the ‘view source’ feature that was built into early web browsers, and enabled the propagation of HTML–which is what my command line shell script approximates, and other plugins like Operator and Fuzz make much more friendly. Another approach is to throw a query at an index like Sindice which indexes large swathes of linked data, rdfa and microformats, and easily click through to the “Ontologies” view for a search result that lists the vocabularies used. Jay Luker covered some of these approaches in his Vocabularies for Linked Data: Finding, Selecting, Creating presentation at code4lib last month.

But it would be really interesting to see more tools that detailed vocabulary usage in a more aggregated way–kind of like what Google did in 2005 for HTML in their Web Authoring Statistics. Are some people already doing this? I hope you know of something I don’t.
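As a very rough idea of what the smallest such tool might look like, here is a Python sketch that tallies how often each declared namespace is used in a chunk of Turtle like the distiller output above. It is regex-based and deliberately naive: it is not a Turtle parser, it skips datatype annotations, and it will miscount on anything fancier than the simple output shown in this post. The name vocabulary_usage and the sample data are made up for this example.

```python
import re
from collections import Counter

PREFIX_DECL = re.compile(r'@prefix\s+([\w-]*):\s*<([^>]*)>\s*\.')
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')
IRI = re.compile(r'<[^>]*>')
# A prefixed name such as dc:creator; the lookbehind skips ^^datatype annotations.
QNAME = re.compile(r'(?<![\w:^])([A-Za-z][\w-]*):[A-Za-z_][\w-]*')


def vocabulary_usage(turtle):
    """Count prefixed-name occurrences per namespace URI (naive, regex-based)."""
    namespaces = dict(PREFIX_DECL.findall(turtle))
    body = PREFIX_DECL.sub("", turtle)
    body = STRING.sub('""', body)  # blank out literals first: they may contain < or :
    body = IRI.sub("<>", body)     # then blank out full IRIs
    counts = Counter()
    for prefix in QNAME.findall(body):
        if prefix in namespaces:
            counts[namespaces[prefix]] += 1
    return counts


sample = '''
@prefix dc: <http://purl.org/dc/terms/> .
@prefix media: <http://search.yahoo.com/searchmonkey/media/> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .

<http://example.org/slides> dc:creator "Tom Scott"@en ;
     media:height "355"@en ;
     media:width "425"@en ;
     xhv:icon <http://example.org/favicon.ico> .
'''

for namespace, count in vocabulary_usage(sample).most_common():
    print(count, namespace)
```

Run over the output for many pages and summed, this would give a crude picture of which vocabularies are actually in use. A real tool would want a proper RDF parser underneath.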

Up next in part 2 (if I ever get the nerve to publish it) my insane ramblings about why I think XML Schema is nice, but not really web friendly enough to encourage metadata vocabulary use/reuse on the web.

LibraryThing Ubuntu Screen Saver

Posted on March 11th, 2009 in python | 1 Comment »

I read about the LibraryThing Mac Screensaver and of course wanted the same thing for my Ubuntu workstation at $work. Naturally, I’m supposed to be working on some high-priority tickets on a tight deadline…so I started to work right away on how to do this. Your tax dollars at work, etc…

I’m sure that there’s a much more elegant way of doing this, but I basically created a simple python program extract-images that will pull image urls out of arbitrary text, suck down the images, and dump them to a directory. This can be combined with cron and the standard GLSlideshow screensaver, which displays a slideshow of images in a particular directory.
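To be clear, what follows is not the extract-images program itself, just a minimal Python sketch of its first step as described above: pulling image URLs out of arbitrary text. The regular expression, the list of file extensions and the name image_urls are all assumptions made for this example.

```python
import re

# Assumed pattern: an http(s) URL ending in a common image extension.
IMAGE_URL = re.compile(r'https?://[^\s"\'<>]+?\.(?:jpe?g|png|gif)\b', re.IGNORECASE)


def image_urls(text):
    """Return the image URLs found in text, in order, without duplicates."""
    return list(dict.fromkeys(IMAGE_URL.findall(text)))


sample = (
    '<img src="http://example.org/covers/1.jpg"> and '
    'http://example.org/covers/2.PNG, plus http://example.org/page.html '
    'and again http://example.org/covers/1.jpg'
)
print(image_urls(sample))
```

Each URL could then be saved into the covers directory with something like urllib.request.urlretrieve.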

So you just download extract-images, put it in your path, add a crontab entry like (substituting edsu for your LibraryThing username):

00 14 * * * extract-images http://www.librarything.com/labs-screensaver.php?userid=edsu /home/ed/Pictures/covers

And then tell GLSlideshow where your images are by adding this to your ~/.xscreensaver

imageDirectory:   /home/ed/Pictures/covers
chooseRandomImages:   True

Dear $manager, it really didn’t take me that long to do this. Honest!

APIs Suck

Posted on March 5th, 2009 in libraries, politics | 2 Comments »

With TransparencyCamp last weekend, news of the mandated use of feed syndication by Federal Agencies receiving funds from the Recovery Act, recent blog posts by Tim O’Reilly and the Special Libraries Association, an article in Newsweek, news of Carl Malamud’s bid to become the Public Printer of the United States (aka head of the GPO), and the W3C eGov meeting coming up next week it looks like issues related to public access to government data (specifically Library of Congress bibliographic and legislative data) are hitting the mainstream media, and getting political mind-share. Exciting times.

One thing that bubbled up at code4lib2009 last week was the notion that APIs Suck. Not that web2.0 APIs are wrong or bad…they’re actually great, especially when compared to a world where no machine access to the data existed before. The point is that sometimes just having access to the raw data in the ‘lowest level format’ is the ideal. Rather than service providers trying to guess what you are trying to do with their data, and absorbing the computational responsibility of delivering it, why not make the data readily available using a protocol like HTTP? Put the data in a directory, turn on Indexes, do some sensible caching, and maybe gzip compression and let people grab it, and robots crawl it. Or maybe use something like Amazon Public Datasets. It seems like a relatively easy first step that involves very little custom software development, and one with the ability to make a huge impact.

I’m a federal employee, so I really can’t come out and formally advocate directly for political appointments. But I have to say it would be great to see someone like Malamud at the helm of the GPO, since he’s been doing just this kind of work for 20 years. Exciting times.

c4l09

Posted on March 3rd, 2009 in libraries, people | 1 Comment »

So code4lib2009 was a whole lot of fun. The amazing thing about the conference isn’t really reflected in the program of talks. I feel like I can say that since I was one of them.

The real value is the social space and the time to talk to people you’ve seen online, throw around ideas, get background/contextual information on projects, etc. Hats off to Jean Rainwater and Birkin Diana for picking a beautifully casual and intimate hotel to hold the conference in.

It’s taken me a few days to get some perspective on all that happened. In the meantime I’ve read a few accounts that capture important aspects of the event from: Terry Reese, Jon Phipps, Jay Luker, Declan Fleming, Richard Wallis (1,2,3), Dan Chudnov, Gabe Farrell.

The Linked Data Pre-conference was quite valuable. For one it gave attendees some experience in what it means to publish data in a distributed way, and to write code to aggregate it using an attendees/FOAF experiment. Mike Giarlo aptly surmised from this that the key points for teaching beginners about linked data are that:

  1. Validators are essential
  2. You are not your FOAF

In other words:

  1. Am I doing this rdf/xml, turtle, rdfa right?
  2. ZOMG, httpRange-14!

Ian Davis presented the basics of RDF for people who are already familiar with traditional data management. Apparently Ian’s slides hit #1 for the day on SlideShare, which highlights the interest in linked data that is percolating through the Web. The pre-conf was very well attended as well.

Some folks like Jonathan Brinley and Michael Klein were able to hack on a Supybot Plugin to work with the FOAF data generated by the crawler. I also got chatting with William Denton about the potential of linked data for FRBR/RDA efforts. Unfortunately I didn’t hear about Alistair Miles’ new project on google-code for exploring the translation of traditional MARC/MODS into RDA/FRBR until after the event. Most of the other slides from presenters at the pre-conf are available from the wiki page.

I was really struck by some of the issues that Dan Chudnov raised in his talk about Caching and Proxying Linked Data right before lunch. In particular his comparison of the Linking Open Data Cloud to what libraries understand as their ready reference collection:


See p.9 of Dan’s slides

Dan explored how we need to think about the technical and administrative details of managing linked-data if linked-data is to be taken seriously by the library community. Relatedly the pre-conf gave me an opportunity to publicly apologize to Anders Söderbäck for yanking lcsh.info offline in such an abrupt manner, and disturbing his links from subject authority records at libris.kb.se to lcsh.info. Dan’s ideas for consuming library linked data and Anders’ and my experience publishing library linked data gelled nicely in my brain. Similar ideas from Jon Phipps (one of the authors of Best Practice Recipes for Publishing RDF Vocabularies) have led me to believe this could be a nice little area for some research.

Prepping for the pre-conference itself was good fun, since it led me to discover a series of connections between the early development of the www and Brown University (where the conference was being held) and the history of hyperdata/text: in a nutshell it was Tim Berners-Lee’s proposal for the web -> Dynatext -> Steve DeRose -> Andy van Dam -> Hypertext Editing System -> Ted Nelson -> Doug Engelbart -> Vannevar Bush. Yeah, I guess you had to be there … or maybe that didn’t help. At any rate the slides, complete with breakdancing instructions are available.

I haven’t even started talking about the main event yet. The things I took away from the 3 days of presentations and talks, in no particular order were:

  • I want to learn more about the Author-ID effort that Geoffrey Bilder talked about
  • Stefano Mazzocchi’s keynote and Sean Hannan’s presentation convinced me that I need to understand and play with Freebase’s JavaScript application development environment Acre and the sparql-ish, query by example Metaweb Query Language (MQL). It seems like Freebase is exploring some really interesting territory in building a shared knowledge base of machine readable, human editable data, which can sit behind a seemingly infinite amount of web presentation layers.
  • Terence Ingram’s presentation, Ross Singer’s presentation about Jangle, me and Mike’s SWORD presentation, and a chat with Fedora/REST proponent Matt Zumwalt, and hearing about the Talis Platform have convinced me that real REST has got mind-share and traction in the library technology world.
  • Ian Davis’ keynote on the second day captured for me how constant a challenge it is to stay true to the roots of the web, and how important it is to do so. It was really interesting to hear how he emphasized the importance of data over code, and the necessity of decentralization over centralization.
  • Chatting with Jodi Schneider and William Denton and listening to their presentation made me want to understand RDA and FRBR at a practical level. This includes getting into the vocabularies that are being developed, and trying to convert some data. The history of FRBR in particular as told by Bill is also a gateway into a really fascinating history of cataloging. Also the work that Diane Hillman and Jon Phipps have been doing to enable vocabulary development like RDA/FRBR seems really important to keep abreast of.

More tidbits will probably float into my blog or into my tweets over the coming weeks, as the beer wears off, and the ideas sink in. But for now I’ll leave you with some of my favorite photos from the conference. It’s the people that make code4lib what it is. It was great to connect up, and meet new folks in the field.

 

Oh and in case you missed it, the tweetstream and the other fine photos.

the importance of being crawled

Posted on January 29th, 2009 in web | No Comments »

While lcsh.info was up and running harvesters actively crawled it. At its core all lcsh.info did was mint a URI for every Library of Congress Subject Heading. This is similar in spirit to Brewster Kahle’s more ambitious OpenLibrary project to mint a URI for every book, or in his words:

One web page for every book

Aside: It’s also similar in spirit to RESTful web development, and to the linked data, semantic web effort generally.

Minting a URI for every Library of Congress Subject Heading meant that there were lots of densely interlinked pages. Some researchers at Stanford did a data visualization of LCSH two years ago, which illustrates just how deeply linked LCSH is:

I wanted lcsh.info to get crawled so I intentionally put some high level, well connected concepts (Humanities, Science, etc) on the home page to provide a doorway for web crawlers to walk through into the site and begin discovering all the broader, narrower, related links between concepts–without having to perform a search.

So lcsh.info is down now, but it turns out you can still see its shadow living on in quite a usable form in web search engines. For example type this into any of the big three search engines:

site:lcsh.info mathematics

And you’ll see:

Google

Yahoo



Microsoft



It’s interesting that (unlike Google and Yahoo) Microsoft’s relevancy ranking actually puts the heading for “Mathematics” at the top. Also note that simple things like giving the page a good title, and descriptive text make the heading show up in usable form in each search engine.

It’s not too surprising that trying the same for authorities.loc.gov doesn’t work out so well. Umm, yeah http://authorities.loc.gov/robots.txt

On the one hand, I’m just being nostalgic looking at the content that once was there &sigh;. But on the other there seems to be a powerful message here, that putting data out onto the open web, and making it crawlable means your content is viewable via lots of different lenses. Maybe you don’t have to get search exactly right on your website, let other people do it for you.

Two other things come to mind: LOCKSS and Brewster’s even more ambitious project. I’ve been sort of hoping that somehow or another the Internet Archive and the Open Library would find their way into being publicly funded projects. What if? I can daydream right?

crawling bibliographic data

Posted on January 22nd, 2009 in html, libraries, web, worldcat | 2 Comments »

Today’s Guardian article Why you can’t find a library book in your search engine prompted me to look at Worldcat’s robots.txt file for the first time. Part of the beauty of the web is that it’s an open information space where anyone (people and robots) can start with a single URL and follow their nose to other URLs. This seemingly simple principle is what has allowed an advertising^w search company like Google (that we all use every day) to grow and prosper.

The robots.txt file is a simple mechanism that allows web publishers to tell web crawlers what they are allowed to look at on a website. Predictably, the file always lives at the root of a website under the name robots.txt. You don’t have to have one, but many publishers like to control what gets indexed on their website, sometimes to hide content, and other times to shield what may be costly server side operations. Anyway, here’s what you see today for worldcat.org:

User-agent: *
Disallow: /search

Sitemap: http://worldcat.org/identities/sitemap_index.xml

So this instructs a web crawler to not follow any links that match /search in the path, such as:

http://www.worldcat.org/search?qt=worldcat_org_all&q=everything+is+miscellaneous

Now if you look on the homepage for Worldcat there are very few links into the dense bibliographic information space that is worldcat. But you’ll notice a few in the lower left box “Create lists”. So a crawler could for example discover a link to:

http://www.worldcat.org/oclc/77271226

This URL is allowed by the robots.txt so the harvester could go on to that page. Once at that item page there are lots of links to other bibliographic records: but notice the ones to other record displays all seem to match the /search pattern disallowed by the robots.txt, such as:

http://www.worldcat.org/search?q=au%3AC++S+Harris&qt=hot_author

or

http://www.worldcat.org/search?q=su%3ALondon+%28England%29+Fiction.&qt=hot_subject

So a web crawler will not be able to wander into the rich syndetic structure of Worldcat and start indexing.
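You can check this reasoning mechanically with the robots.txt parser in Python’s standard library. The sketch below feeds it the rules quoted above as a string (nothing is fetched from worldcat.org), and the helper name allowed is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in this post, inlined so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search

Sitemap: http://worldcat.org/identities/sitemap_index.xml
"""


def allowed(url, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if the given rules let agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for url in [
    "http://www.worldcat.org/oclc/77271226",
    "http://www.worldcat.org/search?q=au%3AC++S+Harris&qt=hot_author",
]:
    print(allowed(url), url)
```

The item page comes back as fetchable and the author search link does not, which is the dead end described above.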

However, all is not lost. Notice above that OCLC does reference a Worldcat sitemap in their robots.txt. Sitemaps are a lightweight mechanism that Yahoo, Google and Microsoft developed for instructing a web harvester on how to walk through a site.

So if we look at OCLC’s sitemap index we’ll see this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd">
    <sitemap>
      <loc>http://worldcat.org/identities/lccn-no99-80690.sitemap.xml</loc>
      <lastmod>2008-05-19</lastmod>
    </sitemap>
    <sitemap>
      <loc>http://worldcat.org/identities/lccn-sh95-8559.sitemap.xml</loc>
      <lastmod>2008-05-19</lastmod>
    </sitemap>
  </sitemapindex>

This essentially defers to two other sitemaps. The first 30 lines of the first one (careful when clicking, it’s big!) look like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://worldcat.org/identities/lccn-no99-80690</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n78-95332</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n79-41716</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n80-92173</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  ...
</urlset>
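For anyone who wants to poke at these files, here is a rough sketch of how a harvester might read either kind of document with Python’s standard xml.etree.ElementTree. The function name read_sitemap and the trimmed sample documents are mine, for illustration only; real crawlers surely do more (fetching, gzip, lastmod handling).

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemap documents shown above.
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def read_sitemap(xml_bytes):
    """Return (kind, locs) where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_bytes)
    if root.tag == SM + "sitemapindex":
        kind, child = "index", "sitemap"
    elif root.tag == SM + "urlset":
        kind, child = "urlset", "url"
    else:
        raise ValueError("not a sitemap document: " + root.tag)
    locs = [el.findtext(SM + "loc", default="").strip()
            for el in root.findall(SM + child)]
    return kind, locs


# Trimmed copies of the two documents above.
INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://worldcat.org/identities/lccn-no99-80690.sitemap.xml</loc>
    <lastmod>2008-05-19</lastmod>
  </sitemap>
</sitemapindex>"""

URLSET = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://worldcat.org/identities/lccn-no99-80690</loc>
    <lastmod>2008-05-19</lastmod>
  </url>
</urlset>"""

print(read_sitemap(INDEX))
print(read_sitemap(URLSET))
```

An index yields more sitemaps to fetch; a urlset yields the pages themselves.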

Now we can see the beauty of sitemaps. They are basically just an XML representation for sets of web resources, much like syndicated feeds. There are actually 40,000 links listed in the first sitemap file, and 12,496 in the second. Now URLs like

http://worldcat.org/identities/lccn-no99-80690

are clearly allowed by the robots.txt file. So indexers can wander around and index the lovely identities portion of Worldcat. It’s interesting, though, that the content served up by the identities portion of Worldcat is not HTML: it’s XML that’s transformed client side to HTML with XSLT. So it’s unclear how much a stock web crawler would be able to discover from the XML. If the Google, Yahoo and Microsoft crawlers are able to apply the XSLT transform, they will get some HTML to chew on. But notice in the HTML view that all the links into Worldcat proper (that aren’t other identities) are disallowed because they start with /search.

And a quick grep and perl pipeline confirms that all 52,496 URLs in the two sitemaps point to the identities portion of the site…
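The pipeline itself isn’t reproduced here, so what follows is only a rough Python equivalent of that kind of check, not the original commands. It tallies every <loc> in a sitemap by the first segment of its URL path; count_sections and the two-entry SAMPLE are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET

LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def count_sections(sitemap_xml):
    """Count <loc> URLs by their first path segment, e.g. 'identities'."""
    counts = Counter()
    for loc in ET.fromstring(sitemap_xml).iter(LOC):
        path = urlsplit(loc.text.strip()).path
        counts[path.strip("/").split("/")[0]] += 1
    return counts


# Two entries taken from the sitemap excerpt above.
SAMPLE = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://worldcat.org/identities/lccn-no99-80690</loc></url>
  <url><loc>http://worldcat.org/identities/lccn-n78-95332</loc></url>
</urlset>"""

print(count_sections(SAMPLE))
```

If anything other than identities showed up as a key, the sitemap would be pointing crawlers somewhere else on the site.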

So this is a long way of asking: I wonder if web crawlers are crawling the books views on Worldcat at all? I imagine someone else has written about this already, and there is a known answer, but I felt like writing about the web and library data anyhow.

Since OCLC has gone through the effort of providing a web presentation for millions of books, and even links out to the libraries that hold them, they seem uniquely positioned to provide a global gateway for web crawlers to the library catalogs around the world. The links from Worldcat out to the rest of the world’s catalogs would turn OCLC into a bibliographic super node in the graph of the web, much like Amazon and Google Books. But perhaps this is perceived as giving up the family jewels? Or maybe it would put too much stress on the system? Of course it would also be great to see machine readable data served up in a similar linked way.

So in conclusion, it would be awesome to see either (or maybe both):

  • the /search exclusion removed from the robots.txt file
  • sitemaps added for the web resources that look like http://www.worldcat.org/oclc/77271226

Of course one of the big projects I work on at LC is Chronicling America which is currently excluded by LC’s robots.txt…so I know that there can be real reasons for restricting crawling access (in our case performance problems we are trying to fix).


Oh gosh, I just noticed when re-reading the Guardian article that my lcsh.info experiment was mentioned. Hopefully there will be good news to report from LC on this front shortly.

work identifiers and the web

Posted on January 21st, 2009 in html, libraries, semweb, worldcat | 9 Comments »

Michael Smethurst’s In Search of Cultural Identifiers post over at the BBC Radio Labs got me thinking about web identifiers for works, about LibraryThing and OCLC as linked library data providers, and finally about the International Standard Text Code. Admittedly it’s kind of a hodge-podge of topics, and I’m going to be taking some liberties with what ‘linked data’ and ‘works’ mean, so bear with me.

Both OCLC Worldcat and LibraryThing mint URIs for bibliographic works, like these for Wide Sargasso Sea:

So the library community really does have web identifiers for works–or more precisely web identifiers for human readable records about works. What’s missing (IMHO) is the ability to use that identifier to get back something meaningful for a machine. Tools like Zotero need to scrape the screen to pull out the data points of interest to citation management. Sure, if you want you can implement COinS or unAPI to allow the metadata to be extracted, but could there be a more web-friendly way of doing this?

Consider how blog syndication works on the web. You visit a blog (like this one) and your browser is able to magically figure out the location of an RSS or Atom feed for the blog, and give you an option to subscribe to it.

Well, it’s not really magic; it’s just a bit of markup in the HTML:

<link rel="alternate" 
         type="application/rss+xml" 
         title="inkdroid RSS Feed" 
         href="http://inkdroid.org/journal/feed/" />

Simple right?

Now back to work identifiers. Consider that both Worldcat and LibraryThing have Web 2.0 APIs for retrieving machine readable data for a work:

http://www.librarything.com/services/rest/1.0/?method=librarything.ck.getwork&id={work_id}&apikey={your_key}

or:

http://www.worldcat.org/webservices/catalog/content/{oclc_number}?wskey={key}

What if the web pages for these resources at OCLC and LibraryThing linked directly to these machine readable versions? For example if the page for Wide Sargasso Sea at LibraryThing contained this in its <head> element:

<link rel="alternate" 
         type="application/xml" 
         title="XML for Wide Sargasso Sea" 
         href="http://www.librarything.com/services/rest/1.0/?method=librarything.ck.getwork&id=27239&apikey=d231aa37c9b4f5d304a60a3d0ad1dad4" />

This would allow browsers, plugin tools like Zotero and web crawlers to follow the natural grain of the web and discover the machine readable representation. Admittedly this is something that COinS and unAPI are designed to do. But the COinS and unAPI protocols are really optimized for making citation data and non-web identifiers available and routable via a resolver of some kind. Maybe I’m just overreaching a bit, but this approach of using the <link> element seems to embrace the notion that there are resources within the Worldcat and LibraryThing websites, and that there can be alternate representations of those resources that can be discovered in a hypertext-driven way.
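The discovery step is small enough to sketch. Here is a minimal example, using only Python’s standard html.parser, of how a client might pull rel="alternate" links out of a page. The class and function names are mine, and the sample page is just the feed markup from this blog shown earlier; the same code would pick up an XML link like the hypothetical LibraryThing one.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rels = (attributes.get("rel") or "").lower().split()
        if "alternate" in rels and attributes.get("href"):
            self.alternates.append((attributes.get("type"), attributes["href"]))


def find_alternates(html):
    """Return every alternate representation advertised in the page."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.alternates


PAGE = """<html><head>
<title>inkdroid</title>
<link rel="alternate"
      type="application/rss+xml"
      title="inkdroid RSS Feed"
      href="http://inkdroid.org/journal/feed/" />
</head><body></body></html>"""

print(find_alternates(PAGE))
```

A tool could then choose among the results by media type, taking application/rss+xml for a feed reader or application/xml for something like a citation manager.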

Of course there is the issue of the API key. In the example above I used the demo key in LibraryThing’s docs. More important in the context of web identifiers for works is the need to distinguish between the identifier for the record, and the identifier for the concept of the work, which is most elegantly solved (IMHO) by following a pattern from the Cool URIs for the Semantic Web doc. But I think it’s important that people realize that it’s not necessary to jump headlong into RDF to start leveraging some of the principles behind the Architecture of the World Wide Web. Henry Thompson has a nice web-centric discussion of this issue in his What’s a URI and Why Does it Matter?

While writing this blog post I noticed a thread over on Autocat announcing that Bowker has been named the US Registrar for the International Standard Text Code. The gist is that the ISTC will be a “global identification system for textual works”, and that registrars (like Bowker) will mint identifiers for works, such as:

ISTC 0A9 2002 12B4A105 7

Where the structure of the identifier is roughly:

ISTC {registration agency} {year element} {work element} {check digit}

It’s interesting that the meat of the ISTC is the work element that is:

… assigned automatically by the central ISTC registration system after a metadata record has been submitted for registration and the system has verified that the record is unique;

The metadata record in question is actually a chunk of ONIX, which presumably Bowker will send to the ISTC central registrar, and get back a work id.

This work that the ISTC is taking on is really important–and one would imagine quite costly. One thing I would suggest to them is that they may want to make the ISTC codes have a URI equivalent like:

http://istc-international/0A9/2002/12B4A1057

They also should encourage Bowker and other registrars to publish their work identifiers on the web:

http://bowker.com/istc/0A9/2002/12B4A1057
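Turning the printed form into a URI like that is mostly string handling. Below is a small sketch under some loud assumptions: the field widths in the pattern (3, 4, 8 and 1 characters, hexadecimal apart from the year) are inferred from the one example above, the check digit is carried along but not verified, and the base URL is simply the hypothetical Bowker address from this post.

```python
import re
from typing import NamedTuple


class ISTC(NamedTuple):
    agency: str  # registration agency
    year: str    # year element
    work: str    # work element
    check: str   # check digit (not verified here)


# Assumed layout, inferred from the single example in the post.
ISTC_PATTERN = re.compile(
    r"^(?:ISTC\s+)?([0-9A-F]{3})[\s-]?(\d{4})[\s-]?([0-9A-F]{8})[\s-]?([0-9A-F])$",
    re.IGNORECASE,
)


def parse_istc(text):
    """Split a printed ISTC into its four elements."""
    match = ISTC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an ISTC: {text!r}")
    return ISTC(*(part.upper() for part in match.groups()))


def istc_uri(istc, base="http://bowker.com/istc/"):
    """Build a URI form of the identifier under a hypothetical base URL."""
    return f"{base}{istc.agency}/{istc.year}/{istc.work}{istc.check}"


code = parse_istc("ISTC 0A9 2002 12B4A105 7")
print(code)
print(istc_uri(code))
```

Passing a different base would give each registrar its own URI for the same work, which is the distributed arrangement I’m arguing for.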

It seems to me that we might (in the long term) be better served by a system that embraces the distributed nature of the web. A web in which organizations like Bowker, ISTC, OCLC, LibraryThing, Library of Congress and national libraries publish their work identifiers using URIs, and return meaningful metadata for them. Rather than waiting for someone else to solve our problems top-down, why don’t we start solving them ourselves bottom-up?

Anyhow I feel like I’m kind of being messy in suggesting this linked-data-lite idea. Is it heresy? My alibi/excuse is that I’ve been sitting in the same room as dchud for extended periods of time.

q & a

Posted on January 7th, 2009 in life | 8 Comments »

Q: What do 100 year old knitting patterns and a lost Robert Louis Stevenson story have in common?

A: A digitally preserved newspaper page.

Q: What about if you add:

A: Just a typical lunch time conversation at Pete’s with a couple people I work with. The cool thing (for me) is that this is normal, involves a host of smart/interesting characters, and is routinely encouraged. I love my job.