crawling bibliographic data
Today's Guardian article "Why you can't find a library book in your search engine" prompted me to look at Worldcat's robots.txt file for the first time. Part of the beauty of the web is that it's an open information space where anyone (people and robots) can start with a single URL and follow their nose to other URLs. This seemingly simple principle is what has allowed a advertising^w search company like Google (that we all use every day) to grow and prosper.
The robots.txt file is a simple mechanism that allows web publishers to tell web crawlers what they are allowed to look at on a website. Predictably, it is always found at the root of a website in a file named robots.txt. You don't have to have one, but many publishers like to control what gets indexed on their website, sometimes to hide content, and other times to shield what may be costly server-side operations. Anyway, here's what you see today for worldcat.org:
User-agent: *
Disallow: /search
Sitemap: http://worldcat.org/identities/sitemap_index.xml
So this instructs a web crawler not to follow any links whose path starts with /search, such as:
http://www.worldcat.org/search?qt=worldcat_org_all&q=everything+is+miscellaneous
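To make the rule concrete, here is a minimal sketch that feeds the three robots.txt lines above to Python's standard urllib.robotparser and asks whether two URLs may be fetched. The build_parser helper and the "ExampleBot" user agent name are hypothetical, made up for illustration; the identities URL is one of those listed in OCLC's sitemap.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted above, copied verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Sitemap: http://worldcat.org/identities/sitemap_index.xml
"""


def build_parser(robots_txt):
    """Hypothetical helper: build a parser from robots.txt text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

search_url = "http://www.worldcat.org/search?qt=worldcat_org_all&q=everything+is+miscellaneous"
identity_url = "http://worldcat.org/identities/lccn-no99-80690"

# "ExampleBot" is an invented crawler name; the "User-agent: *" rules apply to it.
for url in (search_url, identity_url):
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "disallowed"
    print(verdict, url)
```

Under these assumptions the /search URL should come back disallowed and the identities URL allowed, since the only Disallow rule is a prefix match on /search.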
Now if you look on the homepage for Worldcat there are very few links into the dense bibliographic information space that is Worldcat. But you'll notice a few in the lower left box "Create lists". So a crawler could for example discover a link to:
This URL is allowed by the robots.txt so the harvester could go on to that page. Once at that item page there are lots of links to other bibliographic records: but notice the ones to other record displays all seem to match the /search pattern disallowed by the robots.txt, such as:
http://www.worldcat.org/search?q=au%3AC++S+Harris&qt=hot_author
or
http://www.worldcat.org/search?q=su%3ALondon+%28England%29+Fiction.&qt=hot_subject
So a web crawler will not be able to wander into the rich syndetic structure of Worldcat and start indexing.
However, all is not lost. Notice above that OCLC does reference a Worldcat sitemap in their robots.txt. Sitemaps are a lightweight mechanism that Yahoo, Google and Microsoft developed for instructing a web harvester on how to walk through a site.
So if we look at OCLC's sitemap we'll see this:
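<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex>
  <sitemap>
    <loc>http://worldcat.org/identities/lccn-no99-80690.sitemap.xml</loc>
    <lastmod>2008-05-19</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://worldcat.org/identities/lccn-sh95-8559.sitemap.xml</loc>
    <lastmod>2008-05-19</lastmod>
  </sitemap>
</sitemapindex>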
This essentially defers to two other sitemaps. The first 30 lines of the first one (careful in clicking, it's big!) look like:
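<?xml version="1.0" encoding="UTF-8"?>
<urlset>
  <url>
    <loc>http://worldcat.org/identities/lccn-no99-80690</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n78-95332</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n79-41716</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  <url>
    <loc>http://worldcat.org/identities/lccn-n80-92173</loc>
    <lastmod>2008-05-19</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0000</priority>
  </url>
  ...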
Now we can see the beauty of sitemaps. They are basically just an XML representation for sets of web resources, much like syndicated feeds. There are actually 40,000 links listed in the first sitemap file, and 12,496 in the second. Now URLs like
are clearly allowed by the robots.txt file. So indexers can wander around and index the lovely identities portion of Worldcat. It's interesting, though, that the content served up by the identities portion of Worldcat is not HTML: it's XML that's transformed client side to HTML with XSLT. So it's unclear how much a stock web crawler would be able to discover from the XML. If Google's, Yahoo's or Microsoft's crawlers are able to apply the XSLT transform, they will get some HTML to chew on. But notice in the HTML view that all the links into Worldcat proper (that aren't other identities) are disallowed because they start with /search.
And a quick grep and perl pipeline confirm that all 52,496 URLs in the sitemap are to the identities portion of the site…
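The same kind of check can be sketched in Python rather than grep and perl. This is not the pipeline I ran, just an illustration under stated assumptions: sitemap_locs and outside_prefix are hypothetical helper names, and SAMPLE is a two-entry stand-in for the real sitemap files, which you would download and pass in as text instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Stand-in for a downloaded sitemap; the real files hold tens of thousands of entries.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset>
  <url><loc>http://worldcat.org/identities/lccn-no99-80690</loc></url>
  <url><loc>http://worldcat.org/identities/lccn-n78-95332</loc></url>
</urlset>"""


def sitemap_locs(xml_text):
    """Return the text of every <loc> element, with or without an XML namespace."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    locs = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


def outside_prefix(urls, prefix="/identities/"):
    """Return the URLs whose path does not start with the given prefix."""
    return [url for url in urls if not urlparse(url).path.startswith(prefix)]


locs = sitemap_locs(SAMPLE)
strays = outside_prefix(locs)
print(len(locs), "URLs checked,", len(strays), "outside /identities/")
```

An empty list from outside_prefix means every URL in the sitemap points into the identities portion of the site.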
So this is a long way of asking: I wonder if web crawlers are crawling the book views on Worldcat at all? I imagine someone else has written about this already, and there is a known answer, but I felt like writing about the web and library data anyhow.
Since OCLC has gone through the effort of providing a web presentation for millions of books, and even links out to the libraries that hold them, they seem uniquely positioned to provide a global gateway for web crawlers to the library catalogs around the world. The links from Worldcat out to the rest of the world's catalogs would turn OCLC into a bibliographic super node in the graph of the web, much like Amazon and Google Books. But perhaps this is perceived as giving up the family jewels? Or maybe it would put too much stress on the system? Of course it would also be great to see machine readable data served up in a similar linked way.
So in conclusion, it would be awesome to see either (or maybe both):
- the /search exclusion removed from the robots.txt file
- sitemaps added for the web resources that look like http://www.worldcat.org/oclc/77271226
Of course one of the big projects I work on at LC is Chronicling America, which is currently excluded by LC's robots.txt… so I know that there can be real reasons for restricting crawling access (in our case performance problems we are trying to fix).
Oh gosh, I just noticed when re-reading the Guardian article
that my lcsh.info experiment was mentioned. Hopefully there will be good
news to report from LC on this front shortly.