the importance of being crawled

While lcsh.info was up and running, harvesters actively crawled it. At its core, all lcsh.info did was mint a URI for every Library of Congress Subject Heading. This is similar in spirit to Brewster Kahle’s more ambitious OpenLibrary project to mint a URI for every book, or in his words:

One web page for every book

Aside: It’s also similar in spirit to RESTful web development, and to the linked data, semantic web effort generally.

Minting a URI for every Library of Congress Subject Heading meant that there were lots of densely interlinked pages. Some researchers at Stanford did a data visualization of LCSH two years ago, which illustrates just how deeply linked LCSH is:

I wanted lcsh.info to get crawled, so I intentionally put some high-level, well-connected concepts (Humanities, Science, etc.) on the home page. They gave web crawlers a doorway to walk through into the site and begin discovering all the broader, narrower, and related links between concepts, without having to perform a search.
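To make the doorway idea concrete, here is a minimal sketch of how a crawler discovers pages purely by following links from a home page. The concept names and the link graph below are made up for illustration (they are not the real LCSH data), and `crawl` is a hypothetical helper, not anything lcsh.info or a real search engine actually ran.

```python
from collections import deque

# Hypothetical toy link graph: each page maps to the pages it links to.
# "Home" plays the role of the home page with a few well-connected concepts.
LINKS = {
    "Home": ["Humanities", "Science"],
    "Humanities": ["Philosophy", "History"],
    "Science": ["Mathematics", "Physics"],
    "Mathematics": ["Algebra", "Science"],
    "Physics": ["Mathematics"],
    "Philosophy": [],
    "History": [],
    "Algebra": ["Mathematics"],
    # Links out to the graph, but nothing links to it.
    "Orphan": ["Mathematics"],
}


def crawl(start, links):
    """Breadth-first walk over links, returning pages in discovery order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    print(crawl("Home", LINKS))
```

In this toy graph every concept reachable from the home page gets visited, while "Orphan" is never found because no page links to it. That is the case the well-connected headings on the home page were meant to avoid.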

So lcsh.info is down now, but it turns out you can still see its shadow living on in quite a usable form in web search engines. For example, type this into any of the big three search engines:

site:lcsh.info mathematics

And you’ll see:

Google

Yahoo



Microsoft



It’s interesting that (unlike Google and Yahoo) Microsoft’s relevancy ranking actually puts the heading for “Mathematics” at the top. Also note that simple things like giving the page a good title and descriptive text make the heading show up in usable form in each search engine.

It’s not too surprising that trying the same for authorities.loc.gov doesn’t work out so well. Umm, yeah http://authorities.loc.gov/robots.txt
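For anyone curious how a well-behaved crawler reads such a file, here is a small sketch using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` are an assumed example of a file that shuts out every crawler. They are not copied from the actual authorities.loc.gov file, and the example.org URL and the `is_crawlable` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: block every crawler from the whole site.
# This is NOT the real authorities.loc.gov robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.org/subjects/mathematics"  # hypothetical URL
    print(is_crawlable(ROBOTS_TXT, url))  # blocked by "Disallow: /"
    print(is_crawlable("", url))          # no rules, so nothing is blocked
```

With rules like these, a polite crawler never gets past the front door, so there is nothing for a search engine to index.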

On the one hand, I’m just being nostalgic looking at the content that once was there &sigh;. But on the other, there seems to be a powerful message here: putting data out onto the open web and making it crawlable means your content is viewable via lots of different lenses. Maybe you don’t have to get search exactly right on your website. Let other people do it for you.

Two other things come to mind: LOCKSS and Brewster’s even more ambitious project. I’ve been sort of hoping that somehow or another the Internet Archive and the Open Library would find their way into being publicly funded projects. What if? I can daydream, right?