the importance of being crawled
While lcsh.info was up and running, harvesters actively crawled it. At its core, all lcsh.info did was mint a URI for every Library of Congress Subject Heading. This is similar in spirit to Brewster Kahle's more ambitious OpenLibrary project to mint a URI for every book, or in his words:
One web page for every book
Aside: It's also similar in spirit to RESTful web development, and to the linked data, semantic web effort generally.
Minting a URI for every Library of Congress Subject Heading meant that there were lots of densely interlinked pages. Some researchers at Stanford did a data visualization of LCSH two years ago, which illustrates just how deeply linked LCSH is:
I wanted lcsh.info to get crawled, so I intentionally put some high level, well connected concepts (Humanities, Science, etc.) on the home page to provide a doorway for web crawlers to walk through into the site and begin discovering all the broader, narrower, and related links between concepts, without having to perform a search.
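To make the doorway idea concrete, here is a minimal sketch of what a crawler does once it lands on the home page. The link graph is a made-up toy (only Humanities, Science and Mathematics are names from this post; the rest, including the `crawl` function itself, are hypothetical), and it says nothing about how any real search engine's crawler is implemented.

```python
from collections import deque

# Hypothetical toy graph: each page maps to the pages it links to
# (broader, narrower and related links lumped together).
LINKS = {
    "home": ["Humanities", "Science"],
    "Humanities": ["Philosophy"],
    "Science": ["Mathematics"],
    "Mathematics": ["Algebra", "Science"],
    "Philosophy": ["Humanities"],
    "Algebra": ["Mathematics"],
    "Orphan heading": [],  # nothing links here, so a crawler never finds it
}


def crawl(start, links):
    """Breadth-first walk; returns pages in the order they are discovered."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


reached = crawl("home", LINKS)
print(reached)
```

Everything reachable from the home page gets discovered by following links alone, while the unlinked heading stays invisible. That is why a couple of well connected concepts on the front page go a long way.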
So lcsh.info is down now, but it turns out you can still see its shadow living on in quite a usable form in web search engines. For example, type this into any of the big three search engines:
site:lcsh.info mathematics
And you'll see:
Yahoo
Microsoft
It's interesting that (unlike Google and Yahoo) Microsoft's relevancy ranking actually puts the heading for "Mathematics" at the top. Also note that simple things, like giving the page a good title and descriptive text, make the heading show up in usable form in each search engine.
It's not too surprising that trying the same for authorities.loc.gov doesn't work out so well. Umm, yeah http://authorities.loc.gov/robots.txt…
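For anyone who hasn't looked at how a robots.txt shuts crawlers out, here is a small sketch using Python's standard `urllib.robotparser`. Both robots.txt bodies below are assumptions written for illustration: the first is a blanket "keep out" file, not a copy of what authorities.loc.gov actually serves, and the second is a hypothetical permissive file. The `allowed` helper is also just a name chosen for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away every crawler (illustrative only;
# not the real contents of the file at authorities.loc.gov).
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical permissive robots.txt: an empty Disallow permits everything.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def allowed(robots_txt, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked_site = allowed(BLOCK_ALL, "http://authorities.loc.gov/")
open_site = allowed(ALLOW_ALL, "http://lcsh.info/")
print(blocked_site, open_site)
```

A well behaved crawler runs a check like this before requesting a page, so a blanket `Disallow: /` means nothing on the site ever makes it into a search index.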
On the one hand, I'm just being nostalgic looking at the content that once was there &sigh;. But on the other, there seems to be a powerful message here: putting data out onto the open web, and making it crawlable, means your content is viewable via lots of different lenses. Maybe you don't have to get search exactly right on your website; let other people do it for you.
Two other things come to mind: LOCKSS and Brewster's even more ambitious project. I've been sort of hoping that somehow or another the Internet Archive and the Open Library would find their way into being publicly funded projects. What if? I can daydream, right?