wee bit

As is my custom, this morning I asked Zoia (the bot in #code4lib) for this day in history from the Computer History Museum. Lately I’ve been filtering it through the Pirate plugin, which transforms arbitrary text into something a pirate might say. Anyhow, today’s was pretty humorous.

11:32 < edsu> @pirate [tdih]
11:32 < zoia> edsu: Claude Shannon be born in Gaylord, Michigan.  Known as th' 
              inventor 'o information theory, Shannon be th' first to use th' 
              word "wee bit."  Shannon, a contemporary 'o Johny-boy von 
              Neumann, Howard Aiken, 'n Alan Turin', sets th' stage fer th' 
              recognition 'o th' basic theory 'o information that could be 
              processed by th' machines th' other pioneers developed.  He 
              investigates information distortion, redundancy 'n noise, 'n (1 
              more message)
11:33 < edsu> @more
11:33 < zoia> edsu: provides a means fer information measurement.  He 
              identifies th' wee bit as th' fundamental unit 'o both data 'n 
              computation.

Happy Birthday Cap’n Shannon.


Dear Footnote Bot

Thanks for taking an interest in the historic content on a website I help run. We want to see the NDNP newspaper content get crawled, indexed and re-purposed in as many places as possible. So we appreciate the time and effort you are spending on getting the OCR XML and JPEG2000 files into Footnote. I am a big fan of Footnote and what you are doing to help historical/genealogical researchers who subscribe to your product.

But since I have your ear, it would be nice if you identified yourself as a bot. Right now you are pretending to be Internet Explorer:

38.101.149.14 - - [22/Apr/2010:18:38:39 -0400] "GET /lccn/sn86069496/1909-09-08/ed-1/seq-8.jp2 HTTP/1.1" 200 3170304 "-" "Internet Explorer 6 (MSIE 6; Windows XP)" "*/*" "-" "No-Cache"

Oh, and could you stop sending the Pragma: No-Cache header with every HTTP request? We have a reverse proxy in front of our dynamic content so that we don’t waste CPU cycles regenerating pages that haven’t changed. It’s what allows us to make our content available to well-behaved web crawlers. But every request you send bypasses our cache and makes our site do extra work.

It’s true, we can ignore your request to bypass our cache. In fact, that’s what we’re doing now. This means we can’t shift-reload in our browser to force the content to refresh–but we’ll manage. Maybe you could be a good citizen of the Web and send an If-Modified-Since header–or perhaps just not send Pragma: No-Cache?

Identifying yourself with a User-Agent string like “footbot/0.1 +(http://footnote.com/footbot)” would be neighborly too :-)
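
In case it helps, here is a rough sketch (in Python, keeping the hypothetical footbot name from above; the host and timestamp are illustrative) of what a friendlier request would look like: identify yourself, drop the Pragma: No-Cache header, and make the request conditional so unchanged files come back as a cheap 304:

# a sketch of a politer crawl request; the bot name/URL, host and timestamp
# are made up for illustration (the path comes from the log entry above)
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://chroniclingamerica.loc.gov/lccn/sn86069496/1909-09-08/ed-1/seq-8.jp2",
    headers={
        # say who you are and where to read about the bot
        "User-Agent": "footbot/0.1 (+http://footnote.com/footbot)",
        # only ask for the file if it changed since the last crawl
        "If-Modified-Since": "Thu, 22 Apr 2010 18:38:39 GMT",
        # and no Pragma: No-Cache, so the reverse proxy can do its job
    },
)

try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, len(resp.read()), "bytes")
except urllib.error.HTTPError as e:
    if e.code == 304:
        print("not modified since last crawl, nothing to fetch")
    else:
        raise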

Yours Sincerely,
Ed

PS

ed@curry:~$ whois 38.101.149.14
...
%rwhois V-1.5:0010b0:00 rwhois.cogentco.com
38.101.149.14
network:ID:NET4-2665950018
network:Network-Name:NET4-2665950018
network:IP-Network:38.101.149.0/24
network:Postal-Code:84042
network:State:UT
network:City:Linden
network:Street-Address:355 South 520 West
network:Org-Name:iArchives Inc dba Footnote
network:Tech-Contact:ZC108-ARIN
network:Updated:2008-05-21 13:05:26
network:Updated-by:Gus Reese


research ideas for library linked data

The past few weeks have seen some pretty big news for Library Linked Data. On April 7th the Hungarian National Library announced that its entire library catalog, digital library holdings, and name/subject authority data are now available as Linked Data. Then just a bit more than a week later, on April 16th the German National Library announced that it was making its name and subject authority files available as Linked Data.

This adds to the pioneering work that the Royal Library of Sweden has already done in making all of its catalog and authority data available, which they announced almost two years ago now. Add to this that OCLC is also publishing the Virtual International Authority File as Linked Data, and that the Library of Congress also makes its subject authority data available as Linked Data, and things are starting to get interesting.

About 16 months ago, at the Dublin Core Conference in Berlin, Alistair Miles predicted that we’d see several implementations of Linked Data at major libraries within the year. I must admit, while I was sympathetic to the cause, I was also pretty skeptical that this would come to pass. But here we are, just a bit past a year later, and two national libraries and a major library data distributor have decided to publish some of their data assets as Linked Data.

Hey Al, crow never tasted so good…

So now it’s starting to feel like there’s enough extant library Linked Data to start looking at patterns of usage, to see if there are any emerging best practices we could work towards. In particular I think it would be interesting to take a look at:

  • What vocabularies are being used, and is there emerging consensus about which to use? (a rough way to check this is sketched just after this list)
  • What licenses (if any) are associated with the data?
  • How much linking and interlinking is going on?
  • What sorts of mechanisms does the publisher offer for getting the data: sitemap, feeds, SPARQL, bulk download?
  • What is the quality of the data: granularity, link integrity, vocabulary usage?
  • What approaches to identifiers for “real world things” have publishers taken: hash, slash, 303, PURLs, reuse of traditional identifiers, etc.
  • What are the relative sizes of the pools of library linked data?
  • How are updates being managed?
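
To make the first question concrete, here’s a rough sketch (Python with rdflib; the URI is just a placeholder, and it assumes the resource serves a serialization rdflib recognizes) of tallying which vocabulary namespaces a published Linked Data resource actually uses:

# a rough sketch: dereference a Linked Data URI and count which vocabulary
# namespaces its predicates come from; the URI below is only a placeholder
from collections import Counter

from rdflib import Graph
from rdflib.namespace import split_uri

g = Graph()
# swap in any dereferenceable URI, e.g. an authority record or catalog record
g.parse("http://example.org/authorities/names/n0000000")

vocabs = Counter()
for _s, p, _o in g:
    try:
        namespace, _local = split_uri(p)
    except ValueError:
        namespace = str(p)
    vocabs[namespace] += 1

for namespace, count in vocabs.most_common():
    print(f"{count:6d}  {namespace}")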

Tomorrow I’m meeting with some folks at the Metadata Research Center at the School of Information and Library Science at the University of North Carolina to talk about their HIVE project. Barbara Tillett and Libby Dechman of LC are also here to talk about the use of LCSH, VIAF and RDA. I’m hoping to convince some of the folks at the MRC that answering some of these questions about the use of Linked Data in libraries could be valuable to the library research community. The rumored W3C Incubator Group for Cultural Heritage Institutions and the Semantic Web couldn’t come at a better time.


history and genealogy at semwebdc


spine CC BY 2.0

Last week’s Washington DC Semantic Web Meetup focused on History and Genealogy Semantics. It was a pretty small, friendly crowd (about 15-20) that met for the first time at the Library of Congress. The group included folks from PBS, the National Archives, the Library of Congress, and the Center for History and New Media–as well as some regulars from the Washington DC SGML/XML Users Group.

Brian Eubanks gave a presentation on what the Semantic Web, Linked Data, and specifically RDF and Named Graphs have to offer genealogical research. He took us on a tour through a variety of websites, such as the Land Records Database at the Bureau of Land Management, Ancestry.com, Footnote and Google Books, and made a strong case for using RDF to link these sorts of documents with a family tree.

As more and more historic records make their way online as Web Documents with URIs, RDF becomes an increasingly useful data model for providing provenance and source information for a family tree. On sites like Ancestry.com it is important to understand the provenance of genealogical assertions, since Ancestry.com allows you to merge other people’s family trees into your own, based on likely common ancestors. In situations like this researchers need to be able to evaluate the credibility or truthfulness of other people’s trees–and being able to source the family tree links to the documents that support them is an essential part of the equation.

Along the way Brian let people know about a variety of vocabularies that are available for making assertions that are of value to genealogical research:

  • rdfcal : for Events
  • BIO : for biographical information
  • Relationship : for describing the links between people
  • FOAF : for describing people
  • TriG : for identifying the assertions that a researcher makes and linking them to a given document

The beautiful thing about RDF for me is that it’s possible to find and use these vocabularies in concert, and I’m not tempted to create the-greatest-genealogy-vocabulary that does it all. In addition, Brian pointed out that sites like dbpedia and geonames are great sources of names (URIs) for people, places and events that can be used in building descriptions. Brian has started the History and Genealogy Semantics Working Group, which has an open membership, and he encourages anyone with an interest in this area to join. While writing this post I happened to run across a Wikipedia page about Family Tree Mapping, which indicated that some genealogical software already supports geocoding family trees. As usual it seems like the geo community is leading the way in making semantics on the web down to earth and practical.

I followed Brian by giving a brief talk about Chronicling America, which is the web front-end for data collected by the National Digital Newspaper Program, which in turn is a joint project of the Library of Congress and the National Endowment for the Humanities. After giving a brief overview of the program, I described how a few factors naturally led us to using Linked Data and embracing a generally RESTful approach.

One thing that I learned during Brian’s presentation is that sites like Footnote are not only going around digitizing historic collections for inclusion in their service, but they also give their subscribers a rich editing environment to search and annotate document text. These annotations are exactly the sort of thing that would be perfect to represent as an RDF graph, if you wanted to serialize the data. In fact the NSF-funded Open Annotation Collaboration project is exploring patterns and emerging best practices in this area. I’ve had it in the back of my mind that allowing users to annotate page content in Chronicling America would be a really nice feature to have. If not at chroniclingamerica.loc.gov proper, then perhaps showing how it could be done by a 3rd party using the API. To some extent we’re already seeing annotation happening in Wikipedia, where people are creating links to newspaper pages and titles in their entries, which we can see in the referrer information in our web server logs. Update: and I just learned that Wikipedia itself provides a service that allows you to discover entries that have outbound links to a particular site, like chroniclingamerica.loc.gov.

Speaking of the API (which really is just REST), if you are interested in learning more about it check out the API document that Dan Chudnov prepared. I also made my slides available; hopefully the speaker notes provide a bit more context for what I talked about when showing images of various things.
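
As a small illustration of the sort of thing the API makes easy (the URL pattern and field names below are my reading of Dan’s document, so treat them as an approximation rather than gospel), a page’s metadata is one JSON view away, which is about all a 3rd-party annotation tool would need to bootstrap itself:

# a sketch of pulling newspaper page metadata out of Chronicling America;
# the .json URL pattern and field names are assumptions based on the API doc,
# so verify them against Dan's write-up before relying on them
import json
import urllib.request

url = "http://chroniclingamerica.loc.gov/lccn/sn86069496/1909-09-08/ed-1/seq-8.json"
with urllib.request.urlopen(url) as resp:
    page = json.load(resp)

# print a few fields a client might care about
print(page.get("sequence"))
print(page.get("issue", {}).get("date_issued"))
for key in ("pdf", "ocr", "jp2"):
    if key in page:
        print(key, page[key])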

Afterwards a bunch of us headed across the street to have a drink. I was really interested to hear from Sam Deng that (like the group I work in at LC) PBS is a big Python and Django shop. We’re going to try to get a little brown bag lunch going between PBS and LC to talk about their use of Django on Amazon EC2, as well as software like Celery for managing asynchronous task queues.

Also, after chatting with Glenn Clatworthy of PBS, I learned that he has been experimenting with making Linked Data views available for their programs. It was great to hear Glenn describe how assigning each program a URI and leveraging the nature of the web would be a perfect fit for distributing data in the PBS enterprise. It makes me think that perhaps having a session on what the BBC are doing with Linked Data would be timely.


full link graph?

Peter Norvig of Google mentioned Linked Data in his Reddit Ask Me Anything interview (thanks Gunnar):

So right from the start researchers are writing code that use our main APIs that
are using the data that everyone else uses. If you want some web pages you use
the full copy of the web. If you want some linked data you use the full link
graph.

Update: Richard Cyganiak correctly points out that Norvig said “link data” not “linked data”. :-) At least we won’t have to ask Google if they are using SPARQL and RDF now …


a middle way for linked data at the bbc

I got the chance to attend the 2nd London Linked Data Meetup that was co-located with dev8d last week, which turned out to be a whole lot of fun. I figured if I waited long enough other people would save me from having to write a good summary/discussion of the event…and they have: thanks Pete Johnston, Ben Summers, Sheila Macneill, Martin Belam and Frankie Roberto.

The main thing that I took away is how much good work the BBC is doing in this space. Given the recent news of cuts at the BBC, it seems like a good time to say publicly how important some of the work they are doing is to the web technology sector. As part of the Meetup Tom Scott gave a presentation on how the BBC are using Linked Data to integrate distinct web properties in the BBC enterprise, like their Programmes and the Wildlife Finder web sites.

The basic idea is that they categorize (dare I say catalog?) television and radio content using wikipedia/dbpedia as a controlled vocabulary. Just doing this relatively simple thing means that they can create another site like the Wildlife Finder that provides a topical guide to the natural world (and also happens to use wikipedia/dbpedia as a controlled vocabulary), which then links to their audio and video content. Since the two sites share a common topic vocabulary, they are able to automatically create links from the topic guides to all the radio and television content that is on a particular topic.
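
Here’s a toy sketch (my own illustration, not the BBC’s implementation; the programme IDs are invented) of why a shared topic vocabulary is all it takes to generate these cross-site links automatically:

# a toy illustration of the shared-vocabulary idea: two sites that tag their
# content with the same dbpedia URIs can derive links between each other with
# a simple join; the programme IDs below are invented
programmes_by_topic = {
    "http://dbpedia.org/resource/Pinus_longaeva": ["/programmes/b00zzzzz"],
    "http://dbpedia.org/resource/Lion": ["/programmes/b00yyyyy", "/programmes/b00xxxxx"],
}

wildlife_pages_by_topic = {
    "http://dbpedia.org/resource/Pinus_longaeva": "/nature/species/Pinus_longaeva",
    "http://dbpedia.org/resource/Lion": "/nature/species/Lion",
}

# every shared topic yields a set of links from the topic guide to the clips
for topic, species_page in wildlife_pages_by_topic.items():
    for programme in programmes_by_topic.get(topic, []):
        print(f"{species_page} -> {programme}  (via {topic})")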

For a practical example, consider this page for the Great Basin Bristlecone Pine:

If you scroll down the page you’ll see a link to a video clip from David Attenborough’s documentary Life, which lives on the Programmes portion of the website. Now take a step back and consider how these are two separate applications in the BBC enterprise that are able to build a rich network of links between each other. It’s the shared controlled vocabulary (in this case dbpedia, derived from wikipedia) which allows them to do this.

If you take a peek at the HTML you’ll see the resource has an alternate RDF version:

<link rel="alternate" type="application/rdf+xml" href="/nature/species/Pinus_longaeva.rdf" />

The Resource Description Framework (RDF) is really just the best data model we have for describing stuff that’s on the Web, and the types of links between resources that are on (and off) the Web. Personally, I prefer to look at RDF as Turtle, which is pretty easily done with Dave Beckett’s handy rapper utility (aptitude install raptor-utils if you are following along at home).

rapper -o turtle http://www.bbc.co.uk/nature/species/Pinus_longaeva

The key bits of the RDF are the description of the Great Basin bristlecone pine:

<http://www.bbc.co.uk/nature/species/Pinus_longaeva>
    rdfs:seeAlso <http://www.bbc.co.uk/nature/species> ;
    foaf:primaryTopic <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> .


web documents and axioms for linked data

A few months ago I took part in a discussion on the pedantic-web list, which started out as a relatively simple question about FOAF usage, and quickly evolved into a conversation about terms people use when talking about Linked Data, and more generally the Web.

I ended up having a very helpful off-list email exchange with Richard Cyganiak (one of the architects of the Linked Data pattern) about some trouble I’ve had understanding what Information Resources and Documents are in the context of Web Architecture. The trouble was in determining whether a collection of physical newspaper pages I was helping put on the web were Information Resources or not. I needed to know because I wanted to identify the newspaper pages with URIs and describe them as Linked Data…and the resolvability of those URIs was largely dependent on how I chose to answer the question.

Richard ended up offering some advice that I’ve since found very useful, and I thought I would transcribe some of it here just in case you might find it useful as well. My apologies to you (and Richard) if some of this seems out of context. It may really only be useful for people who are in the digital library domain, but perhaps it’s useful elsewhere.

On the subject of what a Document is, Richard offered up this way of looking at Web Documents:

The Web is a new, blank information space that is, by definition, disjoint from anything else that exists in the world. By setting up and configuring a web server, you make things pop up in that information space (by creating resolvable URIs). By definition, the things that pop up in the information space are a different beast from anything that existed before. They are web pages. They are not the same as things that exist outside of the space, like files on your hard disk, or newspaper articles.

I would avoid the term “document” when talking about representations. Representations are those ephemeral things that go over the wire. A representation is a “byte streams with a media type (and possibly other meta data)”. When I use the term “HTML document”, I mean a resource, identified by a URI, that has (only) HTML representations.

Richard encouraged me to think in terms of Web Documents and not generic Documents. I was getting tripped up by considering Newspaper Pages as Documents…which of course they are in the general sense, but characterized this way it became clear that the Newspaper Pages are not Web Documents. This view of Web Documents is supported in the Cool URIs for the Semantic Web document that he co-authored.

Richard also included some axioms that underpin how he thinks about resources in the Linked Data view:

I’m using a few rules that I think should be considered axioms of web architecture:

First, if something exists independently from the Web, then it cannot be a Web Document. (hence two resources, one for the newspaper page and one for the web page)

Second, only Web Documents can have representations (hence the need to describe the newspaper page in a web page, rather than directly providing representations of the newspaper page).

I understand these rules as axioms, that is, they should be followed because they make the system work best, not because they somehow follow from the nature of the world (they don’t).
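
To make the two axioms concrete, here is a minimal sketch (my own illustration with hypothetical URIs, not LC’s actual modeling) of minting one URI for the physical newspaper page and another for the web document that describes it, and linking the two:

# a hypothetical sketch of the two-resource pattern: one URI for the physical
# newspaper page (which exists independently of the Web, so it cannot be a Web
# Document) and one for the web document that describes it
from rdflib import Graph, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

# the web document (has representations: HTML, RDF, ...)
doc = URIRef("http://example.org/lccn/sn86069496/1909-09-08/ed-1/seq-8")
# the newspaper page itself (no representations, only descriptions)
page = URIRef("http://example.org/lccn/sn86069496/1909-09-08/ed-1/seq-8#page")

g = Graph()
g.bind("foaf", FOAF)
g.bind("dcterms", DCTERMS)
g.add((doc, FOAF.primaryTopic, page))
g.add((page, DCTERMS.title, Literal("Page 8 of the September 8, 1909 issue")))

print(g.serialize(format="turtle"))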

The pragmatist in me particularly liked how these axioms aren’t supposed to have anything to do with the Real World, but are just ways of thinking about the Web to make it work better. Finally, Richard offered some advice on how to reconcile the REST and Linked Data views on identity:

I make sense of the REST worldview like this: In typical REST, all the URIs always identify web documents. The REST folks might claim that they identify other things, like users or items for sale or places on the earth, but actually they just identify a document that is about that thing. The thing itself doesn’t have an identifier. This is perfectly fine for building certain kinds of systems, so the REST guys actually get away with pretending that the URI identifies the thing. But this doesn’t allow you to do certain things, like using domain-independent vocabularies for metadata and coreference, and you get into deep trouble if you want to use this for describing web pages rather than newspaper pages.

I hope I haven’t taken any liberties quoting my conversation with Richard out of context like this. I mainly wanted to transcribe Richard’s points (which perhaps he has made elsewhere) so that I could revisit them without having to dig through my email archive … Comments welcome!


data.australia.gov.au and rdfa

In my previous blog post I was trying to demonstrate the virtues of data.gov.uk making the descriptions of their datasets available as RDFa. Just this morning I learned from Mark Birbeck that the folks down under at data.australia.gov.au did this last October!

For example this page describing a dataset for public Internet locations has this RDF metadata inside it:

<http://data.australia.gov.au/80>
    cc:attributionName "http://www.centrelink.gov.au/" ;
    cc:attributionURL <http://www.centrelink.gov.au/> ;
    dc:coverage.geospatial "Australia" ;
    dc:coverage.temporal "Not specified" ;
    dc:creator "Centrelink" ;
    dc:date.modified "2009-08-31"^^xsd:date ;
    dc:date.published "2009-08-31"^^xsd:date ;
    dc:description """<p xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Location of Centrelink Offices</p>"""^^rdf:XMLLiteral ;
    dc:identifier "80" ;
    dc:keywords """<a href="http://data.australia.gov.au/tag/social-security" rel="tag" xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Social Security</a>"""^^rdf:XMLLiteral ;
    dc:license """<a href="http://creativecommons.org/licenses/by/2.5/au/" rel="licence" xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml"><img alt="Creative Commons License" class="licence" src="http://i.creativecommons.org/l/by/2.5/au/88x31.png"/>Creative Commons - Attribution 2.5 Australia (CC-BY)</a>"""^^rdf:XMLLiteral ;
    dc:source """<a href="http://www.centrelink.gov.au/" rel="dc:source" xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml"/>"""^^rdf:XMLLiteral ;
    dc:subject """<a href="http://data.australia.gov.au/catalogue/community" rel="category tag" title="View all posts in Community" xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Community</a>, <a href="http://data.australia.gov.au/catalogue/employment" rel="category tag" title="View all posts in Employment" xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Employment</a>, <a href="http://data.australia.gov.au/catalogue/government" rel="category tag" title="View all posts in Government" xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Government</a>"""^^rdf:XMLLiteral ;
    dc:title "Location of Centrelink Offices" ;
    dc:type <http://purl.org/dc/dcmitype/Text> ;
    agls:jurisdiction "[Commonwealth of] Australia (AU)" ;
    ...


data.gov.uk and rdfa

The recent public release of the UK Government’s data.gov.uk site got picked up by the press last week in articles at The Guardian, Prospect Magazine and elsewhere. These have been supplemented by some more technical discussions at ReadWriteWeb, Open Knowledge Foundation, Talis, Jeni Tennison’s blog, and some helpful emails from Leigh Dodds (Talis) and Jonathan Gray (Open Knowledge Foundation) on the W3C eGovernment discussion list.

One thing that I haven’t seen mentioned so far in public (which I just discovered today) is that data.gov.uk is using RDFa to expose metadata about the datasets in a machine-readable way. What this means is that in an HTML page for a dataset like this one there are some extra HTML attributes (about, property, rel) that have been thoughtfully used to express some structured metadata about the dataset, which can be extracted from the HTML and expressed, say, as Turtle:

<http://data.gov.uk/id/dataset/agricultural_market_reports> dct:coverage "Great Britain (England, Scotland, Wales)"@en ;
     dct:created "2009-12-04"@en ;
     dct:creator "Department for Environment, Food and Rural Affairs"@en ;
     dct:isReferencedBy <http://data.gov.uk/wiki/index.php/Package:agricultural_market_reports> ; 
     dct:license "Crown Copyright"@en ;
     dct:source <http://statistics.defra.gov.uk/esg/publications/amr/default.asp>, <https://statistics.defra.gov.uk/esg/publications/amr/default.asp> ;
     dct:subject
         <http://data.gov.uk/data/tag/agriculture>,
         <http://data.gov.uk/data/tag/agriculture-and-environment>,
         <http://data.gov.uk/data/tag/environment>,
         <http://data.gov.uk/data/tag/farm-business>,
         <http://data.gov.uk/data/tag/farm-businesses>,
         <http://data.gov.uk/data/tag/farming> .

In fact, since data.gov.uk has a nice paging mechanism that lists all the datasets, it’s not hard to write a little script that scrapes all the metadata for the datasets (35,478 triples) right out of the web pages.
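
The little script doesn’t need to be much more than a loop over those pages; here’s a rough sketch of the per-page step (shelling out to the rapper utility mentioned earlier, which can parse RDFa when raptor is built with that support; the dataset page URL is a placeholder):

# a sketch of scraping the RDFa out of one dataset page with rapper
# (assumes `rapper -i rdfa` works in your raptor build); a full crawl would
# loop this over the paging mechanism, and the URL below is just a placeholder
import subprocess

def rdfa_triples(url):
    result = subprocess.run(
        ["rapper", "-i", "rdfa", "-o", "ntriples", url],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]

triples = rdfa_triples("http://data.gov.uk/dataset/agricultural_market_reports")
print(len(triples), "triples")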

I also noticed via Stéphane Corlosquet today that data.gov.uk is using the Drupal open-source content management system. To what extent Drupal7’s new RDFa features are being used to layer in this RDFa isn’t clear to me. But it is an exciting development. It’s exciting because data.gov.uk is a great example of how to bubble up data that’s typically locked away in databases of some kind into the HTML that’s out on the web for people to interact with, and for crawlers to crawl and re-purpose.

For example, I can now write a utility to check the status of the external dataset links, to make sure they are there (200 OK). The complete results by URL can be summarized by rolling up by status code:

Status Code                                                 Number of Datasets
200                                                                       2977
404                                                                        106
502                                                                         23
503                                                                         14
[Errno socket error] [Errno -2] Name or service not known                    8
500                                                                          3
nonnumeric port: ''                                                          1
[Errno socket error] [Errno 110] Connection timed out                        1
400                                                                          1
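
The rollup above came out of a pretty simple loop; here is a sketch of the idea (not the exact script; dataset_urls stands in for the link list pulled from the scraped dct:source values):

# a sketch of the status-code rollup: HEAD each dataset link and tally what
# comes back; dataset_urls is assumed to have been pulled from the scraped
# dct:source triples, with one real entry shown as a stand-in
from collections import Counter
import urllib.error
import urllib.request

def status(url):
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError as e:
        return str(e.reason)

dataset_urls = ["http://statistics.defra.gov.uk/esg/publications/amr/default.asp"]
tally = Counter(status(url) for url in dataset_urls)
for code, count in tally.most_common():
    print(code, count)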

Or I can generate a list of dataset subjects (even though it’s already available, I guess). Here are the top 25:

Subject                                    Number of Datasets
health                                                    645
care                                                      427
child                                                     398
population                                                341
children                                                  295
school                                                    273
health-and-social-care                                    271
health-well-being-and-care                                205
economy                                                   202
economics-and-finance                                     189
census                                                    188
education                                                 176
communities                                               154
benefit                                                   153
road                                                      144
children-education-and-skills                             121
people-and-places                                         111
government-receipts-and-expenditure                       110
education-and-skills                                      110
housing                                                   108
environment                                               107
tax                                                       107
life-in-the-community                                     106
employment                                                105
tax-credit                                                 96

I realize it’s early days but here are a few things it would be fun to see at data.gov.uk:

  • add some RDFa with SKOS or CommonTag to tag pages like education: this would allow things to be hooked up a bit more explicitly, tags to be given nice labels, and encourage the reuse of the tagging vocabulary within and outside data.gov.uk
  • link the dataset descriptions to the dataset resources themselves (the PDFs, Excel spreadsheets, etc.) that are online, using a vocabulary like Open Archives Initiative Object Reuse and Exchange and/or POWDER. This would allow for the harvesting and aggregation not only of the metadata, but of the datasets as well.

I imagine much of this sort of hacking around can be enabled by querying the data.gov.uk SPARQL endpoint. But it hasn’t been very clear to me exactly what data is behind it. And there is something comforting about being able to crawl the open web to find the information that’s there in the open to view.


5 Tunes for Gillian

Kesa’s good friend Gillian from college days in NOLA sent around an email asking for people’s favorite five songs of last year.

For some reason picking individual songs is hard for me. I guess because I rarely put on a song, and almost always put on an album–as antiquated as that sounds. I do occasionally listen to suggestions on last.fm or random songs in my player-du-jour – but then I don’t really remember the song names.

Anyhow here’s the list I cobbled together, with links out to youtube (that’ll probably break in 28 hrs):