scoping intertwingularity

Dan Brickley’s recent post to the public-lod discussion list about the future of RDF is one of the best articulations of why I appreciate the practice of linking data:

And why would anyone care to get all this semi-related, messy Web data? Because problems don’t come nicely scoped and packaged into cleanly distinct domains. Whenever you try to solve one problem, it borders on a dozen others that are a higher priority for people elsewhere. You think you’re working with ‘events’ data but find yourself with information describing musicians; you think you’re describing musicians, but find yourself describing digital images; you think you’re describing digital images, but find yourself describing geographic locations; you think you’re building a database of geographic locations, and find yourself modeling the opening hours of the businesses based at those locations. To a poet or idealist, these interconnections might be beautiful or inspiring; to a project manager or product manager, they are as likely to be terrifying.

Any practical project at some point needs to be able to say “Enough with all this intertwingularity! this is our bit of the problem space, and forget the rest for now”. In those terms, a linked Web of RDF data provides a kind of safety valve. By dropping in identifiers that link to a big pile of other people’s data, we can hopefully make it easier to keep projects nicely scoped without needlessly restricting future functionality. An events database can remain an events database, but use identifiers for artists and performers, making it possible to filter events by properties of those participants. A database of places can be only a link or two away from records describing the opening hours or business offerings of the things at those places. Linked Data (and for that matter FOAF…) is fundamentally a story about information sharing, rather than about triples. Some information is in RDF triples; but lots more is in documents, videos, spreadsheets, custom formats, or [hence FOAF] in people’s heads.

Dan’s description is also a nice illustration of how the web can help us avoid Yak Shaving, by leveraging the work of others:

Any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you’re working on.

I’m just stashing that away here so I can find it again when I need it. Thanks danbri!


Confessions of a Graph Addict

Today I’m going to be at the annual conference of the American Library Association for a pre-conference about Libraries and Linked Data. I’m going to try talking about how Linked Data, and particularly the graph data structure, fits the way catalogers have typically thought about bibliographic information. Along the way I’ll include some specific examples of Linked Data projects I’ve worked on at the Library of Congress–and gesture at work that remains to be done.

Tomorrow there’s an unconference-style event at ALA to explore what Linked Data means for Libraries. The pre-conference today is booked up, but the event tomorrow is open to the public, so please consider dropping by if you are interested and in the DC area.


bibliographic records on the web

There are a couple of interesting threads (disclaimer: I inadvertently started one) going on over on the Open Library technical discussion list about making Linked Data views available for authors. Since the topic was largely how to model people, part of the discussion spilled over to foaf-dev (also my fault).

When making library Linked Data available my preference has been to follow the lead of Martin Malmsten, Anders Söderbäck and the Royal Library of Sweden by modeling authors as People using the FOAF vocabulary:

<http://libris.kb.se/resource/auth/317488>
    libris:key "Berners-Lee, Tim" ;
    a foaf:Person ;
    rdfs:isDefinedBy <http://libris.kb.se/data/auth/317488> ;
    skos:exactMatch <http://viaf.org/viaf/23002995> ;
    foaf:name "Berners-Lee, Tim", "Lee, Tim Berners-", "Tim Berners- Lee", "Tim Berners-Lee" .

It seems sensible enough, right? But there is some desire in the library community to model an author as a Bibliographic Resource and then relate this resource to a Person resource. While I can understand wanting this level of indirection to assert a bit more control, and possibly to use some emerging vocabularies for RDA, I think (for now) using something like FOAF for modeling authors as people is a good place to start.

It will engage folks from the FOAF community who understand RDF and Linked Data, and get them involved in the Open Library Project. It will make library data fit in with other Linked Data out on the web. Plus, it just kind of fits my brain better to think of authors as people…isn’t that what libraries were trying to do all along with their authority data? I’m not saying that FOAF will have everything the library world needs (it won’t), but it’s an open world and we can add stuff that we need, collaborate, and make it a better place.
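Just to make the “fits in with other Linked Data out on the web” part concrete, here’s a small rdflib sketch. It’s only a sketch: it parses a trimmed copy of the Libris description above (with the standard prefixes filled in, and the libris-specific key left out since I don’t want to guess its namespace) and pulls out the links tying the author record to other data on the web:

import rdflib
from rdflib.namespace import RDFS, SKOS

# A trimmed copy of the Libris description quoted above, as Turtle.
data = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://libris.kb.se/resource/auth/317488>
    a foaf:Person ;
    rdfs:isDefinedBy <http://libris.kb.se/data/auth/317488> ;
    skos:exactMatch <http://viaf.org/viaf/23002995> ;
    foaf:name "Berners-Lee, Tim", "Tim Berners-Lee" .
"""

graph = rdflib.Graph()
graph.parse(data=data, format="turtle")

author = rdflib.URIRef("http://libris.kb.se/resource/auth/317488")
for match in graph.objects(author, SKOS.exactMatch):
    print("same person as:", match)   # the VIAF URI
for doc in graph.objects(author, RDFS.isDefinedBy):
    print("described by:", doc)       # the Libris record document

That skos:exactMatch hop over to VIAF is exactly the kind of thing that lets an author description in one system join the wider graph without that system having to say everything itself.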

Anyway, that’s not really what I wanted to talk about here. Over the course of this discussion Erik Hetzner raised what I thought was an important question:

Are you saying that there is a usable distinction between:

  1. a bibliographic record, and
  2. the data contained in that bibliographic record?

    From above, my first notion would be to model things as, in
    pseudo-Turtle:

    <Victor Hugo> a frbr:Person .
    <Victor Hugo> rdfs:isDefinedBy <bib record> .
    <bib record> dc:modified "…"^^xsd:date .

    But it seems to me that you are adding a further distinction:

    <Victor Hugo> a frbr:Person .
    <Victor Hugo> rdfs:isDefinedBy <bib record> .
    <bib record> rdfs:isDefinedBy <bib record data> .
    <bib record data> dc:modified "…"^^xsd:date .

    Is this a usable or useful distinction? Are there times when we want to distinguish between the abstract bibliographic record and the representation of a bibliographic record? In linked data-speak, is a bibliographic record a non-information resource? My thinking has been that a bibliographic record is an information resource, and that one does not need to distinguish between (1) and (2) above.

I think it’s an important question because I don’t think it’s really been discussed much before, and it has a direct impact on what sort of URL you can use to identify a Bibliographic Record, and what sort of HTTP response a client gets when it is resolved. This is the httpRange-14 issue, which is covered in Cool URIs for the Semantic Web. If a Bibliographic Record is an Information Resource then it’s OK to identify the record with any old URL, and for the server to say 200 OK like normal. If it’s not an Information Resource then the URL should either have a hash fragment in it, or the server should respond 303 See Other and redirect to another location.
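For the curious, it’s easy enough to see which way a given server answers. Here’s a minimal sketch using Python’s requests library; the LCCN permalink is the same record used in the curl example below:

import requests

# Ask for the URI without following redirects so we can see the raw answer.
# Per httpRange-14: a 200 means the URI identifies an information resource
# (a document); a 303 See Other points off to a description of something else.
response = requests.get("http://lccn.loc.gov/99027665", allow_redirects=False)

if response.status_code == 200:
    print("200 OK: the URI identifies an information resource / web document")
elif response.status_code == 303:
    print("303 See Other ->", response.headers.get("Location"))
else:
    print("got", response.status_code)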

In my view if a Bibliographic Record is on the web with a URL, it is useful to think of it as an Information Resource…or (as Richard Cyganiak dubs it) a Web Document. I don’t think it’s worthwhile philosophizing about this; it’s better to think about it pragmatically. I think it’s useful to consider a URL like

http://lccn.loc.gov/99027665

as being an identifier for a bibliographic record that happens to be in HTML. Likewise, the MODS, Dublin Core and MARCXML views of that same record are identifiers for Bibliographic Records in those formats. It might be useful to link them together, as they are with <link> elements in the HTML, or in some RDF serialization. It also could be useful to treat one as canonical, and content negotiate from one of the URLs (e.g. curl --header "Accept: application/marc+xml" http://lccn.loc.gov/99027665). But I think it simplifies deployment of library Linked Data to think of bibliographic records as things that can be put on the web as documents, without worrying too much about httpRange-14. A nice side effect of this is that it would grandfather in all the OPAC record views out there. Maybe it’ll be useful to distinguish between an abstract notion of a bibliographic record and the actual document that is the bibliographic record–but I’m not seeing it right now…and I think it would introduce a lot of unnecessary complexity in this fragile formative period for library Linked Data.


the 5 stars of open linked data

While perusing the minutes of today’s w3c egov telecon I noticed mention of Tim Berners-Lee’s Bag of Chips talk at the gov2.0 expo last week in Washington, DC. I actually enjoyed the talk not so much for the bag-of-chips example (which is good), but for the examination of Linked Data as part of a continuum of web publishing activities associated with gold stars, like the ones you got in school. Here they are:

  1. Make your stuff available on the web (whatever format) under an open license.
  2. Make it available as machine-readable structured data (e.g. a spreadsheet instead of a scanned image of a table).
  3. Use a non-proprietary format (e.g. CSV instead of Excel).
  4. Use URIs to identify things, so that people can point at your stuff.
  5. Link your data to other people’s data to provide context.


wee bit

As is my custom, this morning I asked Zoia (the bot in #code4lib) for this day in history from the Computer History Museum. Lately I’ve been filtering it through the Pirate plugin, which transforms arbitrary text into something a pirate might say. Anyhow, today’s was pretty humorous.

11:32 < edsu> @pirate [tdih]
11:32 < zoia> edsu: Claude Shannon be born in Gaylord, Michigan.  Known as th' 
              inventor 'o information theory, Shannon be th' first to use th' 
              word "wee bit."  Shannon, a contemporary 'o Johny-boy von 
              Neumann, Howard Aiken, 'n Alan Turin', sets th' stage fer th' 
              recognition 'o th' basic theory 'o information that could be 
              processed by th' machines th' other pioneers developed.  He 
              investigates information distortion, redundancy 'n noise, 'n (1 
              more message)
11:33 < edsu> @more
11:33 < zoia> edsu: provides a means fer information measurement.  He 
              identifies th' wee bit as th' fundamental unit 'o both data 'n 
              computation.

Happy Birthday Cap’n Shannon.


Dear Footnote Bot

Thanks for taking an interest in the historic content on a website I help run. We want to see the NDNP newspaper content get crawled, indexed and re-purposed in as many places as possible. So we appreciate the time and effort you are spending on getting the OCR XML and JPEG2000 files into Footnote. I am a big fan of Footnote and what you are doing to help historical/genealogical researchers who subscribe to your product.

But since I have your ear, it would be nice if you identified yourself as a bot. Right now you are pretending to be Internet Explorer:

38.101.149.14 - - [22/Apr/2010:18:38:39 -0400] "GET /lccn/sn86069496/1909-09-08/ed-1/seq-8.jp2 HTTP/1.1" 200 3170304 "-" "Internet Explorer 6 (MSIE 6; Windows XP)" "*/*" "-" "No-Cache"

Oh, and could you stop sending the Pragma: No-Cache header with every HTTP request? We have a reverse-proxy in front of our dynamic content so that we don’t waste CPU cycles regenerating pages that haven’t changed. It’s what allows us to make our content available to well behaved web crawlers. But every request you send bypasses our cache, and makes our site do extra work.

It’s true, we can ignore your request to bypass our cache. In fact, that’s what we’re doing now. This means we can’t shift-reload in our browser to force the content to refresh–but we’ll manage. Maybe you could be a good citizen of the Web and send an If-Modified-Since header–or perhaps just not send Pragma: No-Cache?

Identifying yourself with a User-Agent string like “footbot/0.1 +(http://footnote.com/footbot)” would be neighborly too :-)
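For what it’s worth, here is roughly what a friendlier fetch could look like, as a little Python sketch. The User-Agent value is just the hypothetical string suggested above, and the URL pairs the path from your request with the Chronicling America hostname where the content lives:

import requests

# A friendlier request: say who you are, skip Pragma: No-Cache entirely, and
# make a conditional GET so the reverse proxy can answer 304 Not Modified
# instead of regenerating the page.
headers = {
    "User-Agent": "footbot/0.1 +(http://footnote.com/footbot)",
    "If-Modified-Since": "Thu, 22 Apr 2010 22:38:39 GMT",  # e.g. Last-Modified from an earlier fetch
}

url = "http://chroniclingamerica.loc.gov/lccn/sn86069496/1909-09-08/ed-1/seq-8.jp2"
response = requests.get(url, headers=headers)
print(response.status_code)  # 304 means the copy you already have is still good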

Yours Sincerely,
Ed

PS

ed@curry:~$ whois 38.101.149.14
...
%rwhois V-1.5:0010b0:00 rwhois.cogentco.com
38.101.149.14
network:ID:NET4-2665950018
network:Network-Name:NET4-2665950018
network:IP-Network:38.101.149.0/24
network:Postal-Code:84042
network:State:UT
network:City:Linden
network:Street-Address:355 South 520 West
network:Org-Name:iArchives Inc dba Footnote
network:Tech-Contact:ZC108-ARIN
network:Updated:2008-05-21 13:05:26
network:Updated-by:Gus Reese


research ideas for library linked data

The past few weeks have seen some pretty big news for Library Linked Data. On April 7th the Hungarian National Library announced that its entire library catalog, digital library holdings, and name/subject authority data are now available as Linked Data. Then just a bit more than a week later, on April 16th the German National Library announced that it was making its name and subject authority files available as Linked Data.

This adds to the pioneering work that the Royal Library of Sweden has already done in making all of its catalog and authority data available, which they announced almost two years ago now. Add to this that OCLC is also publishing the Virtual International Authority File as Linked Data, and that the Library of Congress also makes its subject authority data available as Linked Data and things are starting to get interesting.

About 16 months ago at the Dublin Core Conference in Berlin, Alistair Miles predicted that we’d see several implementations of Linked Data at major libraries within the year. I must admit, while I was sympathetic to the cause, I was also pretty skeptical that this would come to pass. But here we are, just a bit past a year, and two national libraries and a major library data distributor have decided to publish some of their data assets as Linked Data.

Hey Al, crow never tasted so good…

So now it’s starting to feel like there’s enough extant library Linked Data to start looking at patterns of usage, to see if there are any emerging best practices we could work towards. In particular I think it would be interesting to take a look at:

  • What vocabularies are being used, and is there emerging consensus about which to use? (a rough way to check this is sketched just after this list)
  • What licenses (if any) are associated with the data?
  • How much linking and interlinking is going on?
  • What sorts of mechanisms does the publisher offer for getting the data: sitemap, feeds, SPARQL, bulk download?
  • What is the quality of the data: granularity, link integrity, vocabulary usage?
  • What approaches to identifiers for “real world things” have publishers taken: hash, slash, 303, PURLs, reuse of traditional identifiers, etc.?
  • What are the relative sizes of the pools of library linked data?
  • How are updates being managed?
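On that first question, even a crude tally of predicate namespaces says a lot about which vocabularies a dataset leans on. Here’s a minimal rdflib sketch; the filename is hypothetical, so point it at whichever dump or resource you are curious about:

import rdflib
from collections import Counter

def namespace(uri):
    """Reduce a predicate URI to its vocabulary namespace."""
    uri = str(uri)
    if "#" in uri:
        return uri.rsplit("#", 1)[0] + "#"
    return uri.rsplit("/", 1)[0] + "/"

graph = rdflib.Graph()
graph.parse("authority-dump.nt", format="nt")  # hypothetical local N-Triples dump

counts = Counter(namespace(p) for s, p, o in graph)
for ns, n in counts.most_common():
    print(n, ns)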

Tomorrow I’m meeting with some folks at the Metadata Research Center at the School of Information and Library Science at the University of North Carolina to talk about their HIVE project. Barbara Tillett and Libby Dechman of LC are also here to talk about the use of LCSH, VIAF and RDA. I’m hoping to convince some of the folks at the MRC that answering some of these questions about the use of Linked Data in libraries could be valuable to the library research community. The rumored W3C Incubator Group for Cultural Heritage Institutions and the Semantic Web couldn’t come at a better time.


history and genealogy at semwebdc


[image: spine, CC BY 2.0]

Last week’s Washington DC Semantic Web Meetup focused on History and Genealogy Semantics. It was a pretty small, friendly crowd (about 15-20) that met for the first time at the Library of Congress. The group included folks from PBS, the National Archives, the Library of Congress, and the Center for History and New Media–as well as some regulars from the Washington DC SGML/XML Users Group.

Brian Eubanks gave a presentation on what the Semantic Web, Linked Data and specifically RDF and Named Graphs have to offer genealogical research. He took us on a tour through a variety of websites, such as the Land Records Database at the Bureau of Land Management, Ancestry.com, Footnote and Google Books, and made a strong case for using RDF to link these sorts of documents with a family tree.

As more and more historic records make their way online as Web Documents with URIs, RDF becomes an increasingly useful data model for providing provenance and source information for a family tree. On sites like Ancestry.com it is important to understand the provenance of genealogical assertions, since Ancestry.com allows you to merge other people’s family trees into your own, based on likely common ancestors. In situations like this researchers need to be able to evaluate the credibility or truthfulness of other people’s trees–and being able to source the assertions in a family tree by linking them to the documents that support them is an essential part of the equation.
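Here’s a rough sketch of what that kind of sourcing could look like with named graphs, using rdflib from Python. All of the URIs are made up for illustration, and dcterms:source is just one plausible way to point a graph of assertions at the document that backs it up:

from rdflib import Dataset, Namespace, URIRef, Literal
from rdflib.namespace import RDF, FOAF, DCTERMS

EX = Namespace("http://example.org/genealogy/")  # hypothetical namespace

ds = Dataset()

# Put one researcher's assertions into their own named graph...
claims = ds.graph(URIRef("http://example.org/graphs/claim-1"))
claims.add((EX["victor-hugo"], RDF.type, FOAF.Person))
claims.add((EX["victor-hugo"], FOAF.name, Literal("Victor Hugo")))

# ...then describe the graph itself: which document supports these claims.
ds.default_context.add((
    URIRef("http://example.org/graphs/claim-1"),
    DCTERMS.source,
    URIRef("http://example.org/documents/land-record-123"),  # hypothetical digitized source
))

print(ds.serialize(format="trig"))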

Along the way Brian let people know about a variety of vocabularies that are available for making assertions that are of value to genealogical research:

  • rdfcal : for Events
  • BIO : for biographical information
  • Relationship : for describing the links between people
  • FOAF : for describing people
  • TriG : a syntax for named graphs, handy for identifying the assertions that a researcher makes and linking them to a given document

The beautiful thing about RDF, for me, is that it’s possible to find and use these vocabularies in concert, and I’m not tempted to create the-greatest-genealogy-vocabulary that does it all. In addition, Brian pointed out that sites like dbpedia and geonames are great sources of names (URIs) for people, places and events that can be used in building descriptions. Brian has started the History and Genealogy Semantics Working Group, which has an open membership, and he encourages anyone with an interest in this area to join. While writing this post I happened to run across a Wikipedia page about Family Tree Mapping, which indicated that some genealogical software already supports geocoding family trees. As usual it seems like the geo community is leading the way in making semantics on the web down to earth and practical.

I followed Brian by giving a brief talk about Chronicling America, which is the web front-end for data collected by the National Digital Newspaper Program, which in turn is a joint project of the Library of Congress and the National Endowment for the Humanities. After giving a brief overview of the program, I described how we were naturally led to using Linked Data and embracing a generally RESTful approach by a few factors:

One thing that I learned during Brian’s presentation is that sites like Footnote are not only going around digitizing historic collections for inclusion in their service, but they also give their subscribers a rich editing environment to search and annotate document text. These annotations are exactly the sort of stuff that would be perfect to represent as an RDF graph, if you wanted to serialize the data. In fact the NSF-funded Open Annotation Collaboration project is exploring patterns and emerging best practices in this area. I’ve had it in the back of my mind that allowing users to annotate page content in Chronicling America would be a really nice feature to have. If not at chroniclingamerica.loc.gov proper, then perhaps showing how it could be done by a 3rd party using the API. To some extent we’re already seeing annotation happening in Wikipedia, where people are creating links to newspaper pages and titles in their entries, which we can see in the referrer information in our web server logs. Update: and I just learned that Wikipedia itself provides a service that allows you to discover entries that have outbound links to a particular site, like chroniclingamerica.loc.gov.

Speaking of the API (which really is just REST), if you are interested in learning more about it, check out the API document that Dan Chudnov prepared. I also made my slides available; hopefully the speaker notes provide a bit more context for what I talked about when showing images of various things.

Afterwards a bunch of us headed across the street to have a drink. I was really interested to hear from Sam Deng that (like the group I work in at LC) PBS is a big Python and Django shop. We’re going to try to get a little brown bag lunch going between PBS and LC to talk about their use of Django on Amazon EC2, as well as software like Celery for managing asynchronous task queues.

Also, after chatting with Glenn Clatworthy of PBS, I learned that he has been experimenting with making Linked Data views available for their programs. It was great to hear Glenn describe how assigning each program a URI, and leveraging the nature of the web, would be a perfect fit for distributing data in the PBS enterprise. It makes me think that perhaps having a session on what the BBC are doing with Linked Data would be timely?


full link graph?

Peter Norvig of Google mentioned Linked Data in his Ask Me Anything interview on Reddit (thanks Gunnar):

So right from the start researchers are writing code that use our main APIs that
are using the data that everyone else uses. If you want some web pages you use
the full copy of the web. If you want some linked data you use the full link
graph.

Update: Richard Cyganiak correctly points out that Norvig said “link data” not “linked data”. :-) At least we won’t have to ask Google if they are using SPARQL and RDF now …


a middle way for linked data at the bbc

I got the chance to attend the 2nd London Linked Data Meetup that was co-located with dev8d last week, which turned out to be a whole lot of fun. I figured if I waited long enough other people would save me from having to write a good summary/discussion of the event…and they have: thanks Pete Johnston, Ben Summers, Sheila Macneill, Martin Belam and Frankie Roberto.

The main thing that I took away is how much good work the BBC is doing in this space. Given the recent news of cuts at the BBC, it seems like a good time to say publicly how important some of the work they are doing is to the web technology sector. As part of the Meetup Tom Scott gave a presentation on how the BBC are using Linked Data to integrate distinct web properties in the BBC enterprise, like their Programmes and the Wildlife Finder web sites.

The basic idea is that they categorize (dare I say catalog?) television and radio content using wikipedia/dbpedia as a controlled vocabulary. Just doing this relatively simple thing means that they can create another site like the Wildlife Finder, which provides a topical guide to the natural world (and also happens to use wikipedia/dbpedia as a controlled vocabulary), that then links to their audio and video content. Since the two sites share a common topic vocabulary, they are able to automatically create links from the topic guides to all the radio and television content on a particular topic.

For a practical example, consider this page for the Great Basin bristlecone pine: http://www.bbc.co.uk/nature/species/Pinus_longaeva

If you scroll down on the page you’ll see a link to a video clip from David Attenborough’s documentary Life, on the Programmes portion of the website. Now take a step back and consider how these are two separate applications in the BBC enterprise that are able to build a rich network of links between each other. It’s the shared controlled vocabulary (in this case dbpedia, derived from wikipedia) which allows them to do this.
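Here’s a toy sketch of the idea in Python, with made-up page and clip identifiers: because both datasets are keyed on the same dbpedia URIs, the cross-site links fall out of a simple join:

# Toy illustration of linking two sites through a shared controlled vocabulary.
# The page paths and clip identifiers are hypothetical; the point is the join.

programme_clips = {
    "http://dbpedia.org/resource/Pinus_longaeva": ["/programmes/clip-1234"],
}

wildlife_pages = {
    "/nature/species/Pinus_longaeva": "http://dbpedia.org/resource/Pinus_longaeva",
}

for page, topic in wildlife_pages.items():
    for clip in programme_clips.get(topic, []):
        print("link", page, "->", clip)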

If you take a peek at the HTML you’ll see the resource has an alternate RDF version:

<link rel="alternate" type="application/rdf+xml" href="/nature/species/Pinus_longaeva.rdf" />

The Resource Description Framework (RDF) is really just the best data model we have for describing stuff that’s on the Web, and the types of links between resources that are on (and off) the Web. Personally, I prefer to look at RDF as Turtle, which is pretty easily done with Dave Beckett’s handy rapper utility (aptitude install raptor-utils if you are following along from home).

rapper -o turtle http://www.bbc.co.uk/nature/species/Pinus_longaeva

The key bits of the RDF are the description of the Great Basin bristlecone pine:

<http://www.bbc.co.uk/nature/species/Pinus_longaeva>
    rdfs:seeAlso <http://www.bbc.co.uk/nature/species> ;
    foaf:primaryTopic <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> .
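If you’d rather stay in Python than shell out to rapper, rdflib can do roughly the same conversion. A small sketch, assuming the .rdf URL above still serves parseable RDF/XML:

import rdflib

# Fetch the RDF/XML representation linked from the HTML page and
# re-serialize it as Turtle, much like the rapper command above.
graph = rdflib.Graph()
graph.parse("http://www.bbc.co.uk/nature/species/Pinus_longaeva.rdf", format="xml")
print(graph.serialize(format="turtle"))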