inkdroid
http://inkdroid.org/journal
$pithy_personal_mission_statement

full link graph?
http://inkdroid.org/journal/2010/03/03/full-link-graph/
Wed, 03 Mar 2010 12:42:36 +0000, ed

Peter Norvig of Google mentioned Linked Data in his interview with Reddit Ask Me Anything (thanks Gunnar):

So right from the start researchers are writing code that use our main APIs that
are using the data that everyone else uses. If you want some web pages you use
the full copy of the web. If you want some linked data you use the full link
graph.

Update: Richard Cyganiak correctly points out that Norvig said “link data” not “linked data”. :-) At least we won’t have to ask Google if they are using SPARQL and RDF now …

a middle way for linked data at the bbc
http://inkdroid.org/journal/2010/03/02/a-middle-way-for-linked-data-at-the-bbc/
Tue, 02 Mar 2010 20:13:00 +0000, ed

I got the chance to attend the 2nd London Linked Data Meetup that was co-located with dev8d last week, which turned out to be a whole lot of fun. I figured if I waited long enough other people would save me from having to write a good summary/discussion of the event…and they have: thanks Pete Johnston, Ben Summers, Sheila Macneill, Martin Belam and Frankie Roberto.

The main thing that I took away is how much good work the BBC is doing in this space. Given the recent news of cuts at the BBC, it seems like a good time to say publicly how important some of the work they are doing is to the web technology sector. As part of the Meetup Tom Scott gave a presentation on how the BBC are using Linked Data to integrate distinct web properties in the BBC enterprise, like their Programmes and the Wildlife Finder web sites.

The basic idea is that they categorize (dare I say catalog?) television and radio content using wikipedia/dbpedia as a controlled vocabulary. Just doing this relatively simple thing means that they can create another site like the Wildlife Finder that provides a topical guide to the natural world (and also happens to use wikipedia/dbpedia as a controlled vocabulary), that then links to their audio and video content. Since the two sites share a common topic vocabulary, they are able to automatically create links from the topic guides to all the radio and television content that is on a particular topic.

For a practical example, consider this page for the Great Basin Bristlecone Pine:

If you scroll down on the page you’ll see a link to a video clip from David Attenborough’s documentary Life on the Programmes portion of the website. Now take a step back and consider how these are two separate applications in the BBC enterprise that are able to build a rich network of links between each other. It’s the shared controlled vocabulary (in this case dbpedia derived from wikipedia) which allows them to do this.
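To make the shared-vocabulary idea concrete, here is a minimal sketch in Python. It assumes each application keeps its own records and tags them with dbpedia URIs. The two dictionaries and the `link_pages_to_clips` helper are hypothetical illustrations, not the BBC's actual data model or code.

```python
from collections import defaultdict

DBPEDIA = "http://dbpedia.org/resource/"

# Hypothetical records: each application tags its own pages with dbpedia URIs.
wildlife_pages = {
    "http://www.bbc.co.uk/nature/species/Pinus_longaeva": DBPEDIA + "Pinus_longaeva",
}
programme_clips = {
    "http://www.bbc.co.uk/programmes/p005fs5p": [DBPEDIA + "Pinus_longaeva"],
}


def link_pages_to_clips(pages, clips):
    """Join topic pages to clips that share the same dbpedia identifier."""
    clips_by_topic = defaultdict(list)
    for clip_url, topics in clips.items():
        for topic in topics:
            clips_by_topic[topic].append(clip_url)
    # Neither application needs to know about the other, only about the topic.
    return {page: sorted(clips_by_topic.get(topic, []))
            for page, topic in pages.items()}


links = link_pages_to_clips(wildlife_pages, programme_clips)
print(links)
```

The join key is the only thing the two sides have to agree on, which is why a public vocabulary like dbpedia works well here.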

If you take a peek in the HTML you’ll see the resource has an alternate RDF version:

<link rel="alternate" type="application/rdf+xml" href="/nature/species/Pinus_longaeva.rdf" />
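If you wanted a program to find that alternate version, a rough sketch using only Python's standard library might look like the following. The `AlternateLinkFinder` class and `find_rdf_alternates` function are names I made up for illustration, and the HTML string is a cut-down stand-in for the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> for one media type."""

    def __init__(self, media_type):
        super().__init__()
        self.media_type = media_type
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if "alternate" in rels and a.get("type") == self.media_type and a.get("href"):
            self.hrefs.append(a["href"])


def find_rdf_alternates(page_url, html):
    finder = AlternateLinkFinder("application/rdf+xml")
    finder.feed(html)
    # The href may be relative, so resolve it against the page URL.
    return [urljoin(page_url, href) for href in finder.hrefs]


page_url = "http://www.bbc.co.uk/nature/species/Pinus_longaeva"
html = ('<html><head><link rel="alternate" type="application/rdf+xml" '
        'href="/nature/species/Pinus_longaeva.rdf" /></head></html>')
rdf_urls = find_rdf_alternates(page_url, html)
print(rdf_urls)
```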

The Resource Description Framework (RDF) is really just the best data model we have for describing stuff that’s on the Web, and the type of links between resources that are on (and off) the Web. Personally, I prefer to look at RDF as Turtle which is pretty easily done with Dave Beckett’s handy rapper utility (`aptitude install raptor-utils` if you are following from home).

rapper -o turtle http://www.bbc.co.uk/nature/species/Pinus_longaeva

The key bits of the RDF are the description of the Great Basin bristlecone pine:

<http://www.bbc.co.uk/nature/species/Pinus_longaeva>
    rdfs:seeAlso <http://www.bbc.co.uk/nature/species> ;
    foaf:primaryTopic <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> .

<http://www.bbc.co.uk/nature/species/Pinus_longaeva#species>
    dc:description "Great Basin bristlecone pines are restricted to the mountain ranges of California, Nevada and Utah and have a remarkable ability to survive in this extremely harsh and challenging environment. They grow extremely slowly, and are some of the oldest living organisms in the world. With some aged at almost 5,000 years these amazing trees can reveal information about Earth's climate variations. Amazingly, the leaves, or needles, can remain green for over 45 years." ;
    wo:class <http://www.bbc.co.uk/nature/class/Pinopsida#class> ;
    wo:family <http://www.bbc.co.uk/nature/family/Pinaceae#family> ;
    wo:genus <http://www.bbc.co.uk/nature/genus/Pinus#genus> ;
    wo:growsIn <http://www.bbc.co.uk/nature/habitats/Mountain#habitat>, <http://www.bbc.co.uk/nature/habitats/Temperate_coniferous_forest#habitat> ;
    wo:kingdom <http://www.bbc.co.uk/nature/kingdom/Plant#kingdom> ;
    wo:name <http://www.bbc.co.uk/nature/species/Pinus_longaeva#name> ;
    wo:order <http://www.bbc.co.uk/nature/order/Pinales#order> ;
    wo:phylum <http://www.bbc.co.uk/nature/phylum/Pinophyta#phylum> ;
    a wo:Species ;
    rdfs:label "Great Basin bristlecone pine" ;
    owl:sameAs <http://dbpedia.org/resource/Pinus_longaeva> ;
    foaf:depiction <http://open.live.bbc.co.uk/dynamic_images/naturelibrary_640_credits/downloads.bbc.co.uk/earth/naturelibrary/assets/p/pi/pinus_longaeva/pinus_longaeva_1.jpg> .

And then the description of the clip that is related to the topic of Great Basin bristlecone pine:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dc:title "Ancient bristlecones" ;
    po:subject <http://www.bbc.co.uk/nature/species/Pinus_longaeva#species> ;
    a po:Clip .

And we can follow our nose and fetch a description of the Ancient bristlecones clip:

rapper -o turtle http://www.bbc.co.uk/programmes/p005fs5p

Which tells us lots of stuff, like that it’s a documentary part of the science and nature genre, gives us a synopsis, and even links the clip to the episode and series it is a part of:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dc:title "Ancient bristlecones" ;
    po:format <http://www.bbc.co.uk/programmes/formats/documentaries#format> ;
    po:genre <http://www.bbc.co.uk/programmes/genres/factual/scienceandnature#genre>, <http://www.bbc.co.uk/programmes/genres/factual/scienceandnature/natureandenvironment#genre> ;
    po:long_synopsis """Bristlecone pines live at the limit of life, above 3,000m in the mountains of  western America. Almost continuous freezing temperatures and savage winds make life so tough, that these bristlecones only grow for six weeks of the year.

Everything is about conserving energy.They hardly ever shed their needles which can last more than 30 years. After centuries of being blasted by storms a full grown tree still survives with only a strip of bark a few inches wide.

These trees live life at such a slow pace they can reach a great age. Some are over 5,000 years old. It has been said of the bristlecones that to live here is to take a very long time to die.""" ;
    po:medium_synopsis "Living above 3,000 metres, North America's bristlecones cope with freezing temperatures and battering winds by only growing for six weeks of the year. But seeing as they may live for more than 5,000 years, that's still a fair bit of growing in a single lifetime. Slowly but surely does it..." ;
    po:short_synopsis "The world's oldest trees have survived 5,000 years of harsh conditions." ;
    po:version <http://www.bbc.co.uk/programmes/p005fs5r#programme> ;
    a po:Clip .

<http://www.bbc.co.uk/programmes/b00lbpcy#programme>
    po:clip <http://www.bbc.co.uk/programmes/p005fs5p#programme> ;
    a po:Series .

<http://www.bbc.co.uk/programmes/b00p90d6#programme>
    po:clip <http://www.bbc.co.uk/programmes/p005fs5p#programme> ;
    a po:Episode .

Conspicuously missing from this description is something like:

<http://www.bbc.co.uk/programmes/p005fs5p#programme>
    dcterms:subject <http://dbpedia.org/resource/Pinus_longaeva> .

But presumably it’s hiding underneath the covers in the Programmes database, and that’s what lets them link stuff up?

Also very interesting was Georgi Kobilarov’s description of Uberblic. Since Georgi helped create dbpedia and is now consulting with the BBC, it seems like uberblic is positioning itself to provide a platform for the BBC to have its own local cache of the world of Linked Data. Having a local curated view of the world of linked data is something Dan Chudnov identified as a real need at the first Linked Data workshop at code4lib 2009 for caching and proxying linked data…so it is really cool to see solutions starting to appear in this space…and for them to be adopted by institutions like the BBC.

Georgi demo’d how an edit on wikipedia would be immediately reflected in the structured data available from uberblic. It was a real time update, and extremely impressive. It looks like part of the uberblic strategy is to crawl BBC’s web site and other pockets of Linked Data to enable the sort of linking across web properties that Tom described. I’d also surmise given the realtime nature of this that Georgi is bypassing dbpedia dumps and using the Wikipedia changes atom feed in conjunction with extractors that were built as part of the dbpedia project. But I’d love to know more of the mechanics of the update. It also would be interesting to know if uberblic has a notion of versions.

The really powerful message that the BBC is helping promote is this idea that good websites are APIs. Tom mentioned Paul Downey’s notion that Web APIs Are Just Web Sites. It’s a subtle but extremely important point that I learned primarily working closely with Dan Krech for a year or so. It’s an unfortunate side effect of lots of market-driven talk about web2.0, web3.0 and Linked Data in general that this simple REST message gets lost. We took it seriously in the design of the “API” at the Library of Congress’ Chronicling America. It’s also something I tried to talk about later in the week at dev8d when I had to quickly put a presentation together:

The slides probably won’t make much sense on their own, but the basic message was that we often hear about Linked Data in terms of pushing all your data to some triple store so you can start querying it with SPARQL and doing inferencing, and suddenly you’re going to be sitting pretty, totally jacked up on the Semantic Web.

If you are like me, you’ve already got databases where things are modeled, and you’ve created little web apps that have extracted information from the databases and put them on the web as HTML docs for people around the world to read (cue some mid 1990s grunge music). Expecting people to chuck away the applications and technology stacks they have simply to say they do Linked Data is wishful thinking. What’s missing is a simple migration strategy that would allow web publishers to easily recognize the value in publishing the contents of their database as Linked Data, and how it complements the HTML (and XML, JSON) publishing they are currently doing. My advice to folks at dev8d boiled down to:

  • Keep modelling your stuff how you like
  • Identify your stuff with Cool URIs in your webapps
  • Link your stuff together in HTML
  • Link to machine friendly formats (RSS, Atom, JSON, etc)
  • Use RDF to make your database available on the web using vocabularies other people understand.
  • Start thinking about technologies like SPARQL that will let you query pools and aggregated views of your data.
  • Consider joining the public-lod discussion list and joining the conversation.
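As a small illustration of the "link to machine friendly formats" step, here is a sketch that generates `<link rel="alternate">` elements for a page, following the same pattern as the BBC page above. The `FORMATS` table and the suffix convention (`.rdf`, `.json`) are assumptions for the example; your webapp may route its representations differently.

```python
from html import escape

# Assumed convention: each representation lives at the page path plus a suffix.
FORMATS = [
    ("application/rdf+xml", ".rdf"),
    ("application/json", ".json"),
]


def alternate_links(path, formats=FORMATS):
    """Return <link> elements advertising machine friendly versions of a page."""
    return [
        '<link rel="alternate" type="%s" href="%s" />'
        % (escape(media_type), escape(path + suffix))
        for media_type, suffix in formats
    ]


for element in alternate_links("/nature/species/Pinus_longaeva"):
    print(element)
```

Dropping these into the `<head>` of the HTML you already publish costs very little, and it gives crawlers a way to follow their nose from the page to the data.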

I got some nice comments afterwards from Graham Klyne, Hugh Glaser, Adrian Stevenson and Mark Phillips so I felt pretty happy…granted most of the hard line Linked Data folks had already left a couple days earlier.

So some really exciting stuff is going on at the BBC. They are using Linked Data in a practical way that benefits their enterprise in real ways. I’m crossing my fingers and hoping that the value of what is going on here is recognized, and the various cuts that are going on won’t affect any of the fine work they are doing on improving the Web.

For more information check out the Semantic Web Case Study the folks at the BBC wrote summarizing their approach for the W3C.

web documents and axioms for linked data
http://inkdroid.org/journal/2010/02/22/web-documents-and-axioms-for-linked-data/
Tue, 23 Feb 2010 02:06:38 +0000, ed

A few months ago I took part in a discussion on the pedantic-web list, which started out as a relatively simple question about FOAF usage, and quickly evolved into a conversation about terms people use when talking about Linked Data, and more generally the Web.

I ended up having a very helpful off-list email exchange with Richard Cyganiak (one of the architects of the Linked Data pattern) about some trouble I’ve had understanding what Information Resources and Documents are in the context of Web Architecture. The trouble I had was in determining whether or not a collection of physical newspaper pages I was helping put on the web were Information Resources or not. I needed to know because I wanted to identify the newspaper pages with URIs, and describe them as Linked Data…and the resolvability of these URIs was largely dependent on how I chose to answer the question.

Richard ended up offering up some advice that I’ve since found very useful, and I thought I would transcribe some of it down here just in case you might find it useful as well. My apologies to you (and Richard) if some of this seems out of context. It may really only be useful for people who are in the digital library domain, but perhaps it’s useful elsewhere.

On the subject of what a Document is, Richard offered this way of looking at Web Documents:

The Web is a new, blank information space that is, by definition, disjoint from anything else that exists in the world. By setting up and configuring a web server, you make things pop up in that information space (by creating resolvable URIs). By definition, the things that pop up in the information space are a different beast from anything that existed before. They are web pages. They are *not* the same as things that exist outside of the space, like files on your hard disk, or newspaper articles.

I would avoid the term “document” when talking about representations. Representations are those ephemeral things that go over the wire. A representation is a “byte streams with a media type (and possibly other meta data)”. When I use the term “HTML document”, I mean a resource, identified by a URI, that has (only) HTML representations.

Richard encouraged me to think in terms of Web Documents and not generic Documents. I was getting tripped up by considering Newspaper Pages as Documents…which of course they are in the general sense, but characterized this way it became clear that the Newspaper Pages are not Web Documents. This view on Web Documents is supported in the Cool URIs for the Semantic Web that he co-authored.

Richard also included some axioms that underpin how he thinks about resources in the Linked Data view:

I’m using a few rules that I think should be considered axioms of web architecture:

First, if something exists independently from the Web, then it cannot be a Web Document. (hence two resources, one for the newspaper page and one for the web page)

Second, only Web Documents can have representations (hence the need to describe the newspaper page in a web page, rather than directly providing representations of the newspaper page).

I understand these rules as axioms, that is, they should be followed because they make the system work best, not because they somehow follow from the nature of the world (they don’t).

The pragmatist in me particularly liked how these aren’t supposed to have anything to do with the Real World, but are just ways of thinking about the Web to make it work better. Finally Richard offered some advice on how to reconcile the REST and Linked Data views on identity:

I make sense of the REST worldview like this: In typical REST, all the URIs *always* identify web documents. The REST folks might claim that they identify other things, like users or items for sale or places on the earth, but actually they just identify a document that is *about* that thing. The thing itself doesn’t have an identifier. This is perfectly fine for building certain kinds of systems, so the REST guys actually get away with pretending that the URI identifies the thing. But this doesn’t allow you to do certain things, like using domain-independent vocabularies for metadata and coreference, and you get into deep trouble if you want to use this for describing *web pages* rather than *newspaper pages*.
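One way to act on the two axioms is the hash URI pattern described in Cool URIs for the Semantic Web, which is also what BBC URIs like `.../Pinus_longaeva#species` follow. The sketch below shows why it gives you two resources: a client never sends the fragment to the server, so the URI that gets fetched is the Web Document, while the full URI names the thing. The `web_document_for` helper and the example.org newspaper URI are hypothetical.

```python
from urllib.parse import urldefrag


def web_document_for(thing_uri):
    """Return the URI a client actually requests when dereferencing a hash URI."""
    document_uri, _fragment = urldefrag(thing_uri)
    return document_uri


# The thing (a species) and the Web Document that describes it.
species = "http://www.bbc.co.uk/nature/species/Pinus_longaeva#species"
print(web_document_for(species))

# Hypothetical newspaper page: the physical page gets the hash URI,
# the web page about it gets the URI without the fragment.
newspaper_page = "http://example.org/newspapers/1910-01-01/page/1#page"
print(web_document_for(newspaper_page))
```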

I hope I haven’t taken any liberties quoting my conversation with Richard out of context like this. I mainly wanted to transcribe Richard’s points (which perhaps he has made elsewhere) so that I could revisit them, without having to dig through my email archive … Comments welcome!

data.australia.gov.au and rdfa
http://inkdroid.org/journal/2010/01/29/data-australia-gov-au-and-rdfa/
Fri, 29 Jan 2010 14:36:57 +0000, ed

In my previous blog post I was trying to demonstrate the virtues of data.gov.uk making the descriptions of their datasets available as RDFa. Just this morning I learned from Mark Birbeck that the folks down under at data.australia.gov.au did this last October!

For example this page describing a dataset for public Internet locations has this RDF metadata inside it:

<http://data.australia.gov.au/80> cc:attributionName "http://www.centrelink.gov.au/"@en-au ;
     cc:attributionURL <http://www.centrelink.gov.au/> ;
     dc:coverage.geospatial "Australia"@en-au ;
     dc:coverage.temporal "Not specified"@en-au ;
     dc:creator "Centrelink"@en-au ;
     dc:date.modified "2009-08-31"^^xsd:date ;
     dc:date.published "2009-08-31"^^xsd:date ;
     dc:description """<p xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Location of Centrelink Offices</p>
"""^^rdf:XMLLiteral ;
     dc:identifier "80"@en-au ;
     dc:keywords "<a href=\"http://data.australia.gov.au/tag/social-security\" rel=\"tag\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Social Security</a>"^^rdf:XMLLiteral ;
     dc:license "<a href=\"http://creativecommons.org/licenses/by/2.5/au/\" rel=\"licence\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\"><img alt=\"Creative Commons License\" class=\"licence\" src=\"http://i.creativecommons.org/l/by/2.5/au/88x31.png\"/>Creative Commons - Attribution 2.5 Australia (CC-BY)</a>"^^rdf:XMLLiteral ;
     dc:source "<a href=\"http://www.centrelink.gov.au/\" rel=\"dc:source\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\"/>"^^rdf:XMLLiteral ;
     dc:subject "<a href=\"http://data.australia.gov.au/catalogue/community\" rel=\"category tag\" title=\"View all posts in Community\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Community</a>,  <a href=\"http://data.australia.gov.au/catalogue/employment\" rel=\"category tag\" title=\"View all posts in Employment\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Employment</a>,  <a href=\"http://data.australia.gov.au/catalogue/government\" rel=\"category tag\" title=\"View all posts in Government\" xml:lang=\"en-au\" xmlns=\"http://www.w3.org/1999/xhtml\">Government</a>"^^rdf:XMLLiteral ;
     dc:title "Location of Centrelink Offices"@en-au ;
     dc:type <http://purl.org/dc/dcmitype/Text> ;
     agls:jurisdiction "[Commonwealth of] Australia (AU)"@en-au .

<http://www1.australia.gov.au/datasets/Federal/Centrelink/Location%20of%20Centrelink%20offices%2031_08_09/centrelink_offices_31_08_2009.CSV> dc:format "CSV"@en-au .

Now this data isn’t without problems: notice the XML literals as objects in the assertions involving subject, keyword, license and source? But it’s a Beta after all, and lots of us are learning this as we go, so Australia deserves a ton of credit. One really nice thing they are doing is making assertions about the format and URL location of the dataset itself. It would be even better if the dataset description was linked up with the dataset files using oai-ore or some other vocabulary.
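To show what a consumer has to do to get around those XML literals, here is a small sketch that digs the link target and label back out of one. It uses only the standard library, and `links_in_literal` is a name I invented for the example. The literal is copied from the `dc:keywords` value above.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def links_in_literal(xml_literal):
    """Return (href, label) pairs for each XHTML <a> inside an XML literal."""
    # Wrap in a root element because a literal can hold several sibling nodes.
    root = ET.fromstring("<root>%s</root>" % xml_literal)
    return [
        (a.get("href"), (a.text or "").strip())
        for a in root.iter(XHTML + "a")
        if a.get("href")
    ]


keywords = ('<a href="http://data.australia.gov.au/tag/social-security" rel="tag" '
            'xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml">Social Security</a>')
print(links_in_literal(keywords))
```

If the page asserted the tag URL directly as the object, none of this unpicking would be needed.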

In about 5 minutes I adapted the simplistic data.gov.uk crawler to crawl the data.australia.gov.au data. There aren’t as many datasets, so the crawler only pulled down 1725 triples (minus the xhtml triples)…but perhaps I missed some in my simplistic crawl.

Seeing both the data.gov.uk and data.australia.gov.au efforts to make dataset descriptions available makes me wonder if it could be useful for the W3C eGov Working Group to provide some lightweight guidance on how to make dataset descriptions available: what sorts of vocabularies to use, the kinds of assertions that are important, etc. It’s hard not to daydream of trying to provide an aggregated view of both pools of data, which is kept in sync using the web, and which perhaps could pull down aggregated datasets and archive them, etc. Perhaps a little spot checking tool that took a look at your HTML and let you know if it can work as a dataset description would be useful too?

data.gov.uk and rdfa
http://inkdroid.org/journal/2010/01/26/data-gov-uk-and-rdfa/
Tue, 26 Jan 2010 22:41:19 +0000, ed

The recent public release of the UK Government’s data.gov.uk site got picked up by the press last week in articles at The Guardian, Prospect Magazine and elsewhere. These have been supplemented by some more technical discussions at ReadWriteWeb, Open Knowledge Foundation, Talis, Jeni Tennison’s blog, and some helpful emails from Leigh Dodds (Talis) and Jonathan Gray (Open Knowledge Foundation) on the w3c egovernment discussion list.

One thing that I haven’t seen mentioned so far in public (which I just discovered today) is that data.gov.uk is using RDFa to expose metadata about the datasets in a machine readable way. What this means is that in an HTML page for a dataset like this there are some extra HTML attributes like about, property, rel that have been thoughtfully used to express some structured metadata about the dataset, which can be extracted from the HTML and expressed say as Turtle:

<http://data.gov.uk/id/dataset/agricultural_market_reports> dct:coverage "Great Britain (England, Scotland, Wales)"@en ;
     dct:created "2009-12-04"@en ;
     dct:creator "Department for Environment, Food and Rural Affairs"@en ;
     dct:isReferencedBy <http://data.gov.uk/wiki/index.php/Package:agricultural_market_reports> ;
     dct:license "Crown Copyright"@en ;
     dct:source <http://statistics.defra.gov.uk/esg/publications/amr/default.asp>, <https://statistics.defra.gov.uk/esg/publications/amr/default.asp> ;
     dct:subject
         <http://data.gov.uk/data/tag/agriculture>,
         <http://data.gov.uk/data/tag/agriculture-and-environment>,
         <http://data.gov.uk/data/tag/environment>,
         <http://data.gov.uk/data/tag/farm-business>,
         <http://data.gov.uk/data/tag/farm-businesses>,
         <http://data.gov.uk/data/tag/farming> .
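To give a feel for how those attributes turn into triples, here is a toy extractor in Python. It is emphatically not a conformant RDFa parser: it only looks at `about`, `property`, `content` and `rel`/`href`, does not expand prefixes, and assumes well-nested markup. The `TinyRDFa` class and the HTML fragment are made up for illustration and are not the actual data.gov.uk markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID = {"link", "meta", "img", "br", "hr", "input"}


class TinyRDFa(HTMLParser):
    """Toy extractor for @about, @property, @content and @rel/@href only."""

    def __init__(self):
        super().__init__()
        self.subjects = [None]  # subject in scope for each open element
        self.capture = None     # (depth, subject, property, text parts)
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        subject = a.get("about") or self.subjects[-1]
        if a.get("rel") and a.get("href"):
            self.triples.append((subject, a["rel"], a["href"]))
        if a.get("property"):
            if a.get("content") is not None:
                self.triples.append((subject, a["property"], a["content"]))
            elif self.capture is None and tag not in VOID:
                # No @content, so the literal is the element's text.
                self.capture = (len(self.subjects), subject, a["property"], [])
        if tag not in VOID:
            self.subjects.append(subject)

    def handle_data(self, data):
        if self.capture:
            self.capture[3].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or len(self.subjects) == 1:
            return
        self.subjects.pop()
        if self.capture and len(self.subjects) == self.capture[0]:
            _, subject, prop, parts = self.capture
            self.triples.append((subject, prop, "".join(parts).strip()))
            self.capture = None


sample = """
<div about="http://data.gov.uk/id/dataset/agricultural_market_reports">
  <span property="dct:creator">Department for Environment, Food and Rural Affairs</span>
  <a rel="dct:subject" href="http://data.gov.uk/data/tag/farming">farming</a>
</div>
"""
parser = TinyRDFa()
parser.feed(sample)
for triple in parser.triples:
    print(triple)
```

For real work you would reach for an actual RDFa processor, which also handles things like `typeof`, `resource`, datatypes and language tags.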

In fact since data.gov.uk has a nice paging mechanism that lists all the datasets it’s not hard to write a little script that scrapes all the metadata for the datasets (35,478 triples) right out of the web pages.

I also noticed via Stéphane Corlosquet today that data.gov.uk is using the Drupal open-source content management system. To what extent Drupal7’s new RDFa features are being used to layer in this RDFa isn’t clear to me. But it is an exciting development. It’s exciting because data.gov.uk is a great example of how to bubble up data that’s typically locked away in databases of some kind into the HTML that’s out on the web for people to interact with, and for crawlers to crawl and re-purpose.

For example I can now write a utility to check the status of the external dataset links, to make sure they are there (200 OK). The complete results by URL can be summarized by rolling up by status code:

Status Code                                                 Number of Datasets
200                                                         2977
404                                                         106
502                                                         23
503                                                         14
[Errno socket error] [Errno -2] Name or service not known   8
500                                                         3
nonnumeric port: ''                                         1
[Errno socket error] [Errno 110] Connection timed out       1
400                                                         1
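The roll-up itself is simple. Here is a sketch of the idea in Python, not the utility I actually ran: `check` and `summarize` are illustrative names, the checker follows redirects and uses GET, and anything that fails before an HTTP response arrives is counted under its error text.

```python
from collections import Counter
import urllib.error
import urllib.request


def check(url, timeout=10):
    """Return the HTTP status code for url, or the error text if none arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # 4xx and 5xx responses are raised as exceptions but still carry a code.
        return e.code
    except (urllib.error.URLError, OSError, ValueError) as e:
        return str(e)


def summarize(urls, checker=check):
    """Count how many URLs produced each status, most common first."""
    return Counter(checker(url) for url in urls).most_common()


# Usage with a stand-in checker so the example runs without touching the network.
fake_results = {
    "http://a.example/data.csv": 200,
    "http://b.example/data.xls": 404,
    "http://c.example/data.pdf": 200,
}
print(summarize(fake_results, checker=fake_results.get))
```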

Or I can generate a list of dataset subjects (even though it’s already available I guess). Here’s the top 25:

Subject                               Number of Datasets
health                                645
care                                  427
child                                 398
population                            341
children                              295
school                                273
health-and-social-care                271
health-well-being-and-care            205
economy                               202
economics-and-finance                 189
census                                188
education                             176
communities                           154
benefit                               153
road                                  144
children-education-and-skills         121
people-and-places                     111
government-receipts-and-expenditure   110
education-and-skills                  110
housing                               108
environment                           107
tax                                   107
life-in-the-community                 106
employment                            105
tax-credit                            96
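A list like this can be produced from the scraped triples with a few lines of Python. The sketch below assumes the triples have been saved as N-Triples, that `dct:` expands to `http://purl.org/dc/terms/`, and that the tag name is the last path segment of the subject URL. The dataset URIs in the sample are hypothetical, and the regular expression only matches triples whose object is a URI.

```python
from collections import Counter
import re

DCT_SUBJECT = "<http://purl.org/dc/terms/subject>"  # assumed expansion of dct:subject
TRIPLE = re.compile(r"^(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>)\s*\.\s*$")


def top_subjects(ntriples_lines, n=25):
    """Count tags used as dct:subject objects, most common first."""
    counts = Counter()
    for line in ntriples_lines:
        match = TRIPLE.match(line)
        if match and match.group(2) == DCT_SUBJECT:
            tag_url = match.group(3)[1:-1]
            counts[tag_url.rstrip("/").rsplit("/", 1)[-1]] += 1
    return counts.most_common(n)


lines = [
    "<http://data.gov.uk/id/dataset/a> <http://purl.org/dc/terms/subject> <http://data.gov.uk/data/tag/health> .",
    "<http://data.gov.uk/id/dataset/b> <http://purl.org/dc/terms/subject> <http://data.gov.uk/data/tag/health> .",
    "<http://data.gov.uk/id/dataset/b> <http://purl.org/dc/terms/subject> <http://data.gov.uk/data/tag/care> .",
    '<http://data.gov.uk/id/dataset/b> <http://purl.org/dc/terms/creator> "Someone"@en .',
]
print(top_subjects(lines))
```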

I realize it’s early days but here are a few things it would be fun to see at data.gov.uk:

  • add some RDFa and SKOS or CommonTag in tag pages like education: this would allow things to be hooked up a bit more explicitly, tags to be given nice labels, and encourage the reuse of the tagging vocabulary within and outside data.gov.uk
  • link the dataset descriptions to the dataset resources themselves (the pdfs, excel spreadsheets, etc) that are online using a vocabulary like the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) and/or POWDER. This would allow for the harvesting and aggregation not only of the metadata, but the datasets as well.

I imagine much of this sort of hacking around can be enabled by querying the data.gov.uk SPARQL endpoint. But it hasn’t been very clear to me exactly what data is behind there. And there is something comforting about being able to crawl the open web to find the information that’s there in the open to view.

5 Tunes for Gillian
http://inkdroid.org/journal/2010/01/11/5-tunes-for-gillian/
Mon, 11 Jan 2010 19:24:52 +0000, ed

Kesa’s good friend Gillian from college days in NOLA sent around an email asking for people’s favorite five songs of last year.

For some reason picking individual songs is hard for me. I guess because I rarely put on a song, and almost always put on an album–as antiquated as that sounds. I do occasionally listen to suggestions on last.fm or random songs in my player-du-jour — but then I don’t really remember the song names.

Anyhow here’s the list I cobbled together, with links out to youtube (that’ll probably break in 28 hrs):

Hacking O’Reilly RDFa
http://inkdroid.org/journal/2009/12/22/hacking-oreilly-rdfa/
Tue, 22 Dec 2009 18:17:03 +0000, ed

I recently learned from Ivan Herman’s blog that O’Reilly has begun publishing RDFa in their online catalog of books. So if you go and install the RDFa Highlight bookmarklet and then visit a page like this and click on the bookmarklet you’ll see something like:



Those red boxes you see are graphical depictions of where metadata can be found interleaved in the HTML. In my screenshot you can maybe barely see an assertion involving the title being displayed:

<urn:x-domain:oreilly.com:product:9780596516499.IP> dc:title "Natural Language Processing with Python"

But there is actually quite a lot of metadata hiding in the page, which can be found by running the page through the RDFa Distiller (quickly skim over this if your eyes glaze over when you see Turtle):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix frbr: <http://vocab.org/frbr/core#> .
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:x-domain:oreilly.com:product:9780596516499.IP> a frbr:Expression ;
     dc:creator <urn:x-domain:oreilly.com:agent:pdb:3343>, <urn:x-domain:oreilly.com:agent:pdb:3501>, <urn:x-domain:oreilly.com:agent:pdb:3502> ;
     dc:issued "2009-06-12"^^xsd:dateTime ;
     dc:publisher "O'Reilly Media"@en ;
     dc:title "Natural Language Processing with Python"@en ;
     frbr:embodiment <urn:x-domain:oreilly.com:product:9780596516499.BOOK>, <urn:x-domain:oreilly.com:product:9780596803346.SAF>, <urn:x-domain:oreilly.com:product:9780596803391.EBOOK> . 

<http://customer.wileyeurope.com/CGI-BIN/lansaweb?procfun+shopcart+shcfn01+funcparms+parmisbn(a0130):9780596516499+parmqty(p0050):1+parmurl(l0560):http://oreilly.com/store/> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:hasPriceSpecification
                 [ a gr:UnitPriceSpecification ;
                     gr:hasCurrency "GBP"@en ;
                     gr:hasCurrencyValue "34.50"^^xsd:float
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596516499.BOOK>
         ] . 

<http://my.safaribooksonline.com/9780596803346> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596803346.SAF>
         ] . 

<https://epoch.oreilly.com/shop/cart.orm?p=BUNDLE&prod=9780596516499.BOOK&prod=9780596803391.EBOOK&bundle=1&retUrl=http%3A%252F%252Foreilly.com%252Fstore%252F> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:includesObject
                 [ a gr:TypeAndQuantityNode ;
                     gr:ammountOfThisGood "1"^^xsd:float ;
                     gr:hasPriceSpecification
                         [ a gr:UnitPriceSpecification ;
                             gr:hasCurrency "None"@en ;
                             gr:hasCurrencyValue "49.49"^^xsd:float
                         ] ;
                     gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596803391.EBOOK>
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596516499.BOOK>
         ] . 

<https://epoch.oreilly.com/shop/cart.orm?prod=9780596516499.BOOK> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:hasPriceSpecification
                 [ a gr:UnitPriceSpecification ;
                     gr:hasCurrency "USD"@en ;
                     gr:hasCurrencyValue "44.99"^^xsd:float
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596516499.BOOK>
         ] . 

<https://epoch.oreilly.com/shop/cart.orm?prod=9780596803391.EBOOK> a gr:Offering ;
     gr:includesObject
         [ a gr:TypeAndQuantityNode ;
             gr:ammountOfThisGood "1"^^xsd:float ;
             gr:hasPriceSpecification
                 [ a gr:UnitPriceSpecification ;
                     gr:hasCurrency "USD"@en ;
                     gr:hasCurrencyValue "35.99"^^xsd:float
                 ] ;
             gr:typeOfGood <urn:x-domain:oreilly.com:product:9780596803391.EBOOK>
         ] . 

<urn:x-domain:oreilly.com:agent:pdb:3343> a foaf:Person ;
     foaf:homepage <http://www.oreillynet.com/pub/au/3614> ;
     foaf:name "Steven Bird"@en . 

<urn:x-domain:oreilly.com:agent:pdb:3501> a foaf:Person ;
     foaf:homepage <http://www.oreillynet.com/pub/au/3615> ;
     foaf:name "Ewan Klein"@en . 

<urn:x-domain:oreilly.com:agent:pdb:3502> a foaf:Person ;
     foaf:homepage <http://www.oreillynet.com/pub/au/3616> ;
     foaf:name "Edward Loper"@en . 

<urn:x-domain:oreilly.com:product:9780596803346.SAF> a frbr:Manifestation ;
     dc:type <http://purl.oreilly.com/product-types/SAF> . 

<urn:x-domain:oreilly.com:product:9780596803391.EBOOK> a frbr:Manifestation ;
     dc:identifier <urn:isbn:9780596803391> ;
     dc:issued "2009-06-12"^^xsd:dateTime ;
     dc:type <http://purl.oreilly.com/product-types/EBOOK> . 

<urn:x-domain:oreilly.com:product:9780596516499.BOOK> a frbr:Manifestation ;
     dc:extent """512"""@en ;
     dc:identifier <urn:isbn:9780596516499> ;
     dc:issued "2009-06-19"^^xsd:dateTime ;
     dc:type <http://purl.oreilly.com/product-types/BOOK> .

So that’s a lot of data. The nice thing about rdf is that you can look at the vocabularies that are being used to get an idea of the rough shape of the underlying data. Just looking at the namespace prefixes we can see that O’Reilly has chosen to use the following vocabularies:

I was curious so I wrote a little crawler (41 lines of Python+rdflib) to collect all the metadata from the O’Reilly Catalog pages. Yes all the pages! It ended up pulling down 92,101 triples, which I’ve made available as rdf/xml and ntriples files.

A nice side effect of having the data as a big ntriples file is you can do unix pipe tricks with sort, cut, uniq like this to get some ballpark numbers on what types of resources are in the rdf graph:

ed@curry:~/Projects/oreilly-crawler$ rdfsum catalog.nt
   6803 <http://purl.org/goodrelations/v1#TypeAndQuantityNode>
   5861 <http://purl.org/goodrelations/v1#Offering>
   4564 <http://purl.org/goodrelations/v1#UnitPriceSpecification>
   4065 <http://vocab.org/frbr/core#Manifestation>
   2100 <http://vocab.org/frbr/core#Expression>
   2023 <http://xmlns.com/foaf/0.1/Person>
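
The same tally can be sketched in a few lines of plain Python, for anyone without the shell pipeline handy. This is a minimal sketch and not the rdfsum script itself: the function name `count_types` and the sample lines are made up for illustration, and it assumes well-formed N-Triples with one triple per line.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def count_types(lines):
    """Tally the object of every rdf:type triple in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # One triple per line: subject, predicate, then the object followed by " ."
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        _subject, predicate, rest = parts
        if predicate == RDF_TYPE:
            obj = rest.rstrip()
            if obj.endswith("."):
                obj = obj[:-1].rstrip()  # drop the statement terminator
            counts[obj] += 1
    return counts

# Hypothetical sample lines; with the real data you would pass open("catalog.nt").
sample = [
    '<urn:x-domain:oreilly.com:agent:pdb:3343> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .',
    '<urn:x-domain:oreilly.com:agent:pdb:3343> <http://xmlns.com/foaf/0.1/name> "Steven Bird"@en .',
    '<urn:x-domain:oreilly.com:agent:pdb:3501> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .',
    '_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#Offering> .',
]

for rdf_type, n in count_types(sample).most_common():
    print(f"{n:7d} {rdf_type}")
```

Splitting on whitespace only twice keeps literals with spaces (like the name above) intact in the third field, which is why the name triple is simply skipped rather than mis-parsed.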

Another nice thing about pulling the RDFa down with rdflib is you end up with a little berkeleydb triple store which you can query with SPARQL, say to pull out all the authors and titles:

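    # dct is assumed here to be bound to Dublin Core Terms
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?title ?author
    WHERE {
      ?title_uri dct:title ?title .
      ?title_uri dct:creator ?author_uri .
      ?author_uri foaf:name ?author .
    }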

And adding a little bit of networkx judo you can get an xmas-friendly graph of authors (the green dots are books and the red ones are authors; I limited author labels to authors who had written more than 2 books).
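
For a rough idea of the bookkeeping behind that picture, here is a minimal sketch using only the standard library rather than networkx, so it is not the code that drew the graph. It groups (title, author) rows, like the ones the SPARQL query returns, into an author-to-books mapping and picks out the authors who would get a label. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict

def build_author_graph(rows):
    """Group (title, author) pairs into {author: set of titles}.

    Each author/title pair is one edge in the bipartite author-book graph.
    """
    books_by_author = defaultdict(set)
    for title, author in rows:
        books_by_author[author].add(title)
    return books_by_author

def authors_to_label(books_by_author, more_than=2):
    """Return the authors with more than `more_than` books, sorted by name."""
    return sorted(
        author
        for author, titles in books_by_author.items()
        if len(titles) > more_than
    )

# Hypothetical rows standing in for the SPARQL results.
rows = [
    ("Book 1", "Author A"),
    ("Book 2", "Author A"),
    ("Book 3", "Author A"),
    ("Book 3", "Author B"),
    ("Book 4", "Author B"),
]

graph = build_author_graph(rows)
print(authors_to_label(graph))
```

Using a set per author means a duplicate row for the same book does not inflate that author's count.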

Admittedly this is not very readable, but I imagine someone with more network visualization skillz could do something nicer in short order. There’s a lot that could be done with the data. This exercise was mainly just to demonstrate how layering some new stuff into your HTML can really open up doors for how people use your website. Clearly O’Reilly did some deep thinking about what data they had, and what vocabularies they wanted to model it with. But once they’d done that they probably just had to go add 50 lines to an HTML template somewhere, and it was published (props to David Brunton for this turn of phrase). It’s a really good sign that a tech publisher with the stature of O’Reilly is giving this method of data publishing a try.

My only suggestion (for anyone at O’Reilly who might be reading) would be that it would be nice if they used HTTP URLs instead of URNs for People, Works and Expressions. I understand why they did it: using URNs eases deployment somewhat since you don’t have to worry about httpRange-14 stuff. But I think they could easily use a hash URI instead of a URN, and make it easy for people to link to their stuff in other data. The Cool URIs for the Semantic Web document has some other patterns they might want to consider, but simply adding a hash to their existing page URIs should do the trick. So for example, consider if OpenLibrary wanted to link their notion of a book to O’Reilly’s notion of a book with owl:sameAs. If they used the URN they’d have:

<http://openlibrary.org/b/OL23978297M> owl:sameAs <urn:x-domain:oreilly.com:product:9780596516499.IP> .

but if O’Reilly identified their expressions with a URL they would enable something like:

<http://openlibrary.org/b/OL23978297M> owl:sameAs <http://oreilly.com/catalog/9780596516499#expression> .

This may seem like a minor point, but it’s really important to be able to follow your nose on the web–particularly in Linked Data. If a piece of software ran across the O’Reilly URL in a chunk of OpenLibrary RDF, the program could HTTP GET it, and learn more stuff about the book. But if it got the URN it wouldn’t really know how to fetch a representation for that resource without some special case logic that mapped the URN to a URL. There is a reason why Tim Berners-Lee included the following as the second of his design principles for Linked Data:

Use HTTP URIs so that people can look up those names.
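
To make the follow-your-nose point concrete, here is a minimal sketch of the decision a generic client faces when it meets a URI in someone's RDF. The function name `fetchable_url` is made up for illustration, and a real client would go on to do content negotiation, which this sketch leaves out.

```python
from urllib.parse import urldefrag, urlsplit

def fetchable_url(uri):
    """Return the URL a generic client could HTTP GET for this URI, or None.

    A hash URI is fetched by dropping the fragment; a URN gives a generic
    client nothing to dereference without a special-case mapping.
    """
    scheme = urlsplit(uri).scheme.lower()
    if scheme not in ("http", "https"):
        return None
    url, _fragment = urldefrag(uri)
    return url

for uri in (
    "http://oreilly.com/catalog/9780596516499#expression",
    "urn:x-domain:oreilly.com:product:9780596516499.IP",
):
    print(uri, "->", fetchable_url(uri))
```

The hash URI resolves to the existing catalog page, which is why adding a fragment to the page URIs is such a cheap fix.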

Anyhow, hats off to O’Reilly for putting RDFa into practice. I hope the rest of the publishing (and library) world takes note. If you are looking to learn more about RDFa, Ben Adida and Mark Birbeck’s RDFa Primer: Bridging the Human and Data Webs is a really nice intro.

thank you wikipedia
http://inkdroid.org/journal/2009/12/15/thank-you-wikipedia/
Tue, 15 Dec 2009 19:34:16 +0000, ed

I just donated to Wikipedia because I use it every day. I work as a software developer at the Library of Congress. I’m not ashamed to admit that I’ve spent the last 10 years filling in gaps in my computer science, math and philosophy knowledge. Working in libraries makes this sort of self-education process easier because of all the access to books, journals and whatnot. But Wikipedia has made this process much more fun and collaborative. I don’t think I could do my job without it.

I also am a Linked Data enthusiast, and appreciate the essential role that Wikipedia plays in data sets like dbpedia, yago and freebase in bootstrapping Linked Data around the world. Seeing Wikipedia pages float to the top of Google search results really brought home to me how important it is that we can use URLs as names for things in the world, and gather a shared understanding of what they are.

If you use Wikipedia I encourage you to take a moment to say thank you as well.

MARCetplace
http://inkdroid.org/journal/2009/11/10/marcetplace/
Tue, 10 Nov 2009 19:41:36 +0000, ed

Last Saturday I passed the time while waiting in line at the DMV by reading the recently released Study of the North American MARC Records Marketplace. The analysis of the survey results seems to focus on the role of the Library of Congress in the marketplace, which is understandable given that LC funded the report. But there seems to be a real effort to look at LC’s role in the broader MARCetplace (sorry I couldn’t resist).

Anyhow, I jotted down some random notes and questions in the margins, and
figured I’d add them here before my notes got tossed in the circular file.

So I found this kind of surprising at the time:

7 participating distributors report that they do not acquire MARC records from external sources, but the rest do. Of those external sources, LC was predominant, followed by OCLC, LC record resellers, Library and Archives Canada, and the British National Library. Approximately 14% of respondents acquire a significant portion of their records via Z39.50 protocols and various web crawlers.

p. 19

Should I be surprised that there are more LC subscribers than OCLC subscribers
among the 70 distributors participating in the survey? I am surprised.

Much has changed since this law was formulated. First, LC took on a community oriented role by underwriting the CIP program, which accounted for 53,000 new titles in 2008. Second, for the past 25 years or so, LC records have been distributed electronically. This has not only lowered the cost of distribution, but has made the records easily transferable from one institution to another, often without payment. One result is that LC records are significantly underpriced, since the cost of production is not included. Another is that an entire industry has developed around free (or at least very cheap) MARC records. Consider that an LC record for a single title might appear in thousands of library catalogs, while its MARC Distribution Service lists only 74 customers, 30 of them foreign. Most copies of LC records are obtained either free (via its Z39.50 servers and WebOPAC) or purchased from OCLC or vendors who supply those records in conjunction with the materials they sell. In short, many libraries and vendors benefit from a product for which production costs are not recovered.

pp. 26-27

It would’ve been nice to see how much it costs to distribute MARC data from the LC FTP site compared with how much money LC gets through its MARC subscription program. The report points out elsewhere that LC catalogs items through the CIP program that it ends up discarding. So they aren’t technically part of the operating cost of the library, at least if you don’t consider the Copyright Office part of the Library of Congress. The last time I looked at the LC organization chart the Copyright Office was part of LC. Furthermore, unless I missed it there is no indication of how many records there are in that category. Extrapolating from the 74 customers and the current price of the subscription service ($21,905), it would appear that LC gets approximately $1,620,970 a year (74 × $21,905) in revenue from its distribution of the MARC data. It’s difficult for me to imagine that the cost of generating CIP records for items that LC discards, added to the cost of operating an FTP site, equals this number.

Another major distribution channel involves direct downloads from LC’s Voyager database. At present, LC offers four separate interfaces:

  • A Web OPAC for bib records that supports 875 simultaneous users
  • A Web OPAC for authority records that supports 500 simultaneous users
  • Z39.50 direct access for users with Z39.50 clients, which supports 340 simultaneous users
  • Z39.50 gateway interface that supports up to 250 simultaneous users

In total, these search interfaces process about 500,000 searches each business day. While not every search leads to a download, the volume of searches is a clear indication of interest. Major users, to the degree that can be determined, include school libraries and small publics, who may not be OCLC members. In addition, vendors, open database providers, and firms such as Amazon regularly seek these records.

p. 35

Wow, half a million searches a day, that’s bigger than I would’ve thought. It would be interesting to see how many actual MARC downloads there are through these services, and also to see a breakdown across services. Ironically, I think providing piecemeal access to records, and supporting search interfaces such as Z39.50, have quite a high cost in practice, and that simply making bulk downloads available for free to the public via FTP or what have you would do a lot to mitigate these costs.

Lastly the findings with respect to copy cataloging were really interesting.

In looking at the median numbers of original catalogers reported, we estimated that well over 30,000 professional catalogers are at work in North America. In the earlier example, we suggested that if each of those catalogers were to produce one record per work day, that would provide the capacity to create 6.8 million records per year.

p. 36

I probably missed it, but the report doesn’t seem to estimate how much backlogged material there is in the United States. Presumably it is lower than 6.8 million? It is kind of staggering to think how much untapped potential there is for original cataloging by professional catalogers in the United States. I lay the blame for the lack of original cataloging at the doorstep of archaic and arcane systems, data formats, and rules for content generation. The barrier to entry is just too high. Unfortunately the barrier to entry for getting the bibliographic data that is generated using taxpayers’ money is too high as well.

These are obviously my own rambling thoughts and not those of my employer, or anyone else I work with for that matter.

cloaking and fulltext
http://inkdroid.org/journal/2009/11/10/cloaking-and-fulltext/
Tue, 10 Nov 2009 15:16:50 +0000, ed

It’s comforting to know that the California Digital Library is selectively serving up fulltext content in HTML from their institutional repository for search engines to chew on. For example, compare the output of:

curl http://escholarship.org/uc/item/2896686x

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://escholarship.org/uc/item/2896686x

You should see full-text content for the article in the latter and not in the former:

...
qt2896686x repo "Wholly Visionary": the American Library Association, the Library of Congress, and the Card Distribution Program wholly visionary the american library association the library of congress and the card distribution program 2009 2009 2009 2009-04-01 2009-04-01 20090401 yee yy::Yee, Martha M Yee, Martha M American Library Association American Library Association Library of Congress Library of Congress card distribution program card distribution program shared cataloging shared cataloging cooperative cataloging cooperative cataloging national bibliography national bibliography cataloging rules and standards cataloging rules and standards library history united states library history united states This paper offers a historical review of the events and institutional influences in the nineteenth century that led to the ...
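
If you’d rather script the comparison than eyeball two curl outputs, here is a minimal sketch with Python’s standard library. The helper names (`build_request`, `fetch_body`, `compare_lengths`) are made up for illustration, and comparing response sizes is only a rough signal that different content is being served, not a definitive test for cloaking.

```python
import urllib.request

GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"

def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)

def fetch_body(url, user_agent=None, timeout=30):
    """Fetch the response body as bytes (this performs a network request)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()

def compare_lengths(url):
    """Return (bytes served by default, bytes served to a Googlebot User-Agent)."""
    return len(fetch_body(url)), len(fetch_body(url, GOOGLEBOT_UA))

# Example call, left commented out because it hits the network:
# print(compare_lengths("http://escholarship.org/uc/item/2896686x"))
```

Note that without an override urllib sends its own default User-Agent rather than curl’s, so the first number may not match what curl sees exactly.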

The advantage to doing this is that when I was searching for a quote from Title 2, Chapter 5, Section 150 of the US Code:

The Librarian of Congress is authorized to furnish to such institutions or individuals as may desire to buy them

I found Martha Yee’s paper “Wholly Visionary”: the American Library Association, the Library of Congress, and the Card Distribution Program as the 5th hit in the search results.

We do this at the Library of Congress as well in Chronicling America to make the OCR text of historic newspaper pages available to search engines, while not burdening the UI search interface with all the (much noisier) textual content. Compare:

curl http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/

with:

curl --header "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)" http://chroniclingamerica.loc.gov/lccn/sn84026749/1908-04-09/ed-1/seq-11/

However we’ve got a ticket in our tracking system to revisit this practice in light of Google themselves frowning on the practice of ‘cloaking’:

Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.

We were thinking of returning the OCR text in all the responses and putting it in a background pane of some kind that can be selected. But this will most likely increase the size of the HTTP response, and may significantly impact the load time. As more and more fulltext content moves online it would be nice to have a pattern digital libraries could follow for minting URIs for books, articles, etc while still making the fulltext content available to UserAgents that can effectively use it.

Google hasn’t dropped Chronicling America’s pages from its index yet, which is a good sign. After running across a similar pattern at CDL I’m wondering if it’s OK to continue doing what we are doing. What do you think?

Update: Leigh Dodds let me know on Twitter that much of the content gets into Google Scholar via cloaking.
