A few months ago I took part in a discussion on the pedantic-web list, which started out as a relatively simple question about FOAF usage, and quickly evolved into a conversation about terms people use when talking about Linked Data, and more generally the Web.
I ended up having a very helpful off-list email exchange with Richard Cyganiak (one of the architects of the Linked Data pattern) about some trouble I’ve had understanding what Information Resources and Documents are in the context of Web Architecture. My trouble lay in determining whether or not a collection of physical newspaper pages I was helping put on the web were Information Resources. I needed to know because I wanted to identify the newspaper pages with URIs and describe them as Linked Data…and the resolvability of these URIs depended largely on how I chose to answer the question.
Richard ended up offering up some advice that I’ve since found very useful, and I thought I would transcribe some of it down here just in case you might find it useful as well. My apologies to you (and Richard) if some of this seems out of context. It may really only be useful for people who are in the digital library domain, but perhaps it’s useful elsewhere.
On the subject of what a Document is, Richard offered up this way of looking at Web Documents:
The Web is a new, blank information space that is, by definition, disjoint from anything else that exists in the world. By setting up and configuring a web server, you make things pop up in that information space (by creating resolvable URIs). By definition, the things that pop up in the information space are a different beast from anything that existed before. They are web pages. They are not the same as things that exist outside of the space, like files on your hard disk, or newspaper articles.
I would avoid the term “document” when talking about representations. Representations are those ephemeral things that go over the wire. A representation is a “byte stream with a media type (and possibly other metadata)”. When I use the term “HTML document”, I mean a resource, identified by a URI, that has (only) HTML representations.
Richard encouraged me to think in terms of Web Documents and not generic Documents. I was getting tripped up by considering Newspaper Pages as Documents…which of course they are in the general sense, but characterized this way it became clear that the Newspaper Pages are not Web Documents. This view of Web Documents is supported in the Cool URIs for the Semantic Web document that he co-authored.
Richard also included some axioms that underpin how he thinks about resources in the Linked Data view:
I’m using a few rules that I think should be considered axioms of web architecture:
First, if something exists independently from the Web, then it cannot be a Web Document. (hence two resources, one for the newspaper page and one for the web page)
Second, only Web Documents can have representations (hence the need to describe the newspaper page in a web page, rather than directly providing representations of the newspaper page).
I understand these rules as axioms, that is, they should be followed because they make the system work best, not because they somehow follow from the nature of the world (they don’t).
The pragmatist in me particularly liked how these aren’t supposed to have anything to do with the Real World, but are just ways of thinking about the Web to make it work better. Finally Richard offered some advice on how to reconcile the REST and Linked Data views on identity:
I make sense of the REST worldview like this: In typical REST, all the URIs always identify web documents. The REST folks might claim that they identify other things, like users or items for sale or places on the earth, but actually they just identify a document that is about that thing. The thing itself doesn’t have an identifier. This is perfectly fine for building certain kinds of systems, so the REST guys actually get away with pretending that the URI identifies the thing. But this doesn’t allow you to do certain things, like using domain-independent vocabularies for metadata and coreference, and you get into deep trouble if you want to use this for describing web pages rather than newspaper pages.
I hope I haven’t taken any liberties quoting my conversation with Richard out of context like this. I mainly wanted to transcribe Richard’s points (which perhaps he has made elsewhere) so that I could revisit them without having to dig through my email archive … Comments welcome!
One thing that I haven’t seen mentioned so far in public (which I just discovered today) is that data.gov.uk is using RDFa to expose metadata about the datasets in a machine-readable way. What this means is that in an HTML page for a dataset like this there are some extra HTML attributes (about, property, rel) that have been thoughtfully used to express some structured metadata about the dataset, which can be extracted from the HTML and expressed, say, as Turtle:
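To make the mechanics concrete, here is a toy sketch (standard-library Python) of pulling about/property/rel attributes out of markup like this. It is not a conformant RDFa processor (no CURIE expansion, no chaining rules), and the HTML snippet is invented for illustration:

```python
from html.parser import HTMLParser

class RDFaSniffer(HTMLParser):
    """Toy extractor: collects about/property/rel attributes.
    NOT a conformant RDFa processor; it just shows where the triples live."""
    def __init__(self):
        super().__init__()
        self.subject = None
        self.pending = None   # a property attribute waiting for its text content
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "rel" in attrs and "href" in attrs and self.subject:
            self.triples.append((self.subject, attrs["rel"], attrs["href"]))
        if "property" in attrs:
            self.pending = attrs["property"]

    def handle_data(self, data):
        if self.pending and self.subject and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None

# invented sample markup, loosely shaped like a dataset page
html = '''<div about="/id/dataset/example">
  <h1 property="dc:title">Example dataset</h1>
  <a rel="foaf:homepage" href="http://example.gov/data">home</a>
</div>'''

sniffer = RDFaSniffer()
sniffer.feed(html)
for s, p, o in sniffer.triples:
    print(f'<{s}> {p} "{o}" .')
```

A real run against the live pages would want a proper RDFa distiller, but the triples fall out of the same three attributes.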
In fact since data.gov.uk has a nice paging mechanism that lists all the datasets it’s not hard to write a little script that scrapes all the metadata for the datasets (35,478 triples) right out of the web pages.
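The script itself is mostly a paging loop. Here is a rough sketch, where the listing URL pattern and the /dataset/ link shape are guesses for illustration, not the site’s actual structure:

```python
import re
from urllib.request import urlopen

# Hypothetical listing URL pattern -- check the site's actual paging
# links before running this for real.
PAGE_URL = "http://data.gov.uk/data?page=%d"

def dataset_links(html):
    """Pull dataset page links out of one listing page
    (a crude regex, good enough for a one-off scrape)."""
    return re.findall(r'href="(/dataset/[^"]+)"', html)

def crawl(fetch, max_pages=10):
    """Walk the paged listing until a page yields no dataset links.
    `fetch` maps a page number to HTML, so it can be stubbed in tests."""
    seen = []
    for page in range(1, max_pages + 1):
        links = dataset_links(fetch(page))
        if not links:
            break
        seen.extend(links)
    return seen

# live use (hypothetical URL pattern):
#   crawl(lambda p: urlopen(PAGE_URL % p).read().decode())
```

Each dataset page would then get fed through an RDFa extractor to accumulate the triples.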
I also noticed via Stéphane Corlosquet today that data.gov.uk is using the Drupal open-source content management system. To what extent Drupal 7’s new RDFa features are being used to layer in this RDFa isn’t clear to me. But it is an exciting development. It’s exciting because data.gov.uk is a great example of how to bubble up data that’s typically locked away in databases of some kind into the HTML that’s out on the web for people to interact with, and for crawlers to crawl and re-purpose.
For example I can now write a utility to check the status of the external dataset links, to make sure they are still there (200 OK). The complete results by URL can be summarized by rolling up by status code:
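The checker amounts to a HEAD request per URL plus a tally; something like this minimal sketch (the example output in the comment is invented, not real counts):

```python
from collections import Counter
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check(url, timeout=10):
    """Return the HTTP status code for a URL, or a short error string
    when the request fails before a response comes back."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as e:
        return e.code          # 404, 500, etc.
    except URLError as e:
        return str(e.reason)   # DNS failures, timeouts, ...

def rollup(statuses):
    """Summarize per-URL results by status code / error string."""
    return Counter(statuses)

# e.g. rollup(check(u) for u in dataset_urls) might give something like
# Counter({200: 3450, 404: 120, 'timed out': 17})
```

Errors that never reach HTTP (bad hostnames, timeouts) end up as strings in the same tally, which is exactly the shape of the rolled-up table below.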
Number of Datasets by status / error, including:
[Errno socket error] [Errno -2] Name or service not known
nonnumeric port: ’’
[Errno socket error] [Errno 110] Connection timed out
I realize it’s early days but here are a few things it would be fun to see at data.gov.uk:
add some RDFa and SKOS or CommonTag in tag pages like education: this would allow things to be hooked up a bit more explicitly, tags to be given nice labels, and encourage the reuse of the tagging vocabulary within and outside data.gov.uk
link the dataset descriptions to the dataset resources themselves (the pdfs, excel spreadsheets, etc) that are online using a vocabulary like the Open Archives Reuse and Exchange and/or POWDER. This would allow for the harvesting and aggregation not only of the metadata, but the datasets as well.
I imagine much of this sort of hacking around can be enabled by querying the data.gov.uk SPARQL endpoint. But it hasn’t been very clear to me exactly what data is behind it. And there is something comforting about being able to crawl the open web to find the information that’s there in open view.
Kesa’s good friend Gillian from college days in NOLA sent around an email asking for people’s favorite five songs of last year.
For some reason picking individual songs is hard for me. I guess because I rarely put on a song, and almost always put on an album–as antiquated as that sounds. I do occasionally listen to suggestions on last.fm or random songs in my player-du-jour – but then I don’t really remember the song names.
Anyhow here’s the list I cobbled together, with links out to youtube (that’ll probably break in 28 hrs):
I recently learned from Ivan Herman’s blog that O’Reilly has begun publishing RDFa in their online catalog of books. So if you go and install the RDFa Highlight bookmarklet and then visit a page like this and click on the bookmarklet you’ll see something like:
Those red boxes you see are graphical depictions of where metadata can be found interleaved in the HTML. In my screenshot you can maybe barely see an assertion involving the title being displayed:
<urn:x-domain:oreilly.com:product:9780596516499.IP> dc:title "Natural Language Processing with Python" .
But there is actually quite a lot of metadata hiding in the page, which can be found by running the page through the RDFa Distiller (quickly skim over this if your eyes glaze over when you see Turtle):
I just donated to Wikipedia because I use it everyday. I work as a software developer at the Library of Congress. I’m not ashamed to admit that I’ve spent the last 10 years filling in gaps in my computer science, math and philosophy knowledge. Working in libraries makes this sort of self-education process easier because of all the access to books, journals and whatnot. But Wikipedia has made this process much more fun and collaborative. I don’t think I could do my job without it.
I also am a Linked Data enthusiast, and appreciate the essential role that Wikipedia plays in data sets like dbpedia, yago and freebase in bootstrapping Linked Data around the world. Seeing Wikipedia pages float to the top of Google search results really brought home to me how important it is that we can use URLs as names for things in the world, and gather a shared understanding of what they are.
If you use Wikipedia I encourage you to take a moment to say thank you as well.
Last Saturday I passed the time while waiting in line at the DMV by reading the recently released Study of the North American MARC Records Marketplace. The analysis of the survey results seems to focus on the role of the Library of Congress in the marketplace, which is understandable given that LC funded the report. But there seems to be a real effort to look at LC’s role in the broader MARCetplace (sorry, I couldn’t resist).
Anyhow, I jotted down some random notes and questions in the margins, and figured I’d add them here before my notes got tossed in the circular file.
So I found this kind of surprising at the time:
7 participating distributors report that they do not acquire MARC records from external sources, but the rest do. Of those external sources, LC was predominant, followed by OCLC, LC record resellers, Library and Archives Canada, and the British National Library. Approximately 14% of respondents acquire a significant portion of their records via Z39.50 protocols and various web crawlers.
Should I be surprised that there are more LC subscribers than OCLC subscribers among the 70 distributors participating in the survey? I am surprised.
Much has changed since this law was formulated. First, LC took on a community oriented role by underwriting the CIP program, which accounted for 53,000 new titles in 2008. Second, for the past 25 years or so, LC records have been distributed electronically. This has not only lowered the cost of distribution, but has made the records easily transferable from one institution to another, often without payment. One result is that LC records are significantly underpriced, since the cost of production is not included. Another is that an entire industry has developed around free (or at least very cheap) MARC records. Consider that an LC record for a single title might appear in thousands of library catalogs, while its MARC Distribution Service lists only 74 customers, 30 of them foreign. Most copies of LC records are obtained either free (via its Z39.50 servers and WebOPAC) or purchased from OCLC or vendors who supply those records in conjunction with the materials they sell. In short, many libraries and vendors benefit from a product for which production costs are not recovered.
It would’ve been nice to see how much it costs to distribute MARC data from the LC FTP site compared with how much money LC gets through its MARC subscription program. The report points out elsewhere that LC catalogs items through the CIP program that it ends up discarding. So those aren’t technically part of the operating cost of the library–if you don’t consider the Copyright Office part of the Library of Congress. The last time I looked at the LC organization chart the Copyright Office was part of LC. Furthermore, unless I missed it, there is no indication of how many records fall into that category. Extrapolating from the 74 customers and the current price of the subscription service ($21,905), it would appear that LC gets approximately $1,620,970.00 a year in revenue from its distribution of the MARC data. It’s difficult for me to imagine that the cost of generating CIP records for items LC discards, added to the cost of operating an FTP site, comes anywhere near this number.
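That extrapolation is simple enough to check, assuming the report’s 74 customers and a $21,905 subscription price:

```python
customers = 74
subscription_price = 21_905   # USD per year, the subscription service price
revenue = customers * subscription_price
print(revenue)  # 1620970
```

So the back-of-the-envelope revenue figure is about $1.62M a year.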
Another major distribution channel involves direct downloads from LC’s Voyager database. At present, LC offers four separate interfaces:
A Web OPAC for bib records that supports 875 simultaneous users
A Web OPAC for authority records that supports 500 simultaneous users
Z39.50 direct access for users with Z39.50 clients, which supports 340 simultaneous users
Z39.50 gateway interface that supports up to 250 simultaneous users
In total, these search interfaces process about 500,000 searches each business day. While not every search leads to a download, the volume of searches is a clear indication of interest. Major users, to the degree that can be determined, include school libraries and small publics, who may not be OCLC members. In addition, vendors, open database providers, and firms such as Amazon regularly seek these records.
Wow, half a million searches a day, that’s bigger than I would’ve thought. It would be interesting to see how many actual MARC downloads there are through these services, and also to see a breakdown across services. Ironically, I think providing piecemeal access to records and supporting search interfaces such as Z39.50 has quite a high cost in practice, and that simply making bulk downloads available for free to the public via FTP or what have you would do a lot to mitigate these costs.
Lastly the findings with respect to copy cataloging were really interesting.
In looking at the median numbers of original catalogers reported, we estimated that well over 30,000 professional catalogers are at work in North America. In the earlier example, we suggested that if each of those catalogers were to produce one record per work day, that would provide the capacity to create 6.8 million records per year.
I probably missed it, but the report doesn’t seem to estimate how much backlogged material there is in the United States. Presumably it is lower than 6.8 million? It is kind of staggering to think how much untapped potential there is for original cataloging by professional catalogers in the United States. I lay the blame for the lack of original cataloging at the doorstep of archaic and arcane systems, data formats, and rules for content generation. The barrier to entry is just too high. Unfortunately the barrier to entry for getting the bibliographic data that is generated using taxpayers’ money is too high as well.
These are obviously my own rambling thoughts and not those of my employer, or anyone else I work with for that matter.
You should see full-text content for the article in the latter and not in the former:
qt2896686x repo "Wholly Visionary": the American Library Association, the
Library of Congress, and the Card Distribution Program wholly visionary the american
library association the library of congress and the card distribution program 2009
2009 2009 2009-04-01 2009-04-01 20090401 yee yy::Yee, Martha M Yee, Martha M
American Library Association American Library Association Library of Congress
Library of Congress card distribution program card distribution program shared
cataloging shared cataloging cooperative cataloging cooperative cataloging national
bibliography national bibliography cataloging rules and standards cataloging rules and
standards library history united states library history united states This paper offers a
historical review of the events and institutional influences in the nineteenth century
that led to the
The advantage to doing this became clear when I was searching for a quote from Title 2, Chapter 5, Section 150 of the US Code:
The Librarian of Congress is authorized to furnish to such institutions or individuals as may desire to buy them
We do this at the Library of Congress as well in Chronicling America to make the OCR text of historic newspaper pages available to search engines, while not burdening the UI search interface with all the (much noisier) textual content. Compare:
However we’ve got a ticket in our tracking system to revisit this practice in light of Google themselves frowning on the practice of ‘cloaking’:
Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.
We were thinking of returning the OCR text in all the responses and putting it in a background pane of some kind that can be selected. But this will most likely increase the size of the HTTP response, and may significantly impact the load time. As more and more fulltext content moves online it would be nice to have a pattern digital libraries could follow for minting URIs for books, articles, etc while still making the fulltext content available to UserAgents that can effectively use it.
Google hasn’t dropped Chronicling America’s pages from its index yet, which is a good sign. After running across a similar pattern at CDL I’m wondering if it’s OK to continue doing what we are doing. What do you think?
Update: Leigh Dodds let me know on twitter that much of the content gets into Google Scholar via cloaking.
I’ll be the first to admit the tone and content of my last post was a bit off kilter. I guess it was pretty clear immediately from the title of the post. Chalk it up to a second night of insomnia; and also to my unrealistic and probably unnecessary goal of bringing the Atom/REST camp in closer alignment with the RDF/LinkedData camp … at least in my own brain if not on the web.
So, ever the pragmatist, Ian Davis called my bluff a bit on some of the crazier stuff I said:
I know Peter Keane took a stab at this over the summer. But I couldn’t find sample output lying around on the web, so I marked up one by hand to serve as a strawman. So here’s the turtle for the LCSH concept “World Wide Web”:
Maybe I botched something? It could use a GRDDL stylesheet I suppose. At least the Atom validates. I really am a bit conflicted posting any of this here because there is so much about the Linked Data community that I like, and want to be a part of. But I’m finding it increasingly difficult to see a Linked Data future where RDF/XML is deployed all over. Instead I bleakly expect we’ll see more fragmentation, and dueling idioms/cultures … and I’m trying to see if perhaps things aren’t as bleak as they seem by grasping at what the groups have in common. Maybe John Cowan’s idea (in the comments) of coming up with an RDF serialization that is valid Atom wasn’t so bad after all? My apologies to any Linked Data folks who have helped me in the past who may have been rubbed the wrong way by my last blog post.
Update: Sean Palmer clued me in to some earlier work he has done in the area of Atom and RDF, the Atom Extensibility Framework. And Niklas Lindström let me know of some thinking he’s done on the topic that is grounded in some work he has been doing for legal information systems in Sweden.
I highly recommend giving it a read if you are interested in web services, REST, Linked Data, and simple things you can do to open up access to data. The practicality of the advice is clearly gleaned from the experience of an actual implementation over at recovery.berkeley.edu where they kick the tires on their ideas.
Erik’s blog has a succinct summary of the paper’s findings, which for me boils down to:
any data source that is frequently updated must have simple ways for synchronizing data
Web syndication is a widely deployed mechanism for presenting a list of updated web resources. The authors make a pretty strong case for Atom because of its pervasive use of identifiers for content, extensibility, rich linking semantics, paging, the potential for write-enabled services, install base, and generally just good old Resource Oriented Architecture a.k.a. REST.
Because of my interest in Linked Data the paragraph that discusses why RDF/XML wasn’t chosen as a data format is particularly interesting:
The approach described in this report, driven by a desire for openness and accessibility, uses the most widely established technologies and data formats to ensure that access to reporting data is as easy as possible. Recently, the idea of openly accessible data has been promoted under the term of “linked data”, with recent recommendations being centered around a very specific choice of technologies and data models (all centered around Semantic Web approaches focusing on RDF for data representation and centralized data storage). While it is possible to use these approaches for building Web applications, our recommendation is to use better established and more widely supported technologies, thereby lowering the barrier-to-entry and choosing a simpler toolset for achieving the same goals as with the more sophisticated technologies envisioned for the Semantic Web.
It could be argued that the growing amount of RDF/XML in the Linked Data web make it a contender for Atom’s install base–especially when you consider RSS1.0. However I think the main point the authors are making is that the tools for working with XML documents far outnumber the tools that are available for processing RDF/XML graphs. Furthermore, most programmers I know are more familiar with the processing model and standards associated with XML documents (DOM, XSLT, XPath, XQuery) compared with RDF graphs (Triples, Directed Graph, GRDDL, SPARQL). Maybe this says more about the people I know … and if I were to jump into the biomedical field I’d feel different. But perhaps the most subtle point is that whether or not developers know it, Atom expresses a Graph model just like RDF/XML … but it does it in a much more straightforward, familiar document-centric way.
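One way to see the graph hiding in an Atom entry is to read the entry’s children as predicates hanging off the atom:id. This is my own informal sketch of that mapping (not GRDDL, and the entry is invented), using nothing but the standard library:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# an invented Atom entry for illustration
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/entry/1</id>
  <title>An example entry</title>
  <link rel="alternate" href="http://example.org/1.html"/>
</entry>"""

def triples(entry):
    """Read an Atom entry as a tiny graph: atom:id is the subject,
    each child element a predicate, its text or @href the object."""
    subject = entry.findtext(ATOM + "id")
    out = []
    for child in entry:
        name = child.tag.replace(ATOM, "atom:")
        if name == "atom:id":
            continue
        obj = child.get("href") or (child.text or "").strip()
        # links carry their rel as part of the predicate
        if name == "atom:link":
            name += "[rel=%s]" % child.get("rel")
        out.append((subject, name, obj))
    return out

for t in triples(ET.fromstring(entry_xml)):
    print(t)
```

Nothing here requires an RDF toolchain; the point is just that the subject/predicate/object shape is already latent in the familiar document format.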
Of course the debate of whether RDF needed to be a part of Linked Data or not rippled through the semantic web community a few months ago–and there’s little chance of resolving any of those issues here. In the pure-land of RDF model theory the choice between Atom and RDF/XML is a bit of a false dilemma, since RDF/XML is minimally processable with, well, XML tools … and idioms like GRDDL allow Atom to be explicitly represented as an RDF Graph. And in fact, REST and Content Negotiation would allow both serializations to co-exist nicely in the context of a single web application. However, I’d argue that this point isn’t a particularly easy thing to explain, and it certainly isn’t terrain that you would want to navigate in documentation on the recovery.gov website. The choice of whether RDF belongs in Linked Data or not has technical and business considerations – but I’m increasingly seeing it as a cultural issue, one that perhaps doesn’t really even need resolving.
Even Tim Berners-Lee recognizes that there are quite large hurdles to modeling all government data on the Linked Data web in RDF and querying it with SPARQL. It’s a bit unrealistic to expect the Federal Government to start modeling and storing their enterprise data in a fundamentally new and somewhat experimental way in order to support what amounts to arbitrary database queries from anyone on the web. If that’s what the Linked Data brand is I’m not buying it. That being said, I see a great deal of value in the RDF data model (the giant global graph), especially as a tool for seeing how your data fits the contours of the web.
The important message that Erik, Eric and Raymond’s paper communicates is that the Federal Government should be focused on putting data out on the web in familiar ways, using sound web architecture practices that have allowed the web to grow and evolve into the wonderful environment it is today. Atom is a flexible, simple, commonly supported, well understood XML format for letting downstream applications know about newly published web resources. If the Federal Government is serious about the long term sustainability of efforts like recovery.gov and data.gov they should focus on enabling an ecosystem of visualization applications created by third parties, rather than trying to produce those applications themselves. I hope the data.gov folks also run across this important work. Thanks to the Sunlight Foundation for funding the folks at Berkeley.