wikipedia external links: a redis database

As part of my continued meandering linkypedia v2 experiments I created a Redis database of high level statistics about host names and top-level-domain names in external links from Wikipedia articles. Tom Morris happened to mention he has been loading the external links as well (thanks Alf), so I thought I’d make the redis database dump available to anyone that is interested in looking at it. If you want to give it a whirl try this out:

% wget http://inkdroid.org/data/wikipedia-extlinks.rdb
% sudo aptitude install redis-server
% sudo mv wikipedia-extlinks.rdb /var/lib/redis/dump.rdb
% sudo chown redis:redis /var/lib/redis/dump.rdb
% sudo /etc/init.d/redis-server restart
% sudo pip install redis # or easy_install (version in apt is kinda old)
% python 
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> r = redis.Redis()
>>> r.zrevrange("hosts", 0, 25, True)
[('toolserver.org', 2360809.0), ('www.ncbi.nlm.nih.gov', 508702.0), ('dx.doi.org', 410293.0), ('commons.wikimedia.org', 408986.0), ('www.imdb.com', 398877.0), ('www.nsesoftware.nl', 390636.0), ('maps.google.com', 346997.0), ('books.google.com', 323111.0), ('news.bbc.co.uk', 214738.0), ('tools.wikimedia.de', 181215.0), ('edwardbetts.com', 168102.0), ('dispatch.opac.d-nb.de', 166322.0), ('web.archive.org', 165665.0), ('www.insee.fr', 160797.0), ('www.iucnredlist.org', 155620.0), ('stable.toolserver.org', 155335.0), ('www.openstreetmap.org', 154127.0), ('d-nb.info', 141504.0), ('ssd.jpl.nasa.gov', 137200.0), ('www.youtube.com', 133827.0), ('www.google.com', 131011.0), ('www.census.gov', 124182.0), ('www.allmusic.com', 117602.0), ('maps.yandex.ru', 114978.0), ('news.google.com', 102111.0), ('amigo.geneontology.org', 95972.0)]
>>> r.zrevrange("hosts:edu", 0, 25, True)
[('nedwww.ipac.caltech.edu', 28642.0), ('adsabs.harvard.edu', 25699.0), ('animaldiversity.ummz.umich.edu', 21747.0), ('www.perseus.tufts.edu', 20438.0), ('genome.ucsc.edu', 20290.0), ('cfa-www.harvard.edu', 14234.0), ('penelope.uchicago.edu', 9806.0), ('www.bucknell.edu', 8627.0), ('www.law.cornell.edu', 7530.0), ('biopl-a-181.plantbio.cornell.edu', 5747.0), ('ucjeps.berkeley.edu', 5452.0), ('plato.stanford.edu', 5243.0), ('www.fiu.edu', 5004.0), ('www.volcano.si.edu', 4507.0), ('calphotos.berkeley.edu', 4446.0), ('www.usc.edu', 4345.0), ('ftp.met.fsu.edu', 3941.0), ('web.mit.edu', 3548.0), ('www.lpi.usra.edu', 3497.0), ('insects.tamu.edu', 3479.0), ('www.cfa.harvard.edu', 3447.0), ('www.columbia.edu', 3260.0), ('www.yale.edu', 3122.0), ('www.fordham.edu', 2963.0), ('www.people.fas.harvard.edu', 2908.0), ('genealogy.math.ndsu.nodak.edu', 2726.0)]
>>> r.zrevrange("tlds", 0, 25, True)
[('com', 11368104.0), ('org', 7785866.0), ('de', 1857158.0), ('gov', 1767137.0), ('uk', 1489505.0), ('fr', 1173624.0), ('ru', 897413.0), ('net', 868337.0), ('edu', 793838.0), ('jp', 733995.0), ('nl', 707177.0), ('pl', 590058.0), ('it', 486441.0), ('ca', 408163.0), ('au', 387764.0), ('info', 296508.0), ('br', 276599.0), ('es', 242767.0), ('ch', 224692.0), ('us', 179223.0), ('at', 163397.0), ('be', 132395.0), ('cz', 92683.0), ('eu', 91671.0), ('ar', 89856.0), ('mil', 87788.0)]
>>> r.zscore("hosts", "www.bbc.co.uk")
56245.0

Basically there are a few sorted sets in there:

  • “hosts”: all the hosts sorted by the number of external links
  • “hosts:%s”: where %s is a top level domain (“com”, “uk”, etc.)
  • “tlds”: all the tlds sorted by the number of external links
  • “wikipedia”: the wikipedia languages sorted by total number of external links
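For example, assuming the key naming conventions above (and a local redis-server loaded with the dump), the other sets can be queried the same way as in the session up top; a quick sketch:

import redis

r = redis.Redis()

# top 10 hosts linked from articles under the .uk top level domain
print r.zrevrange("hosts:uk", 0, 9, withscores=True)

# wikipedia language editions ranked by their total number of external links
print r.zrevrange("wikipedia", 0, 9, withscores=True)

# how many external links use a particular host or tld
print r.zscore("hosts", "www.bbc.co.uk")
print r.zscore("tlds", "edu")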

I’m not exactly sure how portable redis databases are, but I was able to move this one between a couple of Ubuntu machines, and Ryan successfully looked at it on a Gentoo box he had available. You’ll need roughly 300MB of RAM available. I must say I was impressed with redis, and in particular with sorted sets, for this stats collection task. Thanks to Chris Adams for pointing me in the direction of redis in the first place.


Wikipedia 10

Wikipedia is turning ten years old on January 15th, and celebratory gatherings are going on around the globe, including one in my area (Washington DC) on January 22 at the National Archives.

Like you, I’ve been an accidental user of Wikipedia when searching for a topic to learn more about. Over time I have started actively searching on Wikipedia for things, linking to Wikipedia articles (in email and HTML), and even saying thank you with a small donation. I’ve only attended one of the local DC chapter events before, but am definitely planning on attending the DC event to meet other people who value Wikipedia as a resource.

Perhaps also like you (since you are reading this blog) I work in a cultural heritage organization, a library to be precise. I wasn’t able to attend the recent Galleries, Libraries, Archives, Museums & Wikimedia conference at the British Museum last November, but I have recently been listening to the audio they kindly provided, with great interest. If you are interested in the role that cultural heritage organizations can play on Wikipedia, and the Web in general, definitely check it out if you’ve got a long commute (or time to kill) and an audio device of some kind. There are lots of museums, galleries, archives and libraries in the DC area, so I’m hoping that this event on the 22nd will be an opportunity for folks to get together across institutional lines to talk about how they are engaging with the Wikipedia community. Who knows, maybe it could be a precursor to a similar GLAM-WIKI event here in DC?

I’m planning on doing a lightning talk about this side/experimental project I’ve been working on called linkypedia. The basic idea is to give web publishers (and specifically cultural heritage organizations like libraries, museums and archives) an idea of how their content is being used as primary resource material on Wikipedia. The goal is to validate the work that these institutions have done to make this content available, to encourage them to do more…and also to get them to engage with the Wikipedia community. Version 1 of the (open source) software is running here on my minimal Linode VPS. But I’m also working on version 2, which will hopefully scale a bit better, and provide a more global (not just English Wikipedia) and real time picture of how your stuff is being used on Wikipedia. Part of the challenge is figuring out how to pay for it, given the volume of external links in the major language Wikipedias. I’m hoping a tip-jar and thoughtful use of Amazon EC2 will be enough.

If you are interested in learning more about the event on the 22nd check out Katie Filbert’s recent post to the Sunlight Labs Discussion List, and the (ahem) wiki page to sign up! Thanks Mark for letting me know about the birthday celebration last week in IRC. Sometimes with all the discussion lists, tweets, and blogs things like this slip by without me noticing them. So a direct prod in IRC helps :-)


on preserving bookmarks

While it’s not exactly clear what the future of Delicious is, the recent news about Yahoo closing the doors or selling the building prompted me to look around at other social bookmarking tools, and to revisit some old stomping grounds.

Dan Chudnov has been running Unalog since 2003 (roughly when Delicious started). In fact I can remember Dan and Joshua Schacter having some conversations about the idea of social bookmarking as both of the services co-evolved. So my first experience with social bookmarking was on Unalog, but a year or so later I ended up switching to Delicious in 2004 for reasons I can’t quite remember. I think I liked some of the tools that had sprouted up around Delicious, and felt a bit guilty for abandoning Unalog.

Anyhow, I wanted to take the exported Delicious bookmarks and see if I could get them into Unalog. So I set up a dev Unalog environment, created a friendly fork of Dan’s code, and added the ability to POST a chunk of JSON:

curl --user user:pass \
     --header "Content-type: application/json" \
     --data '{"url": "http://example.com", "title": "Example"}' \
     http://unalog.com/entry/new

Here’s a fuller example of the JSON that you can supply:

{
  "url": "http://zombo.com",
  "title": "ZOMBO",
  "comment": "found this awesome website today",
  "tags": "website awesome example",
  "content": "You can do anything at Zombo.com. The only limit is yourself. Etc...",
  "is_private": true
}

The nice thing about Unalog is that if you supply the content field, Unalog will index the text of the resource you’ve bookmarked. This allows you to do fulltext searches over your bookmarked materials.
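For what it’s worth, here’s the same kind of POST from Python instead of curl. It’s just a sketch: the endpoint, auth style and JSON fields come from the examples above, and everything else (the helper name, the entry values) is made up.

import base64
import json
import urllib2

def post_entry(entry, user, password, host="http://unalog.com"):
    # POST a bookmark entry (a dict like the JSON example above) to unalog
    request = urllib2.Request(host + "/entry/new", data=json.dumps(entry))
    request.add_header("Content-type", "application/json")
    request.add_header("Authorization", "Basic " + base64.b64encode("%s:%s" % (user, password)))
    return urllib2.urlopen(request)

post_entry({"url": "http://example.com", "title": "Example",
            "tags": "example test",
            "content": "text for unalog to index"},
           "user", "pass")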

So yeah, to make a long story a bit shorter, I created a script that reads in the bookmarks from a Delicious bookmark export (an HTML file) and pushes them up to a Unalog instance. Since the script GETs each bookmarked URL in order to send Unalog the content to index, you also get a log of HTTP status codes, which provides the fodder for a linkrot report like:

200 OK 4546
404 Not Found 367
403 Forbidden 141
DNS failure 81
Connection refused 28
500 Internal Server Error 19
503 Service Unavailable 10
401 Unauthorized 9
410 Gone 5
302 Found 5
400 Bad Request 4
502 Bad Gateway 2
412 Precondition Failed 1
300 Multiple Choices 1
201 Created 1
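For the curious, the script boiled down to something like the following rough sketch (not the actual code; the export filename is made up, and the export is assumed to be the usual bookmark HTML full of <a href="..."> tags):

import urllib2
from collections import defaultdict
from HTMLParser import HTMLParser

class BookmarkParser(HTMLParser):
    # collect the href of every <a> tag in the Delicious export file
    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.urls.extend(value for name, value in attrs if name == "href")

parser = BookmarkParser()
parser.feed(open("delicious-export.html").read())

tally = defaultdict(int)
for url in parser.urls:
    try:
        tally[urllib2.urlopen(url).getcode()] += 1
    except urllib2.HTTPError, e:   # 4xx and 5xx responses
        tally[e.code] += 1
    except Exception, e:           # DNS failures, refused connections, etc.
        tally[e.__class__.__name__] += 1

for status, count in sorted(tally.items(), key=lambda x: -x[1]):
    print status, count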

That was 5,220 bookmarks total, collected over 5 years–which initially seemed low, until I did the math and figured I did 3 bookmarks a day on average. If we lump together all the non-200 OK responses, that amounts to 13% linkrot. At first blush this seems significantly better than the research done by Spinelli in 2003 (thanks Mike), which reported 28% linkrot. I would’ve expected the Spinelli results to be better than my haphazard bookmark collection, since he was sampling academic publications on the Web. But he was also sampling links from the 1995-1999 period, while I had links from 2005-2010. I know this is mere conjecture, but maybe we’re learning to do things better on the web w/ Cool URIs. I’d like to think so at least. Maybe a comparison with some work done by folks at HP and Microsoft would provide some insight.

At the very least this was a good reminder of how important this activity of pouring data from one system into another is to digital preservation: what Greg Janée calls relay-supporting preservation.

Most of all I want to echo the comments of former Yahoo employee Stephen Hood, who wrote recently about the value of this unique collection of bookmarks to the web community. If for some reason Yahoo were to close the doors on Delicious, it would be great if they could donate the public bookmarks to the Web somehow: either to a public institution like the Smithsonian or the Library of Congress (full disclosure: I work at the Library of Congress in a Digital Preservation group), or to an organization dedicated to the preservation of the Web like the International Internet Preservation Consortium, of which LC, other national libraries, and the Internet Archive are members.


dcat:distribution considered helpful

The other day I happened to notice that the folks at data.gov.uk have started using the Data Catalog Vocabulary in the RDFa they have embedded in their dataset webpages. As an example here is the RDF you can pull out of the HTML for the Anonymised MOT tests and results dataset. Of particular interest to me is that the dataset description now includes an explicit link to the actual data being described using the dcat:distribution property.

     <http://data.gov.uk/id/dataset/anonymised_mot_test> dcat:distribution
         <http://www.dft.gov.uk/data/download/10007/DOC>,
         <http://www.dft.gov.uk/data/download/10008/ZIP>,
         <http://www.dft.gov.uk/data/download/10009/GZ>,
         <http://www.dft.gov.uk/data/download/10010/GZ>,
         <http://www.dft.gov.uk/data/download/10011/GZ>,
         <http://www.dft.gov.uk/data/download/10012/GZ>,
         <http://www.dft.gov.uk/data/download/10013/GZ>,
         <http://www.dft.gov.uk/data/download/10014/GZ>,
         <http://www.dft.gov.uk/data/download/10015/GZ>,
         <http://www.dft.gov.uk/data/download/10016/GZ>,
         <http://www.dft.gov.uk/data/download/10017/GZ>,
         <http://www.dft.gov.uk/data/download/10018/GZ>,
         <http://www.dft.gov.uk/data/download/10019/GZ>,
         <http://www.dft.gov.uk/data/download/10020/GZ>,
         <http://www.dft.gov.uk/data/download/10021/GZ>,
         <http://www.dft.gov.uk/data/download/10022/GZ> .

Chris Gutteridge happened to see a Twitter message of mine about this, and asked what consumes this data, and why I thought it was important. So here’s a brief illustration. I reran a little python program I have that crawls all of the data.gov.uk datasets, extracting the RDF using rdflib’s RDFa support (thanks Ivan). Now there are 92,550 triples (up from 35,478 triples almost a year ago).
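If you want to try something similar, the extraction boils down to a few lines of rdflib. This is only a sketch: it assumes your rdflib install has RDFa parsing support (format="rdfa"), and the dcat namespace URI here is the W3C one, which may not match exactly what data.gov.uk emits.

import rdflib

DCAT = rdflib.Namespace("http://www.w3.org/ns/dcat#")

graph = rdflib.Graph()
graph.parse("http://data.gov.uk/dataset/anonymised_mot_test", format="rdfa")
print len(graph), "triples"

# list each dataset and the distributions it points at
for dataset, distribution in graph.subject_objects(DCAT["distribution"]):
    print dataset, "->", distribution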

So what can you do with this metadata about datasets? I am a software developer working in the area where digital preservation meets the web. So I’m interested in not only getting the metadata for these datasets, but also the datasets themselves. It’s important to enable 3rd party, automated access to datasets for a variety of reasons; but the biggest one for me can be summarized with the common-sensical: Lots of Copies Keep Stuff Safe.

It’s kind of a no-brainer: copies are important for digital preservation, for when the unfortunate happens. The subtlety is being able to know where the copies of a particular dataset are (in the enterprise, or in a distributed system like the Web), and having mechanics for relating them together. It’s also important for scholarly communication, so that researchers can cite datasets and follow citations of other research to the actual dataset it is based upon. And lastly, aggregation services that collect datasets for dissemination on a particular platform, like data.gov.uk, need ways to predictably sweep domains for datasets that need to be collected.

Consider this practical example: as someone interested in digital preservation I’d like to be able to know what format types are used within the data.gov.uk collection. Since they have used the dcat:distribution property to point at the referenced dataset, I was able to write a small Python program to crawl the datasets and log the media type and HTTP status code along the way, to generate some results like:

media type datasets
text/html 5898
application/octet-stream 1266
application/vnd.ms-excel 874
text/plain 234
text/csv 220
application/pdf 167
text/xml 81
text/comma-separated-values 51
application/x-zip-compressed 36
application/vnd.ms-powerpoint 33
application/zip 31
application/x-msexcel 28
application/excel 21
application/xml 18
text/x-comma-separated-values 14
application/x-gzip 13
application/x-bittorrent 12
application/octet_stream 12
application/msword 10
application/force-download 10
application/x-vnd.oasis.opendocument.presentation 9
application/x-octet-stream 9
application/vnd.excel 9
application/x-unknown-content-type 6
application/xhtml+xml 6
application/vnd.msexcel 5
application/vnd.google-earth.kml+xml kml 5
application/octetstream 4
application/csv 3
vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 2
application/octet-string 2
image/jpeg 1
image/gif 1
application/x-mspowerpoint 1
application/vnd.google-earth.kml+xml 1
application/powerpoint 1
application/msexcel 1
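The media type logging itself is nothing fancy. Here’s a hypothetical sketch (not the actual program): a HEAD request per distribution URL and a tally of the Content-Type headers that come back.

import urllib2
from collections import defaultdict

# stand-in for the full list of dcat:distribution URLs pulled out of the RDFa
distributions = ["http://www.dft.gov.uk/data/download/10008/ZIP"]

types = defaultdict(int)
for url in distributions:
    request = urllib2.Request(url)
    request.get_method = lambda: "HEAD"          # we only want the headers
    try:
        media_type = urllib2.urlopen(request).info().gettype()
    except urllib2.HTTPError, e:
        media_type = "HTTP error %s" % e.code
    types[media_type] += 1

for media_type, count in sorted(types.items(), key=lambda x: -x[1]):
    print media_type, count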

Granted some of these aren’t too interesting. The predominance of text/html is largely an artifact of using dcat:distribution to link to the splash page for the dataset, not to the dataset itself. This is allowed by the dcat vocabulary … but dcat’s approach kind of assumes that the object of the assertion is suitably typed as a dcat:Download, dcat:Feed or dcat:WebService. I personally think that dcat has some issues that make it a bit more difficult to use than I’d like. But it’s extremely useful that data.gov.uk are kicking the tires on the vocabulary, so that kinks like this can be worked out.

The application/octet-stream media type (and its variants) is also kind of useless for these purposes, since it basically just says the dataset is made of bits. It would be more helpful if the servers in these cases could send something more specific. But it ought to be possible to use something like JHOVE or DROID to do some post-hoc analysis of the bitstream, to figure out just what this data is, whether it is valid, etc.

The nice thing about using the Web to publish these datasets and their descriptions is that this sort of format analysis application could be decoupled from the data.gov.uk web publishing software itself. data.gov.uk becomes a clearinghouse for information and whereabouts of datasets, but a format verification service can be built as an orthogonal application. I think it basically fits the RESTful style of Curation Microservices being promoted by the California Digital Library:

Micro-services are an approach to digital curation based on devolving curation function into a set of independent, but interoperable, services that embody curation values and strategies. Since each of the services is small and self-contained, they are collectively easier to develop, deploy, maintain, and enhance. Equally as important, they are more easily replaced when they have outlived their usefulness. Although the individual services are narrowly scoped, the complex function needed for effective curation emerges from the strategic combination of individual services.

One last thing before you are returned to your regular scheduled programming. You may have noticed that the URI for the dataset being described in the RDF is different from the URL for the HTML view for the resource. For example:

http://data.gov.uk/id/dataset/anonymised_mot_test

instead of:

http://data.gov.uk/dataset/anonymised_mot_test

This is understandable given some of the dictums about Linked Data and trying to separate the Information Resource from the Non-Information Resource. But it would be nice if the URL resolved via a 303 redirect to the HTML as the Cool URIs for the Semantic Web document prescribes. If this is going to be the identifier for the dataset it’s important that it resolves so that people and automated agents can follow their nose to the dataset. I think this highlights some of the difficulties that people typically face when deploying Linked Data.


iogdc ramblings

Yesterday I was at the first day of the International Open Government Data Conference in Washington DC. It was an exciting day, with a great deal of enthusiasm being expressed by luminaries like Tim Berners-Lee, Jim Hendler, Beth Noveck, and Vivek Kundra for enabling participatory democracy by opening up access to government data. Efforts like data.gov, data.gov.uk, data.govt.nz and data.australia.gov.au to aggregate egov datasets from their jurisdictions were well represented, although it would’ve been great to hear more from places like Spain and Sweden, as well as from groups like the Sunlight Foundation and the Open Knowledge Foundation … but there are two more days to go. Here are my reflections from the first day:

Licensing

New Zealand is embracing the use of Creative Commons licenses to release their datasets onto the web. Their NZGOAL project got cabinet approval for using CC licenses in June of this year. They are now doing outreach within government agencies, and building tools to help data owners put these licenses into play, so that data can go out on the web. Where I work at the Library of Congress, the general understanding is that our data is public domain (in the US) … except when it’s not. For example some of the high resolution images in the Prints and Photographs Catalog aren’t available outside the physical buildings of the Library of Congress, due to licensing concerns. So I’m totally envious of New Zealand’s coordinated efforts to iron out these licensing issues.

Centralization/Decentralization

Vivek Kundra and Alan Mallie of data.gov touted the number of datasets that they are federating access to. But it remains unclear exactly how content is federated, and how datasets flow from agencies into data.gov itself. Perhaps some of these details are included in the v1.0 release of the data.gov Concept of Operations (which Kundra announced). An excellent question posed to Berners-Lee and Kundra concerned what roles centralized and distributed approaches play in publishing data. While there is value in one-stop shopping, where you can find data aggregated in one place, Berners-Lee really stressed that the web grew because it was distributed. Aggregated collections of datasets like data.gov need to be able to efficiently pull data from the places where it is collected. We need to use the web effectively to enable this.

Legacy Data

There are tons of datasets waiting to be put on the web. Steve Young of the EPA described a few datasets, such as the Toxics Release Inventory, whose goal is to:

provide communities with information about toxic chemical releases and waste management activities and to support informed decision making at all levels by industry, government, non-governmental organizations, and the public.

This data has been collected for 22 years, ever since the Emergency Planning and Community Right-to-Know Act. Young emphasized how important it is that this data be used in applications, and combined with other datasets. The data is available for download directly from the EPA, and is also available on data.gov. It would’ve been interesting to learn more about the mechanics of how the EPA gets data onto data.gov, and how updates flow.

But a really important question came from Young’s colleague at the EPA (sorry, I didn’t note her name). She asked how the data in their relational databases could be made available on the web. Should they simply dump the database? Or is there something else they could do? Young said that it’s early days, but he hoped that Linked Data might have some answers. The issue came up again later in the day at the Is the Semantic Web Ready Yet panel, where there was a question about how to make Linked Data relevant to folks whose focus is enterprise data. In my opinion Linked Data advocates overemphasize the importance of using RDF and SPARQL (standards) and of converting all the data over, without completely understanding how invasive these solutions are. Not enough is done to show enterprise data folks, who typically think in terms of relational databases, what they can do to put their lovingly crafted and hugged data on the web. Consider a primary key in a database: what does it identify, and what relations does that thing have with other things? Why not use that key in constructing a URL for that thing, and link things together using the URLs? Then other people could use your URLs as well in their own data. I think the drumbeat to use SPARQL and triple stores often misses explaining this fundamental baby step that data owners could take. As Derek Willis said (on the 2nd day, when I’m writing this), people want to use your data, but not your database…people want to browse your data using their web browser. Assigning URLs to the important stuff in your databases is the first important step to take with Linked Data.
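To make that baby step concrete, here’s a toy sketch (the table, URL patterns and data are all invented) of what it looks like to give database rows URLs and to express a foreign key as a link between two of them:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE facility (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emission (id INTEGER PRIMARY KEY, chemical TEXT,
                           facility_id INTEGER REFERENCES facility(id));
    INSERT INTO facility VALUES (1234, 'Example Chemical Plant');
    INSERT INTO emission VALUES (1, 'toluene', 1234);
""")

# the primary key becomes part of a stable URL for the thing the row describes
facility_url = "http://example.gov/facility/%s"
emission_url = "http://example.gov/emission/%s"

# and the foreign key becomes a link between two URLs that anyone can reuse
for eid, chemical, fid in db.execute("SELECT id, chemical, facility_id FROM emission"):
    print "%s (%s) was released at %s" % (emission_url % eid, chemical, facility_url % fid)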

Community

Robert Schaefer of the Applied Physics Lab at Johns Hopkins University pointed out that enabling virtual communities around our data is an essential part of making data available and usable. In my opinion this is the true potential of platform/data-aggregator sites like data.gov…they can allow users of government datasets to share what they have done, and learn from each other. Efforts like Civic Commons also promise to be places where this collaboration can take place. The communities may be born inside or outside of government, but they inevitably must include both. The W3C eGov effort might also be a good place to collaborate on standards.


me and my homepage

Thanks for all the positive feedback to my last post about using URLs to identify resources. Tom Heath (one of the founding fathers of the Linked Data meme/pattern) suggested that discussions about this topic are harmful, so of course I have to continue the conversation … even if I don’t have much new to say. Hell, I just wanted an excuse to re-publish another one of Paul Downey’s lovely REST Tarot Cards that he is doing for NaNoDrawMo 2010, and get some more hits on my backwater blog :-)

Anyhow, so Tom said:

Joe Developer … has to take a bit of responsibility for writing sensible RDF statements. Unfortunately, people like Ed seeming to conflate himself and his homepage (and his router and its admin console) don’t help with the general level of understanding. I’ve tried many times to explain to someone that I am not my homepage, and as far as I know I’ve never failed. In all this frantic debate about the 303 mechanism, let’s not abandon certain basic principles that just make sense.

I’m glad Tom is able to explain this stuff about Information Resources better than me. I think I was probably one of the people he explained it to at some point. I understand the principles that Tom and other Linked Data advocates are promulgating well enough to throw together some Linked Data implementations at the Library of Congress, such as the Library of Congress Subject Headings and Chronicling America which put millions of resources online using the principles that got documented in Cool URIs for the Semantic Web.

How do you know if someone understood something you said? Normally by what they do in response to what you say, right? The rapid growth of the Linked Data cloud is a testament to the Linked Data folks’ ability to effectively communicate with software developers. No question. But let’s face it, the principles of web architecture have seen way more adoption, right? The successes that Linked Data has enjoyed so far have been a result of grounding the Semantic Web vision in the mechanics of the web we have now. And my essential point is that they didn’t go far enough in making it easier.

So, yeah…I’m not my homepage. As someone would’ve said to me in grade school: “No shit Sherlock” :-) Although, our blogs sure seem to be having a friendly argument with each other at the moment (thanks Keith). What is a homepage anyhow? The Oxford English Dictionary defines a homepage as:

A document created in a hypertext system (esp. on the World Wide Web) which serves either as an introductory page for a visitor to a web site, or as a focus of information on a particular topic, and which usually contains hypertext links to related documents and other web sites.

So my homepage is a hypertext document with a particular focus, in this case the person Ed Summers. If you are at your desk, and fire up your browser, and type in the URL for my homepage you would get an HTML document. If you were on the train, and typed in the same URL into the browser on your mobile device you might get a very different HTML document optimized for rendering on a smaller screen. This is how the web was designed to work, albeit a bit ex post facto (which is why it is awesome). A URL identifies a Resource, a Resource can be anything, when you request that Resource using HTTP you get back a Representation of the current state of the Resource. The Representation that you get back is determined by the way it was requested: in this case the User-Agent of the browser determined what HTML I got back.

It’s very easy to look down over your bi-focals and say things like “surely Ed realizes he is not his homepage”. But if we’re going to go there, it kind of begs the question: what is a homepage … and who am I? Identity is hard. Tom should be pretty familiar with how hard identity is, since his instructions on using owl:sameAs to link resources together prove to be a bit harder to follow in practice than in theory.

But let’s not go there. Who really wants to think about stuff like that when you are building a webapp that makes reusable machine readable data available?

My contention is that this whole line of discussion is largely academic, and gets in the way of actually putting resource descriptions out on the web. The reality is that people can and do use http://inkdroid.org/ as an identifier for me, Ed Summers. It is natural, and effortless, and doesn’t require someone with a PhD in Knowledge Representation to understand it. If I want to publish some RDF that says:

<http://inkdroid.org/> a foaf:Person .

I can do that. It’s my website, and I decide what the resources on it are. If someone puts that URL into their browser and gets some HTML that’s cool. If someone’s computer program likes RDF and gets back some “application/rdf+xml”, all the better. If a script wants to nibble on some JSON, sure here’s some “application/json” for ya. If someone wants to publish RDF about me, and use http://inkdroid.org/ as the identifier to hang their descriptions off of, I say, go right ahead. It’s an Open Web still right (oh please say it still is).
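In code that’s nothing more exotic than varying the Accept header on the same URL; a small sketch (whether a given server actually varies its response this way is, of course, up to the server):

import urllib2

def content_type(url, accept):
    # ask for a particular representation and report what the server sent back
    request = urllib2.Request(url, headers={"Accept": accept})
    return urllib2.urlopen(request).info().gettype()

for accept in ("text/html", "application/rdf+xml", "application/json"):
    print accept, "->", content_type("http://inkdroid.org/", accept)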

And best of all, if someone wants different URLs for themselves and their homepage, that’s fine too. The Linked Data we have deployed by following the rules to the best of our ability is still legit. It’s all good. I don’t mind following rules, but ultimately they have to make sense…to me. And this website is all about me, right? :-)


routers, webcams and thermometers

If you have a local wi-fi network at home you probably use something like a Linksys wireless router to let your laptop and other devices connect to the Internet. When you bought it and plugged it in you probably followed the instructions and typed “http://192.168.1.1/” into your web browser and visited a page to configure the router: setting its name, admin password, etc.

Would you agree that this router sitting on top of your TV, or wherever it is, is a real world thing? It’s not some abstract concept of a router: you can pick it up, turn it off and on, take it apart and try to put it back together again. And the router is identified with a URL: http://192.168.1.1. When your web browser resolves the URL for your router it gets back some HTML, that lets you see the router’s current state, and make modifications to it. You don’t get the router itself. That would be silly right?

In terms of REST, the router is a Resource that has a URL Identifier, which when resolved returns an HTML Representation of the Resource. But you don’t really have to think about it much at all, because it’s intuitively part of how you use the web every day.

In fact the Internet is strewn with online devices that have embedded web servers in them. A 5 year old BoingBoing article More Googleable Unsecured Webcams shows how you can drop a web search for inurl:“view/index.shtml” into Google, and get back thousands of webcams from around the world. You can zoom and pan these cameras using your web browser. These are URLs for real world cameras. When you put the URL in your browser you don’t get the camera itself, that’s crazy talk; instead you get some HTML describing the camera’s current state, and some form controls for changing its position. Again all is well in the REST world, where the camera is the Resource identified with a URL, and your browser receives a Representation of the Resource.

If you are an Arduino hacker you might follow some instructions to build an online thermometer. You wire up the temperature sensor, and configure the Arduino to listen for HTTP requests at a particular IP address. You can then visit a URL in your web browser, and the server returns a Representation of the current temperature. It doesn’t return the Arduino board, the thermometer, or the thermodynamic state of its environment…that’s crazy talk. It returns a Representation of the temperature.
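If you want to play along at home without the soldering, here’s a toy version of the idea: a tiny web server whose one resource returns a representation of the current temperature (read_sensor() is just a stand-in for real hardware).

import random
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

def read_sensor():
    # pretend thermometer; swap in your real sensor reading here
    return 21.5 + random.random()

class Thermometer(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        # a representation of the temperature, not the thermometer itself
        self.wfile.write("%.1f C\n" % read_sensor())

HTTPServer(("", 8000), Thermometer).serve_forever()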

So imagine I want to give myself a URL, say http://inkdroid.org. Is this so different than the camera, the router and the thermometer? Sure, I don’t have a web server embedded in me. But even if I did, nobody would expect it to return me, would they? Just as in the other cases, people would expect a Representation of me to be returned. Heck, there are millions of OpenID URLs deployed for people already. But this argument is used time and time again in the Semantic Web, Linked Data community to justify the need for elaborate, byzantine, hard to explain HTTP behavior when making RDF descriptions of real world things available. The pattern has been best described in the Cool URIs for the Semantic Web W3C Note. I understand it. But if you’ve ever had to explain it to a web developer not already brainwashed^w familiar with the pattern you will understand that it is hard to explain convincingly. It’s even harder to implement correctly, since you are constantly asking yourself nonsensical questions like “is this an Information Resource?” when you are building your application.

I was pleased to see Ian Davis’ recent, well articulated posts about whether the complicated HTTP behavior is essential for deploying Linked Data. I know I am biased, because I was introduced to much of the Semantic Web and Linked Data world when Ian Davis and Sam Tunnicliffe visited the Library of Congress three years ago. I agree with Ian’s position: the current situation with the 303 redirect is potentially wasteful, error prone and bordering on the absurd…and the Linked Data community could do a lot to make it easier to deploy Linked Data. At its core, Ian’s advice in Guide to Publishing Linked Data Without Redirects does a nice job of making Linked Data publishing seem familiar to folks who have used HTTP’s content-negotiation features to enable internationalization, or to build RESTful web services. A URL for a resource that has a more constrained set of representations allows for Agent Driven Negotiation in situations where custom tuning the Accept header in the client isn’t convenient or practical. Providing a pattern for linking these resources together, with something like wdrs:describedby and/or the describedby relation that’s now available in RFC 5988, is helpful for people building REST APIs and Linked Data applications.

At the end of the day, it would be useful if the W3C could de-emphasize httpRange-14, simplify the Architecture of the World Wide Web (by removing the notion of Information Resources), and pave the cowpaths we already are seeing for Real World Objects on the Web. It would be great to have a W3C document that guided people on how to put URIs for things on the web, that fit with how people are already doing it, and made intuitive sense. We’re already used to things like our routers, cameras and thermometers being on the web, and my guess is we’re going to see much, much more of it in the coming years. I don’t think a move like this would invalidate documents like Cool URIs for the Semantic Web, or make the existing Linked Data that is out there somehow wrong. It would simply lower the bar for people who want to publish Linked Data, who don’t necessarily want to go through the process of using URIs to distinguish non-Information Resources from Information Resources.

If the W3C doesn’t have the stomach for it, I imagine we will see the IETF lead the way, or for innovation to happen elsewhere as with HTML5.


Linked Library Data at the Deutsche Nationalbibliothek

Just last week Lars Svensson from the Deutsche Nationalbibliothek (the German National Library, aka DNB) made a big announcement that they have released their authority data as Linked Data for the world to use. What this means is that there are now unique URLs (and machine readable data at the other end of them) for the records in their authority files.

The full dataset that the DNB has made available for download amounts to 38,849,113 individual statements (aka triples). Linked Data enthusiasts who are used to thinking in terms of billions of triples might not even blink when seeing these numbers. But it is important to remember that these data assets have been curated by a network of German, Austrian and Swiss libraries for close to a hundred years, as they documented (and continue to document) all known German-language publications.

The simple act of making each of these authority records URL addressable means that they can now meaningfully participate in the global information space some call the Web of Data. It’s true, the records were available as part of the DNB’s Online Catalog before they were released as Linked Data. What’s new is that the DNB has committed to using persistent URLs to identify these records, using a new host name d-nb.info in combination with their own record identifiers. This means that people can persistently link to these DNB resources in their own web applications and data. Another subtle thing, and really the heart of what the Linked Data pattern offers, is the ability to use the same URL to retrieve the record as structured metadata. The important thing about having machine readable data is that it allows other applications to easily re-purpose the information, much like libraries have done traditionally by shipping around batches of Machine Readable Cataloging (MARC) records. Here’s a practical example:

The URL http://d-nb.info/gnd/119053071 identifies the author Herta Müller, who won the Nobel Prize for Literature in 2009. If you load that URL in your web browser by clicking on it, you should see a web page (HTML) for the authority record describing Herta Müller. But if a web client requests that same URL asking for RDF, it will (via a redirect) get the same authority record as RDF. RDF is more a data model than a particular file format, so it has a variety of serializations … The server at d-nb.info returns RDF/XML, and they have made their data dumps available in N-Triples…but I’m kind of fond of the Turtle serialization, which is kind of JSON-ish and makes the RDF a bit more readable. Here is the beginning of the RDF (as Turtle) for Herta Müller that the DNB makes available:

@prefix gnd: <http://d-nb.info/gnd/> .
@prefix rdaGr2: <http://RDVocab.info/ElementsGr2/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
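You can watch the follow-your-nose mechanics from a few lines of Python; this is just a sketch, and assumes rdflib is installed and that d-nb.info still redirects RDF-aware clients the way it did when this was written:

import rdflib

graph = rdflib.Graph()
# rdflib asks for RDF, follows the redirect, and parses the RDF/XML it gets back
# (assumes rdflib sends an RDF Accept header; otherwise set one yourself with urllib2)
graph.parse("http://d-nb.info/gnd/119053071", format="xml")
print graph.serialize(format="turtle")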


triadomany

I fully admit that there is a not uncommon craze for trichotomies. I do not know but the psychiatrists have provided a name for it. If not, they should … it might be called triadomany. I am not so afflicted; but I find myself obliged, for truth’s sake, to make such a large number of trichotomies that I could not [but] wonder if my readers, especially those of them who are in the way of knowing how common the malady is, should suspect, or even opine, that I am a victim of it … I have no marked predilection for trichotomies in general.

Charles S. Peirce quoted in The Sign of Three, edited by Umberto Eco and Thomas A. Sebeok.



It’s hard not to read a bit of humor and irony into this quote from Peirce. My friend Dan Chudnov observed once that all this business with RDF and Linked Data often seems like fetishism. RDF colored glasses are kind of hard to take off when you are a web developer and have invested a bit of time in understanding the Semantic Web / Linked Data vision. I seem to go through phases of interest with the triples: ebbs and flows. Somehow it’s comforting to read of Peirce’s predilections for triples at the remove of about a century.

Seeing the Linked Open Data Cloud for the first time was a revelation of sorts. It helped me understand concretely how the Web could be used to assemble a distributed, collaborative database. That same diagram is currently being updated to include new datasets. But a lot of Linked Data has been deployed since then … and a lot of it has been collected as part of the annual Billion Triple Challenge.

It has always been a bit mysterious to me how nodes get into the LOD Cloud, so I wondered how easy it would be to create a cloud from the 2010 Billion Triple Challenge dataset. It turns out that with a bit of unix pipelining and the nice ProtoVis library it’s not too hard to get something “working”. It sure is nice to work in an environment with helpful folks who can set aside a bit of storage and compute time for experimenting like this, without having to bog down my laptop for a long time.
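The tallying end of that pipeline might look something like this (a rough sketch, not the actual code): treat any quad whose subject and object sit on different hosts as an edge between the two hosts, and count the edges.

import fileinput
import re
from collections import defaultdict
from urlparse import urlparse

uri = re.compile(r'<([^>]+)>')
edges = defaultdict(int)

for line in fileinput.input():               # pipe the N-Quads in on stdin
    uris = uri.findall(line)
    if len(uris) != 4:                       # only keep quads whose object is a URI
        continue
    subject_host = urlparse(uris[0]).netloc
    object_host = urlparse(uris[2]).netloc
    if subject_host and object_host and subject_host != object_host:
        edges[(subject_host, object_host)] += 1

for (source, target), count in sorted(edges.items(), key=lambda x: -x[1])[:25]:
    print source, "->", target, count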

If you click on the image you should be taken to the visualization. It’s kind of heavy on JavaScript processing, so a browser like Chrome will probably render it best.

But as Paul Downey pointed out to me on Twitter:



Paul is so right. I find myself drawn to these graph visualizations for magical reasons. I can console myself that I did manage to find a new linked data supernode that I didn’t know about before: bibsonomy–which doesn’t appear to be in the latest curated view of the Linked Open Data Cloud. And I did have a bit of fun making the underlying data available as rdf/xml and Turtle using the Vocabulary of Interlinked Datasets (VoID). And I generated a similar visualization for the 2009 data. But it does feel a bit navel-gazy, so a sense of humor about the enterprise is often a good tonic. I guess this is the whole point of the Challenge, to get something generally useful (and not navel-gazy) out of the sea of triples.

Oh and Sign of Three is an excellent read so far :-)


edu, gov and tlds in en.wikipedia external links

Some folks over at Wikipedia Signpost asked if they could use some of the barcharts I’ve been posting here recently. They needed the graphs to be released with a free license, which was a good excuse to slap a Creative Commons Attribution 3.0 license on all the content here at inkdroid. I’m kinda ashamed I didn’t think of doing this before…

I was also asked how easy it would be to generate the .gov and .edu graphs, as well as the top level domains. I already had the hostnames exported, so it was just a few greps, sorts and uniqs away. I’ve included the graphs and the full data files below. My friend Dan Chudnov suggested that plotting this data on a logarithmic scale would probably look better. I think he’s probably right, but I just haven’t gotten around to doing that yet. It’s definitely something I will keep in mind for an app that allows this slicing and dicing of the Wikipedia external links data.
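If greps, sorts and uniqs aren’t your thing, the same slicing can be done in a few lines of Python (a sketch; the input filename is made up, and it assumes one hostname per external link per line):

from collections import defaultdict

counts = defaultdict(int)
for line in open("hosts.txt"):
    host = line.strip()
    if host.endswith(".edu"):       # or ".gov", or split off the tld instead
        counts[host] += 1

for host, count in sorted(counts.items(), key=lambda x: -x[1])[:100]:
    print count, host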

top 100 .edu hosts in en.wikipedia external links


download

top 100 .gov hosts in en.wikipedia external links


download

top 100 top level domains in en.wikipedia external links


download