a bit about scruffiness


Terry Winograd at CHI 2006 (by boltron)

I just finished reading Understanding Computers and Cognition by Terry Winograd and Fernando Flores and have to jot down some quick notes & quotes before I jump in and start reading it again … yeah, it’s that good.

Having gone on Rorty and Wittgenstein kicks recently, I was really happy to find this book while browsing the Neats vs Scruffies Wikipedia article a few months ago. It seems to combine this somewhat odd interest I have in pragmatism and writing software. While it was first published in 1986, it’s still very relevant today, especially in light of what the more Semantic Web heavy Linked Data crowd are trying to do with ontologies and rules. Plus it’s written in clear and accessible language, which is perfect for the arm-chair compsci/philosopher-type … so it’s ideal for a dilettante like me.

While the book is only 207 pages long, its breadth is kind of astounding. The philosophy of Heidegger figures prominently … in particular his ideas about thrownness, breakdowns and readiness-to-hand, which emphasize the importance of concernful activity over rationalist representations of knowledge.

Heidegger insists that it is meaningless to talk about the existence of objects and their properties in the absence of concernful activity, with its potential for breaking down.

The work of the biologist Humberto Maturana forms the second part of the theoretical foundation of the book. The authors draw upon Maturana’s ideas about structural coupling to emphasize the point that:

The most successful designs are not those that try to fully model the domain in which they operate, but those that are ‘in alignment’ with the fundamental structure of that domain, and that allow for modification and evolution to generate new structural coupling.

And the third leg of the stool is John Searle’s notion of speech acts, which emphasizes the role of commitment and action, the social role of language, in dealing with meaning.

Words correspond to our intuition about “reality” because our purposes in using them are closely aligned with our physical existence in a world and our actions within it. But the coincidence is the result of our use of language within a tradition … our structural coupling within a consensual domain. Language and cognition are fundamentally social … our ability to think and to give meaning to language is rooted in our participation in a society and a tradition.


Fernando Flores (by Sebastián Piñera)

So the really wonderful thing that this book does here is take this theoretical framework (Heidegger, Maturana & Searle) and apply it to the design of computer software. As the preface makes clear, the authors largely wrote this book to dismantle popular (at the time) notions that computers would “think” like humans. While much of this seems anachronistic today, we still see similar thinking in some of the ways that the Semantic Web is described, where machines will understand the semantics of data, using ontologies that model the “real world”.

There is still a fair bit of talk about getting the ontologies just right so that they model the world properly, and then running rule-driven inference engines over the instance data to “learn” more things. But what is often missing is a firm idea of what actual tools will use this new data. How will these tools be used by people acting in a particular domain? Like The Modeler, practitioners in the Linked Data and Semantic Web community often jump to modeling a domain, and trying to get it to match “reality”, before understanding the field of activity we want to support … what we are trying to have the computer help us do … what new conversations we want the computer to enable with other people.

In creating tools we are designing new conversations and connections. When a change is made, the most significant innovation is the modification of the conversation structure, not the mechanical means by which the conversation is carried out. In making such changes we alter the overall pattern of conversation, introducing new possibilities or better anticipating breakdowns in the previously existing ones … When we are aware of the real impact of design we can more consciously design conversation structures that work.

It’s important to note here that these are conversations between people, who are acting in some domain, and using computers as tools. It’s the social activity that grounds the computer software, and not some correspondence that the software shares with reality or truth. I guess this is a subtle point, and I’m not doing a terrific job of elucidating it here, but if your interest is piqued definitely pick up a copy of the book. Over the past 5 years I’ve been lucky to work with several people who intuitively understand how important the social setting and alignment are to successful software development–but it’s nice to have the theoretical tools as ballast when the weather gets rough.

Another really surprising part of the book (given that it was written in 1986) is the foreshadowing of the agile school of programming:

… the development of any computer-based system will have to proceed in a cycle from design to experience and back again. It is impossible to anticipate all of the relevant breakdowns and their domains. They emerge gradually in practice. System development methodologies need to take this as a fundamental condition of generating the relevant domains, and to facilitate it through techniques such as building prototypes early in the design process and applying them in situations as close as possible to those in which they will eventually be used.

Compare that with the notion of iterative development that’s now prevalent in software development circles. I guess it shouldn’t be that surprising, since the roots of iterative development extend back quite a ways. But still, it was pretty eerie seeing how on target Winograd and Flores remain, particularly in a field like computing that has changed so rapidly in the last 25 years.

update: Kendall Clark has an interesting post that addresses some of the concerns about semantic web technologies.
update: Ryan Shaw recommended some more reading material in this vein.


DOIs as Linked Data

Last week Ross Singer alerted me to some pretty big news for folks interested in Library Linked Data: CrossRef has made the metadata for 46 million Digital Object Identifiers (DOI) available as Linked Data. DOIs are heavily used in the publishing space to uniquely identify electronic documents (largely scholarly journal articles). CrossRef is a consortium of roughly 3,000 publishers, and is a big player in the academic publishing marketplace.

So practically what this means is that in all the places in the scholarly publishing ecosystem where DOIs are present (caveat below), it’s now possible to use the Web to retrieve the metadata associated with that electronic document. Say you’ve got a DOI in the database backing your institutional repository:

doi:10.1038/171737a0

you can use the DOI to construct a URL:

http://dx.doi.org/10.1038/171737a0

and then do an HTTP GET (what your Web browser is doing all the time as you wander around the Web) to ask for metadata about that document:

curl --location --header "Accept: text/turtle" http://dx.doi.org/10.1038/171737a0

At which point you will get back some Turtle flavored RDF that looks like:

<http://dx.doi.org/10.1038/171737a0>
    a <http://purl.org/ontology/bibo/Article> ;
    <http://purl.org/dc/terms/title> "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" ;
    <http://purl.org/dc/terms/creator> <http://id.crossref.org/contributor/f-h-c-crick-367n8iqsynab1>,
        <http://id.crossref.org/contributor/j-d-watson-367n8iqsynab1> ;
    <http://prismstandard.org/namespaces/basic/2.1/doi> "10.1038/171737a0" ;
    <http://prismstandard.org/namespaces/basic/2.1/endingPage> "738" ;
    <http://prismstandard.org/namespaces/basic/2.1/startingPage> "737" ;
    <http://prismstandard.org/namespaces/basic/2.1/volume> "171" ;
    <http://purl.org/dc/terms/date> "1953-04-25Z"^^<http://www.w3.org/2001/XMLSchema#date> ;
    <http://purl.org/dc/terms/identifier> "10.1038/171737a0" ;
    <http://purl.org/dc/terms/isPartOf> <http://id.crossref.org/issn/0028-0836> ;
    <http://purl.org/dc/terms/publisher> "Nature Publishing Group" ;
    <http://purl.org/ontology/bibo/doi> "10.1038/171737a0" ;
    <http://purl.org/ontology/bibo/pageEnd> "738" ;
    <http://purl.org/ontology/bibo/pageStart> "737" ;
    <http://purl.org/ontology/bibo/volume> "171" ;
    <http://www.w3.org/2002/07/owl#sameAs> <doi:10.1038/171737a0>, <info:doi/10.1038/171737a0> .
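If you’d rather do that from code than from curl, here’s a minimal sketch in Python using requests and rdflib (my choice of libraries; nothing CrossRef prescribes):

import requests
import rdflib

doi = "10.1038/171737a0"
url = "http://dx.doi.org/" + doi

# ask for Turtle; requests follows the redirect and keeps the Accept header
response = requests.get(url, headers={"Accept": "text/turtle"})

graph = rdflib.Graph()
graph.parse(data=response.text, format="turtle")

# pull out whatever dcterms:title assertions came back
title = rdflib.URIRef("http://purl.org/dc/terms/title")
for subject, predicate, obj in graph.triples((None, title, None)):
    print(obj)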


snac hacks

A few months ago Brian Tingle posted some exciting news that the Social Networks and Archival Context (SNAC) project was releasing the data that sits behind their initial prototype:

As a part of our work on the Social Networks and Archival Context Project, the SNAC team is pleased to release more early results of our ongoing research.

A property graph of correspondedWith and associatedWith relationships between corporate, personal, and family identities is made available under the Open Data Commons Attribution License in the form of a graphML file. The graph expresses 245,367 relationships between 124,152 named entities.

The graphML file, as well as the scripts to create and load a graph database from EAC or graphML, are available on Google Code.

We are still researching how to map from the property graph model to RDF, but this graph processing stack will likely power the interactive visualization of the historical social networks we are developing.

The SNAC project has aggregated archival finding aid data for manuscript collections at the Library of Congress, Northwest Digital Archives, Online Archive of California and Virginia Heritage. They then used authority control data from NACO/LCNAF, the Getty Union List of Artist Names Online (ULAN) and VIAF to knit these archival finding aids together using Encoded Archival Context – Corporate Bodies, Persons, and Families (EAC-CPF).

I wrote about SNAC here about 9 months ago, and how much potential there is in the idea of visualizing archival collections across institutions, along the axis of identity. I had also privately encouraged Brian to look into releasing some portion of the data that is driving their prototype. So when Brian delivered I felt some obligation to look at the data and try to do something with it. Since Brian indicated that the project was interested in an RDF serialization, and Mark had pointed me at Aaron Rubenstein’s arch vocabulary, I decided to take a stab at converting the GraphML data to some Arch flavored RDF.

So I forked Brian’s mercurial repository, and wrote a script that parses the GraphML XML that Brian provided, and writes RDF (using arch:correspondedWith, arch:primaryProvenanceOf, arch:appearsWith) to a local triple store using rdflib. Since RDF has URLs cooked in pretty deep, part of this conversion involved reverse-engineering the SNAC URLs in the prototype, which wasn’t terribly clean, but it seemed good enough for demonstration purposes.
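The conversion boils down to something like the sketch below. This isn’t the exact code that’s in the repository; the GraphML data keys and the SNAC URL pattern here are illustrative stand-ins for the real ones:

import xml.etree.ElementTree as ET
import rdflib

GRAPHML = "{http://graphml.graphdrawing.org/xmlns}"
ARCH = rdflib.Namespace("http://purl.org/archival/vocab/arch#")

graph = rdflib.Graph()
graph.bind("arch", ARCH)

tree = ET.parse("snac.graphml")  # hypothetical filename

# map GraphML node ids to (reverse-engineered) SNAC prototype URLs;
# the real script derives these from data elements on each node
nodes = {}
for node in tree.iter(GRAPHML + "node"):
    nodes[node.get("id")] = rdflib.URIRef(
        "http://socialarchive.iath.virginia.edu/xtf/view?docId=" + node.get("id"))

# turn each edge into an arch assertion based on its relationship type,
# assuming a <data> child carries the relationship name
for edge in tree.iter(GRAPHML + "edge"):
    source = nodes[edge.get("source")]
    target = nodes[edge.get("target")]
    rel = edge.findtext(GRAPHML + "data", default="associatedWith")
    if rel == "correspondedWith":
        graph.add((source, ARCH.correspondedWith, target))
    else:
        graph.add((source, ARCH.appearsWith, target))

graph.serialize("snac.ttl", format="turtle")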

Once I had those triples (877,595 of them) I learned from Cory Harper that the SNAC folks had matched up the archival identities with entries in the Virtual International Authority File. The VIAF URLs aren’t present in their GraphML data (GraphML is not as expressive as RDF), but they are available in the prototype HTML, which I had URLs for. So, again, in the name of demonstration and not purity, I wrote a little scraper that would use the reverse-engineered SNAC URL to pull down the VIAF id. I tried to be respectful and not do this scraping in parallel, and to sleep a bit between requests. A few days of running and I had 40,237 owl:sameAs assertions that linked the SNAC URLs with the VIAF URLs.
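The scraper itself was nothing fancy, something along these lines (the file names are made up for illustration):

import re
import time
import urllib.request

# one reverse-engineered SNAC prototype URL per line
snac_urls = [line.strip() for line in open("snac_urls.txt")]

viaf_pattern = re.compile(r"http://viaf\.org/viaf/\d+")

with open("sameas.nt", "w") as out:
    for snac_url in snac_urls:
        html = urllib.request.urlopen(snac_url).read().decode("utf-8")
        match = viaf_pattern.search(html)
        if match:
            out.write("<%s> <http://www.w3.org/2002/07/owl#sameAs> <%s> .\n"
                      % (snac_url, match.group(0)))
        time.sleep(1)  # be polite: one request at a time, with a pause between them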

With the VIAF URLs in hand I thought it would be useful to have a graph of only the VIAF related resources. It seemed like a VIAF centered graph of archival information could demonstrate something we’ve been talking about some in the Library Linked Data W3C Incubator Group: that Linked Data actually provides a technology that lets the archival and bibliographic description communities cross-pollinate and share. This is the real insight of the SNAC effort, that these communities have a lot in common, in that they both deal with people, places, organizations, etc. So I wrote another little script that created a named graph within the larger triple store, and used the owl:sameAs assertions to do some brute force inferencing, to generate triples relating VIAF resources with Arch.
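That last step looks roughly like this with rdflib; the file names and the named graph URI are placeholders:

import rdflib
from rdflib.namespace import OWL

store = rdflib.ConjunctiveGraph()
store.parse("snac.ttl", format="turtle")  # the Arch triples keyed on SNAC URLs
store.parse("sameas.nt", format="nt")     # the owl:sameAs links to VIAF

# a named graph within the same store for the VIAF-centered copy
viaf_graph = store.get_context(rdflib.URIRef("http://example.org/graph/snac-viaf"))

# build a SNAC URL -> VIAF URL lookup from the sameAs assertions
same_as = dict(store.subject_objects(OWL.sameAs))

# "brute force inferencing": restate every triple with VIAF URLs swapped in
new_triples = []
for s, p, o in store.triples((None, None, None)):
    s2 = same_as.get(s, s)
    o2 = same_as.get(o, o)
    if s2 != s or o2 != o:
        new_triples.append((s2, p, o2))

for triple in new_triples:
    viaf_graph.add(triple)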

I realize that Turtle probably isn’t the most compelling example of the result, but in the absence of an app that uses it (maybe more on that forthcoming), it’ll have to do for now. So here are the assertions for Vannevar Bush, for the Linked Data fetishists out there:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix arch: <http://purl.org/archival/vocab/arch#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://viaf.org/viaf/15572358/#foaf:Person>
    a foaf:Person ;
    foaf:name "Bush, Vannevar, 1890-1974." ;
    arch:appearsWith <http://viaf.org/viaf/21341544/#foaf:Person>, 
        <http://viaf.org/viaf/30867998/#foaf:Person>, 
        <http://viaf.org/viaf/5076979/#foaf:Person>, 
        <http://viaf.org/viaf/6653121/#foaf:Person>, 
        <http://viaf.org/viaf/79397853/#foaf:Person> ;
    arch:correspondedWith <http://viaf.org/viaf/13632081/#foaf:Person>,
        <http://viaf.org/viaf/16555764/#foaf:Person>, 
        <http://viaf.org/viaf/18495018/#foaf:Person>, 
        <http://viaf.org/viaf/20482758/#foaf:Person>, 
        <http://viaf.org/viaf/20994992/#foaf:Person>, 
        <http://viaf.org/viaf/32065073/#foaf:Person>, 
        <http://viaf.org/viaf/41170431/#foaf:Person>, 
        <http://viaf.org/viaf/44376228/#foaf:Person>, 
        <http://viaf.org/viaf/46092803/#foaf:Person>, 
        <http://viaf.org/viaf/49966637/#foaf:Person>, 
        <http://viaf.org/viaf/51816245/#foaf:Person>, 
        <http://viaf.org/viaf/52483290/#foaf:Person>, 
        <http://viaf.org/viaf/54269107/#foaf:Person>, 
        <http://viaf.org/viaf/54947702/#foaf:Person>, 
        <http://viaf.org/viaf/56705976/#foaf:Person>, 
        <http://viaf.org/viaf/63110426/#foaf:Person>, 
        <http://viaf.org/viaf/64014369/#foaf:Person>, 
        <http://viaf.org/viaf/64087560/#foaf:Person>, 
        <http://viaf.org/viaf/6724310/#foaf:Person>, 
        <http://viaf.org/viaf/71767943/#foaf:Person>, 
        <http://viaf.org/viaf/75645270/#foaf:Person>, 
        <http://viaf.org/viaf/76361317/#foaf:Person>, 
        <http://viaf.org/viaf/77126996/#foaf:Person>, 
        <http://viaf.org/viaf/77903683/#foaf:Person>, 
        <http://viaf.org/viaf/8664807/#foaf:Person>, 
        <http://viaf.org/viaf/92419478/#foaf:Person> ;
    arch:primaryProvenanceOf <http://hdl.loc.gov/loc.mss/eadmss.ms001043>, 
        <http://hdl.loc.gov/loc.mss/eadmss.ms007098>, 
        <http://hdl.loc.gov/loc.mss/eadmss.ms010024>,
        <http://hdl.loc.gov/loc.mss/eadmss.ms998004>, 
        <http://hdl.loc.gov/loc.mss/eadmss.ms998007>, 
        <http://hdl.loc.gov/loc.mss/eadmss.ms998009>, 
        <http://nwda-db.wsulibs.wsu.edu/findaid/ark:/80444/xv42415>,
        <http://www.oac.cdlib.org/findaid/ark:/13030/kt5b69p6zq>, 
        <http://www.oac.cdlib.org/findaid/ark:/13030/kt8w1014rz> ;
    owl:sameAs <http://socialarchive.iath.virginia.edu/xtf/view?docId=Bush+Vannevar+1890-1974-cr.xml> .

I’ve made a full dump of the data I created available if you are interested in taking a look. The nice thing is that the URIs are already published on the web, so I didn’t need to mint any identifiers myself to publish this Linked Data. Although I kind of played fast and loose with the SNAC URIs for people since they don’t do the httpRange-14 dance. It’s interesting that it doesn’t seem to have immediately broken anything. It would be nice if the SNAC Prototype URIs were a bit cleaner I guess. Perhaps they could use some kind of identifier instead of baking the heading into the URL?

So maybe I’ll have some time to build a simple app on top of this data. But hopefully I’ve at least communicated how useful it could be for the cultural heritage community to share web identifiers for people, and use them in their data. RDF also proved to be a nice malleable data model for expressing the relationships, and serializing them so that others could download them. Here’s to the emerging (hopefully) Giant Global GLAM Graph!


geeks bearing gifts


Trojan Horse in Stuttgart by Stefan Kühn

I recently received some correspondence about the EZID identifier service from the California Digital Library. EZID is a relatively new service that aims to help cultural heritage institutions manage their identifiers. Or as the EZID website says:

EZID (easy-eye-dee) is a service that makes it simple for digital object producers (researchers and others) to obtain and manage long-term identifiers for their digital content. EZID has these core functions:

  • Create a persistent identifier: DOI or ARK
  • Add object location
  • Add metadata
  • Update object location
  • Update object metadata

I have some serious concerns about a group of cultural institutions relying on a single service like EZID for managing their identifier namespaces. It seems too much like a single point of failure. I wonder, are there any plans to make the software available, and to allow multiple EZID servers to operate as peers?

URLs are a globally deployed identifier scheme that depends upon HTTP and DNS. These technologies have software implementations in many different computer languages, for diverse operating systems. Bugs and vulnerabilities associated with these software libraries are routinely discovered and fixed, often because the software itself is available as open source, and there are “many eyes” looking at the source code. Best of all, you can put a URL into your web browser (and browsers are now ubiquitous) and view a document that is about the identified resource.

In my humble opinion, cultural heritage institutions should make every effort to work with the grain of the Web, and taking URLs seriously is a big part of that. I’d like to see more guidance for cultural heritage institutions on effective use of URLs, what Tim Berners-Lee has called Cool URIs, and what the Microformats and blogging community call permalinks. When systems are being designed or evaluated for purchase, we need to think about the URL namespaces that we are creating, and how we can migrate them forwards. Ironically, identifier schemes that don’t fit into the DNS and HTTP landscape have their own set of risks; namely that organizations become dependent on niche software and services. Sometimes it’s prudent (and cost effective) to seek safety in numbers.

I did not put this discussion here to try to shame CDL in any way. I think the EZID service is well intentioned, clearly done in good spirit, and already quite useful. But in the long run I’m not sure it pays for institutions to go it alone like this. As another crank (I mean this with all due respect) Ted Nelson put it:

Beware Geeks Bearing Gifts.

On the surface the EZID service seems like a very useful gift. But it comes with a whole set of attendant assumptions. Instead of investing time & energy getting your institution to use a service like EZID, I think most cultural heritage institutions would be better off thinking about how they manage their URL namespaces, and making resource metadata available at those URLs.


xhtml, wayback

The Internet Archive gave the Wayback Machine a facelift back in January. It actually looks really nice, but I noticed something kinda odd. I was looking for old archived versions of the lcsh.info site. Things work fine for the latest archived copies:

But during part of lcsh.info’s brief lifetime the site was serving up XHTML with the application/xhtml+xml media type. Now Wayback rightly (I think) remembers the media type, and serves it up that way:

ed@curry:~$ curl -I http://replay.waybackmachine.org/20081216020433/http://lcsh.info/
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Guessed-Charset: UTF-8
X-Archive-Orig-Connection: close
X-Archive-Orig-Content-Length: 6497
X-Archive-Orig-Content-Type: application/xhtml+xml; charset=UTF-8
X-Archive-Orig-Server: Apache/2.2.8 (Ubuntu) DAV/2 SVN/1.4.6 PHP/5.2.4-2ubuntu5.4 with Suhosin-Patch mod_wsgi/1.3 Python/2.5.2
X-Archive-Orig-Date: Tue, 16 Dec 2008 02:04:31 GMT
Content-Type: application/xhtml+xml;charset=utf-8
X-Varnish: 1458812435 1458503935
Via: 1.1 varnish
Date: Wed, 09 Mar 2011 23:09:47 GMT
X-Varnish: 903390921
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS

But to add navigation controls and branding, Wayback also splices its own HTML into the display, which unfortunately is not valid XML. And since the application/xhtml+xml media type puts browsers into strict XML parsing mode, the pages render in Firefox like this:

And in Chrome like this:

Now I don’t quite know what the solution should be here. Perhaps the HTML that is spliced in should be valid XML. Or maybe Wayback should just serve up the HTML as text/html. Or maybe this is a good use case for frames (gasp). But I imagine it will similarly afflict any other XHTML that was served up as application/xhtml+xml when Heritrix crawled it.
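A quick way to spot (or keep monitoring for) this kind of breakage is to run the replayed page through a strict XML parser, which is roughly what an XML-mode browser does before it will render anything. A minimal sketch using Python’s standard library:

import urllib.request
from xml.etree import ElementTree

url = "http://replay.waybackmachine.org/20081216020433/http://lcsh.info/"
body = urllib.request.urlopen(url).read()

try:
    ElementTree.fromstring(body)
    print("well-formed XML, safe to serve as application/xhtml+xml")
except ElementTree.ParseError as e:
    print("not well-formed; an XML-mode browser will refuse to render it:", e)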

Sigh. I sure am glad that HTML5 is arriving on the scene and XHTML is riding off into the sunset. Although it’s kind of the Long Goodbye, given that the Internet Archive has archived it.

Update: just a couple hours later I got an email that a fix for this was deployed. And sure enough now it works. I quickly eyeballed the response and didn’t see what the change was. Thanks very much Internet Archive!


on "good" repositories

Chris Rusbridge kicked off an interesting thread on JISC-REPOSITORIES with the tag line What makes a good repository? Implicit in this question, and perhaps the discussion list, is that he is asking about “digital” repositories, and not the brick n’ mortar libraries, archives, etc that are arguably also repositories.

The question of what a repository is, is pretty much verboten in the group I work in. This is kind of amusing since I work in a group whose latest name (in a string of names, and no-names) is the Repository Development Center. Well, maybe saying it’s “verboten” is putting it a bit too strongly. It’s not as if the “repository” is equivalent to He Who Shall Not Be Named or anything. It’s just that the word means so many things, to so many different people, and encompasses so much of what we do, that it’s hardly worth talking about. At our best (IMHO) we focus on what staff and researchers want to do with digital materials, and building out services that help them do that. Getting wrapped around the axle about what set of technologies we are using, and whether they model data in a particular way, is putting the cart before the horse.

As Dan penned one April 1st: “if you seek a pleasant repository, look about you”. I guess this largely depends on where you are sitting. But seriously, if there’s one thing that the Trustworthy Repositories Audit & Certification: Criteria and Checklist, the Ten Principles for Digital Preservation Repositories, and the Blue Ribbon Task Force on Sustainable Digital Preservation and Access make abundantly clear (after you’ve clawed out your eyes) it’s that the fiscal and social dimensions of repositories are a whole lot more important in the long run than the technical bits of how a repository is assembled in the now. I’m a software developer, and by nature I reach for technical solutions to problems, but in my heart of hearts I know it’s true.

Back to Chris’ question. Perhaps the “digital” is a red herring. What if we consider his question in light of traditional libraries? This got me thinking: could Ranganathan and his Five Laws of Library Science serve as a touchstone? Granted, bringing Ranganathan into library discussions is a bit of a cliché. But asking ethical questions like the “goodness” of something is a great excuse to dip into the canon. So put on your repository-colored glasses, which magically substitute Repository Object for Book, and …

Repository Objects Are For Use

We can build repositories that function as dark archives. But it kind of rots your soul to do it. It rots your soul because no matter what awesome technologies you are using to enable digital preservation in the repository, the repository needs to be used by people. If it isn’t used, the stuff rots. And the digital stuff rots a whole lot faster than the physical materials. Your repository should have a raison d’être. It should be anchored in a community of people that want to use the materials it houses. If it doesn’t, the repository is likely to suck … er, not be good.

Every Reader His/Her Repository Object

Depending on their raison d’être (see above), repositories are used by a wide variety of people: researchers, administrators, systems developers, curators, etc. It does a disservice to these people if the repository doesn’t support their use cases. A researcher probably doesn’t care when fixity checks were last performed, and an administrator generating a report on fixity checks doesn’t care about how a repository object was linked to and tagged in Twitter. Does your repository allow these different views, for different users, to co-exist for the same object? Does it allow new classes of users to evolve?

Every Repository Object Its Reader

Are the objects in your repository discoverable? Are there multiple access pathways to them? For example, can someone do a search in Google and wind up looking at an item in your repository? Can someone link to it from a Wikipedia article? Can someone do a search within your repository to find an object of interest? Can they browse a controlled vocabulary or subject guide to find it? Are repository objects easily identified and found by automated agents like web crawlers and software components that need to audit them? Is it easy to extend, enhance and refine your description of what the repository object is as new users are discovered?

Save the Time of the Reader

Is your repository collection meaningfully on the Web? If it isn’t, it should be, because that’s where a lot of people are doing research today…in their web browser. If it can’t be open access on the web, that’s OK … but the collection and its contents should be discoverable so that someone can arrange an onsite visit. For example, can a genealogist do a search for a person’s name in a search engine and end up in your repository? Or do they have to know to come to your application to type in a search there? Once they are in your repository can they easily limit their search along familiar dimensions such as who, what, why, when, and where? Is it easy for someone to bookmark a search, or an item, for later use? Do you allow your repository objects to be reused in other contexts like Facebook, Twitter, Flickr, etc., which put the content where people are, instead of expecting them to come to you?

The Repository is a Growing Organism

This is my favorite. Can you keep adding numbers and types of objects, and scale your architecture linearly? Or are you constrained in how large the repository can grow? Is this constraint technical, social and/or financial? Can your repository change as new types or numbers of users (both human and machine) come into existence? When the limits of a particular software stack are reached, is it possible to throw it away and build another without losing the repository objects you have? How well does your repository fit into the web ecosystem? As the web changes do you anticipate your repository will change along with it? How can you retire functionality and objects, letting them die naturally, with respect, to make way for the new?

So …

I guess there are more questions here than answers. I hadn’t thought of framing repository questions in terms of Ranganathan’s laws before, but I imagine it has occurred to other people. They still seem to be quite good principles to riff on, even in the digital repository realm, at least for a blog post. If you happen to run across similar treatments elsewhere I would appreciate hearing about them.


release early, release often

The National Digital Newspaper Program (NDNP) went live with a new JavaScript viewer today (as well as a lot of other stylistic improvements) in the beta area of Chronicling America.

Being able to smoothly zoom in on images, and go into fullscreen mode is really reminiscent (for me) of the visceral experience of using a microfilm reader. We joked about adding whirring sound effects when moving between pages. But you’ll be glad to know we showed restraint :-) It’s all kind of deeply ironic given the Web’s roots in the Memex.

Hats off to Dan Krech, Risa Ohara and Chris Adams for really digging into things like dynamically rendering tiles from our JPEG2000 access copies using Python (maybe more on that later).
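I won’t go into the actual tile rendering code here, but the basic idea is simple enough to sketch with something like PIL/Pillow (assuming a JPEG2000 decoder such as OpenJPEG is wired in; this is a toy, not what Chronicling America actually runs):

from PIL import Image

TILE_SIZE = 256

def render_tile(jp2_path, column, row, scale=1.0):
    """Cut one square tile out of a page image, optionally scaled, ready to save as JPEG."""
    page = Image.open(jp2_path)
    x = column * TILE_SIZE
    y = row * TILE_SIZE
    tile = page.crop((x, y, x + TILE_SIZE, y + TILE_SIZE))
    if scale != 1.0:
        tile = tile.resize((int(TILE_SIZE * scale), int(TILE_SIZE * scale)))
    return tile.convert("RGB")

# e.g. write a tile out for the viewer to fetch
render_tile("page.jp2", 3, 2).save("tile_3_2.jpg", "JPEG")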

I hacked together a brief video demonstration (above) of looking up the day the American Civil War ended (April 9, 1865) in the New York Daily Tribune, to show off the viewer. One thing I forgot to do was go into full-screen mode (F11 in Firefox, Chrome, etc.), which amplifies the effect somewhat.

Aside from the improvements on the site, this is a real milestone for the project and (I believe) the Library of Congress generally, since it is a ‘beta’ preview of what we would like to replace the existing site with. Given the nature of what they do, libraries are typically fairly conservative and slow moving organizations. So knowing how to use a beta/experimental area has proven to be a challenge. Hopefully a little space for experimentation will pay off. I don’t think we could’ve gotten this far without the help of our fearless leader in all things tech, David Brunton.

If you have ideas, feel free to leave feedback using the little widget on the lower-right of pages at Chronicling America, or using our new mailing list that is devoted to the Open Source software project that makes Chronicling America available. Open Source too, imagine that!


OCLC's mapFAST and CORS

Yesterday at Code4lib 2011 Karen Coombs gave a talk where (among other things) she demonstrated mapFAST, which lets you find relevant subject headings for a given location, and then click on a subject heading to find relevant books on the topic. Go check out the archived video of her talk (note you’ll have to jump 39 minutes or so into the stream). Karen mentioned that the demo UI uses the mapFAST REST/JSON API. The service lets you construct a URL like this to get back subjects for any location you can identify with lat/lon coordinates:

http://experimental.worldcat.org/mapfast/services?geo={lat},{lon};crs=wgs84&radius={radius-in-meters}&mq=&sortby=distance&max-results={num-results}

For example:

ed@curry:~$ curl -i 'http://experimental.worldcat.org/mapfast/services?geo=39.01,-77.01;crs=wgs84&radius=100000&mq=&sortby=distance&max-results=1'
HTTP/1.1 200 OK
Date: Wed, 09 Feb 2011 14:07:39 GMT
Server: Apache/2.0.63 (Unix)
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/json

{
  "Status": {
    "code": 200, 
    "request": "geocode"
  }, 
  "Placemark": [
    {
      "point": {
        "coordinates": "39.0064,-77.0303"
      }, 
      "description": "", 
      "ExtendedData": [
        {
          "name": "NormalizedName", 
          "value": "maryland silver spring woodside park"
        }, 
        {
          "name": "Feature", 
          "value": "ppl"
        }, 
        {
          "name": "FCode", 
          "value": "P"
        }
      ], 
      "id": "fst01324433", 
      "name": "Maryland -- Silver Spring -- Woodside Park"
    }
  ], 
  "name": "FAST Authority Records"
}

Recently I have been reading Mark Pilgrim’s wonderful Dive into HTML5 book, so I got it into my head that it would be fun to try out some of the geo-location features in modern browsers to display subject headings that are relevant for wherever you are. A short time later I had a simple HTML5/JavaScript application (dubbed subjects-here) that does just that. The application itself is really just a toy. Part of Karen’s talk was emphasizing the importance of using more than just text in Library Applications…and subjects-here kind of misses that key point.

What I wanted to highlight is this header in the HTTP response above:

Access-Control-Allow-Origin: *

The Access-Control-Allow-Origin HTTP header is a Cross-Origin Resource Sharing (CORS) header. If you’ve developed JavaScript applications before, you have probably run into situations where you wanted your JavaScript to fetch data from a service elsewhere on the web, but were prevented from doing so by the Same Origin Policy, which keeps your JavaScript code from talking to a website other than the one it was loaded from. So normally you hack around this by creating a proxy for that web service in your own application, which is a bit of work. Sometimes license agreements also frown on you re-exposing a service, so you have to jump through a few more hoops to make sure you haven’t created an open proxy for the web service.
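For comparison, that proxy workaround usually amounts to a little wad of server side code like this (a sketch, not anything OCLC or I actually run), which exists only to relay requests from your own hostname:

import urllib.request
from wsgiref.simple_server import make_server

MAPFAST = "http://experimental.worldcat.org/mapfast/services"

def proxy(environ, start_response):
    # pass the browser's query string along to the remote service, so the
    # JavaScript only ever has to talk to the hostname it was loaded from
    url = MAPFAST + "?" + environ.get("QUERY_STRING", "")
    body = urllib.request.urlopen(url).read()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, proxy).serve_forever()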

Enter CORS.

What the folks at OCLC did was add an Access-Control-Allow-Origin header to their JSON response. This basically means that my JavaScript, served up at inkdroid.org, is able to run in your browser and talk to the server at experimental.worldcat.org. OCLC has decided to allow this to make their Web Service easier to use. So to create subjects-here I didn’t have to write a single bit of server side code; it’s just static HTML and JavaScript:

// assumes jQuery ($) and Modernizr are already loaded in the page

function main() {
    if (Modernizr.geolocation) {
        // ask the browser for the user's position, then look up subjects for it
        navigator.geolocation.getCurrentPosition(lookup_subjects);
    }
    else {
        display_error();
    }
}

function lookup_subjects(position) {
    var lat = parseFloat(position.coords.latitude);
    var lon = parseFloat(position.coords.longitude);
    var url = "http://experimental.worldcat.org/mapfast/services?geo=" + lat + "," + lon + ";crs=wgs84&radius=100000&mq=&sortby=distance&max-results=15";
    $.getJSON(url, display_subjects);
}

function display_subjects(data) {
    // putting results into the DOM left as exercise to the reader
}

Nice and simple right? The full code is on GitHub, which seemed a bit superfluous since there is no server-side piece (it’s all in the browser). So the big wins are:

  • OCLC gets to see who is actually using their web service, not who is proxying it.
  • I don’t have to write some proxy code.

The slight drawbacks are:

  • My application has a runtime dependency on experimental.worldcat.org, but it would have had that dependency anyway if I had proxied it.
  • Most modern browsers support CORS headers, but not all of them. So you would need to evaluate whether that matters to you.

I guess this is just a long way of saying USE CORS!! and help make the web a better place (pun intended).

Update: and also, it is a good example where something like GeoJSON and OpenSearch Geo could’ve been used to help spread common patterns for data on the Web. Thanks to Sean Gillies for pointing that out.

Update: and Chris is absolutely right, JSONP is another pattern in the Web Developer community that is a bit of a hack, but is an excellent fallback for older browsers.


@andypowe11

Andy Powell has a post over on the eFoundations blog about some metadata guidelines he and Pete Johnston are working on for the UK Resource Discovery Taskforce. I got to rambling in a text area on his blog, but I guess I wrote too much, or included too many external URLs, so I couldn’t post it in the end. So I thought I’d just post it here, and let trackback do the rest.

So uh, please s/you/Andy/g in your head as you are reading this …

A bit of healthy skepticism, from a 15-year vantage point, is definitely warranted, bearing in mind that oftentimes it’s hard to move things forward without taking a few risks. I imagine constrained fiscal resources could also be a catalyst for improving access to the data flows that cultural heritage institutions participate in, or want to participate in. I wonder if it would be useful to factor in the money that organizations can save by working together better?

As I’ve heard you argue persuasively in the past, the success of the WWW as a platform for delivery of information is hard to argue with. One of the things that the WWW did right (from the beginning) was focus the technology on people actually doing stuff…in their browsers. It seems really important to make sure whatever this metadata is, that users of the Web will see it (somehow) and will be able to use it. Ian Davis’ points in Is the Semantic Web Destined to be a Shadow are still very relevant today I think.

My friend Dan Krech calls this an “alignment problem”. So I was really pleased to see this in the vision document:

Agreed core standards for metadata for the physical objects and digital objects in aggregations ensuring the aggregations are open to all major search engines

Aligning with the web is a good goal to have. Relatively recent service offerings from Google and Facebook indicate their increased awareness of the utility of metadata to their users. And publishers are recognizing how important that metadata is for getting their stuff before more eyes. It’s a kind of virtuous cycle, I hope.

This must feel like it has been a long time in coming for you and Pete. Google’s approach encourages a few different mechanisms: RDFa, Microdata and Microformats. Similarly, Google Scholar parses a handful of metadata vocabularies present in the HTML head element. The web is a big place to align with I guess.

I imagine there will be hurdles to get over, but I wonder if your task-force could tap into this virtuous cycle. For example it would be great if cultural heritage data could be aggregated using techniques that big Search companies also use: e.g. RDFa, microformats and microdata; and sitemaps and Atom for updates. This would assume a couple things: publishers could allow (and support) crawling, and that it would be possible to build aggregator services to do the crawling. An important step would be releasing the aggregated content in an open way too. This seems to be an approach that is very similar to what I’ve heard Europeana is doing…which may be something else to align with.
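To make the crawling side a bit more concrete, an aggregator could start from nothing more than each publisher’s sitemap. A minimal sketch (the sitemap URL here is made up):

import urllib.request
from xml.etree import ElementTree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def page_urls(sitemap_url):
    """Yield the page URLs listed in a publisher's sitemap."""
    xml = urllib.request.urlopen(sitemap_url).read()
    tree = ElementTree.fromstring(xml)
    for loc in tree.iter(SITEMAP_NS + "loc"):
        yield loc.text

for url in page_urls("http://example.org/sitemap.xml"):
    # fetch each page and extract whatever RDFa, microdata or microformats are embedded
    print(url)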

I like the idea of your recommendations providing a sliding scale, for people to get their feet wet in providing some basic information, and then work their way up to the harder stuff. Staying focused on what sorts of services moving up the scale provides seems to be key. Part of the vision document mentions that the services are intended for staff. There is definitely a need for administrators to manage these systems (I often wonder what sort of white-listing functionality Google employs with its Rich Snippets service to avoid spam). But keeping the ultimate users of this information in mind is really important.

Finally I’m a bit curious about the use of ‘aggregations’ in the RLUK vision. Is that some OAI-ORE terminology percolating through?


wikixdc

Wikipedia’s 10th Birthday Party at the National Archives in Washington DC on Saturday was a lot of fun. Far and away, the most astonishing moment for me came early in the opening remarks by David Ferriero, the Archivist of the United States, when he stated (in no uncertain terms) that he was a big fan of Wikipedia, and that it was often his first go-to for information. Not only that, but when discussion about a bid for a DC WikiMania (the Wikipedia Annual Conference) came up later in the morning, Ferriero suggested that the National Archives would be willing to host it if it came to pass. I’m not sure if anything actually came of this later in the day–a WikiMania in DC would be incredible. It was just amazing to hear the Archivist of the United States be supportive of Wikipedia as a reference source…especially as stories of schools, colleges and universities rejecting Wikipedia as a source are still common. Ferriero’s point was even more poignant with several high schoolers in attendance. Now we all can say:

If Wikipedia is good enough for the Archivist of the United States, maybe it should be good enough for you.

Another highlight for me was meeting Phoebe Ayers, who is a reference librarian at UC Davis, member of the Wikimedia Foundation Board of Trustees, and author of How Wikipedia Works. I strong-armed Phoebe into signing my copy (I bought it on Amazon after it was de-accessioned from the Cuyahoga County Public Library in Parma, Ohio). Phoebe has some exciting ideas for creating collaborations between libraries and Wikipedia, which I think fit quite well into the Galleries, Libraries, Archives and Museums (GLAM) effort within Wikipedia. I think she is still working out how to organize the effort.

Later in the day we heard how the National Archives is thinking of following the lead of the British Museum and establishing a Wikipedian in Residence. Liam Wyatt, the first Wikipedian in Residence, put a human face on Wikipedia for the British Museum, and familiarized museum staff with editing Wikipedia through activities like the Hoxne Challenge. Having a Wikipedian in Residence at the National Archives (and, who knows, maybe the Smithsonian and the Library of Congress) would be extremely useful I think.

In a similar vein, Sage Ross spoke at length about the Wikipedia Ambassador Program. The Ambassador Program is a formal way for folks to represent Wikipedia in academic settings (universities, high schools, etc). Ambassadors can get training in how to engage with Wikipedia (editing, etc) and can help professors and teachers who want to integrate Wikipedia into their curriculum, and scholarly activities.

I got to meet Peter Benjamin Meyer of the Bureau of Labor Statistics, who has some interesting ideas for aggregating statistical information from federal statistical sources, and writing some bots that will update article info-boxes for places in the United States. The impending release of the 2010 US Census data has the Wikipedia community discussing the best way to update the information that was added by a bot for the 2000 census. It seemed like Peter might be able to piggyback some of his efforts on the work that is going on at Wikipedia for the 2010 Census.

Jyothis Edthoot, an Oracle employee and Wikipedia Steward, gave me a behind-the-scenes look at the tools he and others in the Counter Vandalism Unit use to keep Wikipedia open for edits from anyone in the world. I also got to meet Harihar Shankar from Herbert van de Sompel’s team at Los Alamos National Lab, and to learn more about the latest developments with Memento, which he gave a lightning talk about. I also ran into Jeanne Kramer-Smyth of the World Bank, and got to hear about their efforts to provide meaningful access to their document collections to web crawlers using their metadata.

I did end up giving a lightning talk about Linkypedia (slides on the left). I was kind of rushed, and I wasn’t sure that this was exactly the right audience for the talk (being mainly Wikipedians instead of folks from the GLAM sector). But it helped me think through some of the challenges in expressing what Linkypedia is about, and who it is for. All in all it was a really fun day, with a lot of friendly folks interested in the Wikipedia community. There must’ve been at least 70 people there on a very cold Saturday–a promising sign of good things to come for collaborations between Wikipedia and the DC area.