Archive for the ‘opensource’ Category

SKOS in the Context of Semantic Web Deployment

Wednesday, April 30th, 2008

If you happen to be in the DC area on May 8th and are interested in linked data and the practical application of semantic web technologies like RDF and OWL please join us at the Library of Congress for a presentation by Alistair Miles, key developer of SKOS, and semantic web practitioner at the University of Oxford.

Below is the announcement, I hope you can make it. Oh, and if you are really interested in this stuff we’re having some brown bag sessions later in the afternoon that you are welcome to attend, just email me at ehs [at] pobox [dot] com.

The Simple Knowledge Organization System (SKOS), in the Context of Semantic Web Deployment, Alistair Miles, University of Oxford May 8th 10am6th 11:30am, 2008, Montepelier Room, Madison Building, Library of Congress (map) .

Links are valuable. Links between documents, between people, between ideas, between data. Data is now a first class Web citizen, and the Web is expanding as more of these valuable networks are deployed within its fabric. Well-established knowledge organization systems like the Library of Congress Subject Headings will play a major role within these networks, as hubs, connecting people with information and providing a firm foundation for network growth as many new routes to the discovery of information emerge through the collective action of individuals. Or will they?

This talk introduces the Simple Knowledge Organization System (SKOS), a soon-to-be-completed W3C standard for publishing thesauri, classification schemes and subject headings as linked data in the Web. This talk also presents SKOS in the context of the W3C’s Semantic Web Activity, and in particular the work of the W3C’s Semantic Web Deployment Working Group where other specifications are being developed for publishing linked data in the Web, for embedding linked data in Web pages, and for managing Semantic Web vocabularies. Finally, this talk takes a mildly inquisitive look at the value propositions for linked data in the Web, and how LCSH might be deployed in the Web for better information discovery.

Alistair’s background is in the development of Web technologies for scientific applications. He was a research associate in the e-Science department of the Rutherford Appleton Laboratory, where he was introduced to Semantic Web technologies and first developed SKOS. He has recently moved to the University of Oxford to work on linking fruit fly genomics research data, and he hopes everything he knows about the Semantic Web will turn out to be useful after all.

tabulator and google reader notifier oddness

Monday, March 17th, 2008

If you’ve ever tried installing the Tabulator (Tim Berners-Lee’s experimental linked-data browser) and not seen it work you may have run into the same problem as me.

On a hunch I guessed that there might be some weird interaction with another Firefox plugin — so I went through all 15 of them, disabling each one and restarting Firefox to see if Tabulator would start working. Sure enough, after I disabled Google Reader Notifier the Tabulator worked fine.

I dropped a message to public-semweb-ui, but figured it couldn’t hurt to add this here for other linked-data nerds casting about in google with the same problem.

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080207 Ubuntu/7.10 (gutsy) Firefox/2.0.0.12
Tabulator v0.8.2
Google Reader Notifier v0.4.5

rockin’ the plastic

Tuesday, August 14th, 2007

High 5, more dead than alive
Rockin’ the plastic like a man from a casket

Yeah, the blog is back after getting routed by the LinuXploit Crew. The whole episode was really rewarding actually. I learned what projects I work on that need to be hosted elsewhere at more stable locations–that are likely to outlive my pathetic musings. I (re)learned how important good friends are (ksclarke, dchud, wtd, jaf, gabe, rsinger) in a pinch. And I watched in awe as the Wordpress 2.2 upgrade actually worked on my pathologically old instance (which I suspect to have been the front door). Oh, and it was a good excuse to ditch gentoo for ubuntu.

Spring cleaning came a bit late this year I guess. Thanks LinuXploit Crew!

DemoCampDC

Friday, January 5th, 2007

DemoCampDC is an adaptation of BarCamp to provide an informal mechanism for sharing technology shtuff in the DC area. If you are interested and in the DC area please add your name to the list of attendees and stay tuned.

evergreen

Friday, December 22nd, 2006


In case you missed it linux.com is running an article by Michael Stutz on Evergreen, an open source integrated library system developed by the state of Georgia to support a consortium of 44 different libraries. (Thanks for the link Adam)

Hanging out with miker_ and bradl in irc and having open-ils in my feed reader makes me take this sort of work for granted sometimes…and Michael’s article made me wake up and marvel at how truly remarkable the work they’ve done is.

The evergreen folks are hosting this years code4libcon where I’m supposed to be doing a presentation on the Atom Publishing Protocol. It’s a low cost/pragmatic alternative to the usual library technology conference options–and will be a good opportunity to buy these Evergreeners a beer. I hope to see you there.

rsinger++

Thursday, September 28th, 2006

So Ross beat out 11 other projects to win the OCLC Research Software Contest for his next generation OpenURL resolver umlaut. Second place went to to Jesse Andrews’ BookBurro–so the competition was fierce this year. Much more so than last year when there were 4 contestants.

Those of us who hang out in #code4lib got to hear about this project when it was just a glimmer in his eye…and had front row seats for hearing about the development as it progressed. Essentially umlaut is an openurl router that’s able to consult online catalogs (via SRU), other OpenURL resolvers (SFX), Amazon, Google, Yahoo, Connotea, CiteULike and OAI-PMH. It’s all written in Ruby and RubyOnRails.

I feel particularly proud because Ross is enough of a mad genius to have found a use for some ruby gems I wrote for doing sru, oai-pmh and querying OCLC’s xisbn service.

Speaking of which we’ve been collaborating recently on a little ruby gem for querying OCLC’s OpenURL Resolver Registry. This registry essentially makes it easy to determine what the appropriate OpenURL resolver is given a particular IP address. So you could theoretically rewrite your fulltext URLs so that they were geospatially aware. For example:

  require "resolver_registry"
 
  client = ResolverRegistry::Client.new
  institution = client.find('130.207.50.91')
  print institution.resolver.base_address

If you want to take a look direct your svn client like so:

svn co http://rsinger.library.gatech.edu/svn/openurl_registry/

I imagine it’ll get released to rubyforge sometime shortly.

ruby-oai v0.0.3

Tuesday, September 19th, 2006

v0.0.3 of ruby-oai was just released to RubyForge. The big news is that this release allows you to use libxml for parsing thanks to the efforts of Terry Reese. Terry is building a RubyOnRails metasearch application at OSU and, well, felt the need for speed.

After committing the branch he was working on I ran some performance tests of my own. I ran a vanilla ListRecords request against dspace, eprints and american memory oai-pmh servers using both the rexml (default) and libxml backend parsers. Here are the results

server parser real user sys
dspace rexml 0m3.632s 0m2.008s 0m0.044s
libxml 0m1.900s 0m0.212s 0m0.032s
  1.732s (+48%) 1.796s (+89%) 0.012s (+27%)
 
eprints rexml 0m19.807s 0m1.984s 0m0.036s
libxml 0m19.344s 0m0.236s 0m0.024s
  0.463s (+2%) 1.748s (+88%) 0.012s (+33%)
 
american-memory rexml 0m12.991s 0m5.424s 0m0.052s
libxml 0m7.420s 0m0.324s 0m0.032s
  5.571s (+43%) 5.104s (+94%) 0.02s (+38%)

Those percentage values are speed improvements. Thanks Terry :-)

the importance of making packages

Tuesday, September 19th, 2006

If you are interested in such things Ian Bicking has a nice posting about why breaking up a project into smaller packages of functionality is important. His key point is that the boundaries between packages actually help in establishing and maintaining decoupled modules within your application.

…when someone claims their framework is all spiffy and decoupled, but they just don’t care to package it as separate pieces… I become quite suspicious. Packaging doesn’t fix everything. And it can introduce real problems if you split your packages the wrong way. But doing it right is a real sign of a framework that wants to become a library, and that’s a sign of Something I’d Like To Use.

So why is decoupling important? Creating distinct modules of code with prescribed interfaces helps ensure that a change inside one module doesn’t have a huge ripple effect across an entire project codebase. In addition to using packaging to create boundaries between components the Law of Demeter is a pretty handy technique for reducing coupling in object oriented systems. It amounts to ensuring that a given method only invokes methods on objects that are: itself, in its parameters, objects that itself creates, or component objects. The LoD seems to be a good practice at the local level, but packaging helps at a macro/design level. One of the most powerful and fun parts of packaging is coming up with good names and metaphors for your packages and components. Having fun and meaningful names for packages provides coherence to a project, and allows developers to talk about an application. Eric Evans has some nice chapters in his Domain Driven Design about coming up with what he calls a domain language whose aim is to:

To create a supple, knowledge-rich design calls for a versatile, shared team language, and a lively experimentation with language that seldom happens on software projects.

It’s important…and naming distinct packages well helps build a good domain language.

I suppose it’s implicit in making something a code library–but one of the other major benefits of splitting a larger project up into smaller packages is that you encourage reuse. The bit of functionality that you decided to bundle up separately can be used as a dependency in a different project–perhaps even by a different person or organization. This seems to me to be a hallmark of good open source software.

Most popular languages these days have established ways of making packages available, downloadable and installable while expressing the dependencies between them. Perl has CPAN, PHP has PEAR, Ruby has gems and RubyForge, Python has eggs and EasyInstall, Java has maven, Lisp has asdf. Even some applications like Trac, RubyOnRails and Drupal encourage the creation of smaller packages (modules or plugins) by having a well defined api for adding extensions. And that’s not even getting into the various ways operating systems make packages available…

The truly hard part about packaging for me isn’t really technical. Most packaging solutions allow you to manage dependencies, versioning, installation and removal. As Ian says, its the decision of where to draw the lines between packages that is hard. It’s hard because you have to guess before you start coding–and often during the process of coding you realize that the dividing lines between packages begin to blur. This is why having distinct packages is so important because you are forced to stare at the blurriness and encouraged to fix it…instead of creating the infamous big ball of mud.

An interesting counterpoint to trying to figure out the dividing lines before hand is to try to design from the outside in, and extract reusable components from the result. The very successful RubyOnRails web framework was extracted from a working application (Basecamp). In a lot of ways I think Test Driven Design encourages this sort of outside-in thinking as well. Extracting usable components from a ball of mud is nigh impossible though…at least for me. I would be interested to know how much of the Rails components were anticipated by the designers as they were creating BaseCamp. It takes a great deal of discipline and jazz-like anticipation to be able to improvise a good design. That or, you have to build in time to prototype something with an eye to taking what you’ve learned to do it right.

open standards

Friday, September 15th, 2006

Folks who are interested in libraries and technology are often drawn to the issue of open standards. Using open standards is very important to libraries for a variety of reasons that Ed Corrado summarizes nicely.

This week my podcast reader picked up an excellent interview with Danese Cooper of the Open Source Initiative where she talks about the Open Standard Requirement which was introduced a few months ago. It provides a new perspective on the same issue from outside of the library community.

Essentially the OSR amounts to 5 guidelines for identifying a truly open standard. These guidelines are different though because they focus on what makes a standard open for an implementor. Whether the standard was created by an open process or not is really outside of scope. The important thing is how easy it is for a software developer to write software that uses the standard. A nice feature of the OSR is that the guidelines would fit on an index card. Here’s my regurgitation of them:

  1. The spec can’t omit details needed for implementation
  2. The standard needs to be freely/publicly available
  3. All patents involved in the spec need to be royalty free
  4. Clicking through a license agreement is not necessary
  5. The spec can’t be dependent on a standard that is not open as well

Danese was quick to point out that these are simply guidelines and not rules. For example Unicode fails on 2. since you have to pay for a copy of the spec. But in this case printing the standard is a publishing feat–given all the glyphs and their number. It’s not unusual that the book would cost money. So this guideline could be waived if the OSI folks agreed.

Rather than the OSI going and applying these rules to all known standards the idea is that standards bodies could claim self-compliance–and as developers implement the standard the compliance will be ascertained.

The guidelines themselves and the process of being fine tuned/hammered on–and they are looking for volunteers…

set your data free … with unapi

Monday, August 28th, 2006

Dan, Jeremy, Peter, Michael, Mike, Ross and I wrote an article in the latest Ariadne introducing the lightweight web protocol unAPI. Essentially unAPI is an easy way to include references to digital objects in your HTML which can then be predictably retrieved by a machine…yes ‘machine’ includes JavaScript running in a browser :-) Dan and a really nice cross section of developers around the world have been working on this spec for over a year now and I think it could be poised to play an important role in the emerging open data movement.

Imagine you have a citation database which is searchable via the web. The search results include hits. Wouldn’t it be nice to align your human viewable results with machine readable representations so that people could write browser hacks and the like to remix your application data?

As far as I can tell there are a few options available to help you do this (apart from doing something ad-hoc).

  1. use a citation microformat and mark up your HTML predictably so that it can be recognized and parsed
  2. use GRDDL to map your HTML to RDF via an XLST profile.
  3. embed RDF in your HTML essentially using an RDF microformat.
  4. OpenURL and/or COinS to link in page IDs to OpenURL servers.
  5. use unAPI and include a unapi server url (familiar autodiscovery like RSS/Atom), and identifiers (simple element attributes) and write a simple server side script that emits xml for a given identifier.

I like microformats a lot and I think a citation format will eventually get done. But it’s been a long time coming and there’s no indication it’s going to get done any time soon. What’s more unAPI is bigger than just citation data–and it allows you to publish all kinds of rich data objects without waiting for a community to ratify a particular representation in HTML.

Options 2 and 3 use RDF which I actually like quite a bit as well. GRDDL implies a GRDDL aware browser which would be cool but is a bit heavy weight. XSLT will require clean XHTML–or pipelines to clean it. Embedding RDF in HTML using microformat techniques is compelling because you can theoretically process the RDF data similarly–whereas unAPI doesn’t require any particular kind of machine readable format (apart from HTML). Actually there’s nothing stopping you from using unAPI to link human viewable objects with RDF representations. The advantage unAPI has here is you can learn RDF if you want to, but you don’t have to learn RDF to get going with unAPI today.

Option 4 leverages work done in the library community on citation linking. OpenURL routers are widely deployed in libraries around the world and COinS is a quasi-microformat for putting OpenURL context objects into your HTML so that they can be extracted and fired off at an OpenURL server. OpenURL is a relatively complex and subtle standard which can do a lot more than just citation linking. Compared to OpenURL/COinS unAPI allows for ease of implementation in languages like JavaScript and provides a simple introspection mechanism for discovering what formats a particular resource is available in. AFAIK this can’t be done simply using OpenURL/COinS. If I’m wrong, comments should be open. I would argue that the sheer power and flexibility of OpenURL paradoxically make it hard to understand…and that unAPI in Dan’s adherence to a one-page-spec is more limited and simple. Less is more…

So if this piques your interest read the article. It does a much better job of describing the origins of the work, where it’s headed, has examples and links out to sites/tools that use unAPI today. I must admit I wrote very little of the article, and mostly contributed text snippets and screenshots of the unAPI validator I wrote, which uses my unapi ruby gem.