history and genealogy at semwebdc

Last week’s Washington DC Semantic Web Meetup focused on History and Genealogy Semantics. It was a pretty small, friendly crowd (about 15-20) that met for the first time at the Library of Congress. The group included folks from PBS, the National Archives, the Library of Congress, and the Center for History and New Media, as well as some regulars from the Washington DC SGML/XML Users Group.

Brian Eubanks gave a presentation on what the Semantic Web, Linked Data and specifically RDF and Named Graphs have to offer genealogical research. He took us on a tour through a variety of websites, such as the Land Records Database at the Bureau of Land Management, Ancestry.com, Footnote and Google Books, and made a strong case for using RDF to link these sorts of documents with a family tree.

As more and more historic records make their way online as Web Documents with URIs, RDF becomes an increasingly useful data model for providing provenance and source information for a family tree. On sites like Ancestry.com it is important to understand the provenance of genealogical assertions, since Ancestry.com allows you to merge other people’s family trees into your own, based on likely common ancestors. In situations like this, researchers need to be able to evaluate the credibility of other people’s trees, and being able to trace the links in a family tree back to the documents that support them is an essential part of the equation.

Along the way Brian let people know about a variety of vocabularies that are available for making assertions that are of value to genealogical research:

rdfcal : for Events
BIO : for biographical information
Relationship : for describing the links between people
FOAF : for describing people
TriG : a Named Graph syntax rather than a vocabulary, for identifying the assertions that a researcher makes and linking them to a given document
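To make the idea of using these vocabularies in concert a bit more concrete, here is a minimal sketch in Python that writes out a small TriG document by hand. It is not anything Brian showed. The ex: namespace, the people, the graph name and the source URL are all hypothetical, and the use of dcterms:source to point at the supporting document is my own assumption. It does no escaping of names, so treat it as an illustration rather than a serializer.

```python
# Namespace URIs for FOAF and the Relationship vocabulary, plus Dublin Core
# terms (dcterms is an assumption of this sketch, not something from the talk).
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rel": "http://purl.org/vocab/relationship/",
    "dcterms": "http://purl.org/dc/terms/",
    "ex": "http://example.org/tree/",  # hypothetical namespace for one researcher's tree
}


def sourced_assertion(graph_name, child, child_name, parent, parent_name, source_url):
    """Return TriG text: one named graph holding a parent/child claim, and a
    default-graph statement that cites the document the claim is based on."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines += [
        "",
        "# The claim lives in its own named graph so it can be cited as a unit.",
        f"ex:{graph_name} {{",
        f'    ex:{child} a foaf:Person ; foaf:name "{child_name}" ; rel:childOf ex:{parent} .',
        f'    ex:{parent} a foaf:Person ; foaf:name "{parent_name}" .',
        "}",
        "",
        "# Default graph: the document that supports the claim.",
        "{",
        f"    ex:{graph_name} dcterms:source <{source_url}> .",
        "}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical people and a hypothetical census page, for illustration only.
print(sourced_assertion(
    "claim1", "john_doe", "John Doe", "mary_doe", "Mary Doe",
    "http://example.org/records/census-1880-page-12",
))
```

Because the claim sits in its own graph, someone merging this tree into theirs could keep or discard the whole assertion based on what they think of the source.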

For me, the beautiful thing about RDF is that it’s possible to find and use these vocabularies in concert, and I’m not tempted to create the-greatest-genealogy-vocabulary that does it all. In addition, Brian pointed out that sites like DBpedia and GeoNames are great sources of names (URIs) for people, places and events that can be used in building descriptions. Brian has started the History and Genealogy Semantics Working Group, which has an open membership, and he encourages anyone with an interest in this area to join. While writing this post I happened to run across a Wikipedia page about Family Tree Mapping, which indicated that some genealogical software already supports geocoding family trees. As usual, it seems like the geo community is leading the way in making semantics on the web down to earth and practical.

I followed Brian by giving a brief talk about Chronicling America, which is the web front-end for data collected by the National Digital Newspaper Program, which in turn is a joint project of the Library of Congress and the National Endowment for the Humanities. After giving a brief overview of the program, I described how we were naturally led to using Linked Data and embracing a generally RESTful approach by a few factors:

a need to create persistent Cool URIs for newspaper titles, issues and pages so that people could reference them.
the desire to make views available for the institutions around the United States that supply us with data
the need to make our data available as a participant in the Digging Into Data Challenge
a desire to kick the tires on the relatively new Open Archives Initiative Object Reuse and Exchange vocabulary for describing aggregations of resources on the Web so that they can be meaningfully harvested
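As a rough illustration of the first point, here is a small Python sketch of hierarchical URIs for titles, issues and pages. The path layout (lccn, date, edition, sequence) is an assumption made for this example, and the LCCN is a made-up placeholder, so check the API document before relying on any of it.

```python
from datetime import date

BASE = "http://chroniclingamerica.loc.gov"


def title_uri(lccn):
    """URI for a newspaper title, keyed on its LCCN (assumed layout)."""
    return f"{BASE}/lccn/{lccn}/"


def issue_uri(lccn, issued, edition=1):
    """URI for one issue: the title URI plus the issue date and edition."""
    return f"{title_uri(lccn)}{issued.isoformat()}/ed-{edition}/"


def page_uri(lccn, issued, edition=1, sequence=1):
    """URI for one page: the issue URI plus the page's sequence number."""
    return f"{issue_uri(lccn, issued, edition)}seq-{sequence}/"


# "sn00000000" is a placeholder, not a real title.
print(page_uri("sn00000000", date(1905, 1, 1)))
# prints http://chroniclingamerica.loc.gov/lccn/sn00000000/1905-01-01/ed-1/seq-1/
```

The point of nesting the functions is that each URI is built only from identifiers that should not change over time, and trimming a page URI from the right gives you the issue and then the title.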

One thing that I learned during Brian’s presentation is that sites like Footnote are not only going around digitizing historic collections for inclusion in their service, but they also give their subscribers a rich editing environment to search and annotate document text. These annotations are exactly the sort of thing that would be perfect to represent as an RDF graph, if you wanted to serialize the data. In fact the NSF-funded Open Annotation Collaboration project is exploring patterns and emerging best practices in this area. I’ve had it in the back of my mind that allowing users to annotate page content in Chronicling America would be a really nice feature to have. If not at chroniclingamerica.loc.gov proper, then perhaps we could show how it could be done by a third party using the API. To some extent we’re already seeing annotation happening in Wikipedia, where people are creating links to newspaper pages and titles in their entries, which we can see in the referrer information in our web server logs. Update: I just learned that Wikipedia itself provides a service that allows you to discover entries that have outbound links to a particular site, like chroniclingamerica.loc.gov.
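For the curious, here is a sketch of how a lookup for outbound links might be scripted. It assumes the MediaWiki API's exturlusage list on English Wikipedia, which may or may not be the service mentioned in the update, and it only builds the request URL rather than fetching anything.

```python
from urllib.parse import urlencode

# Assumed endpoint: the English Wikipedia MediaWiki API.
API = "https://en.wikipedia.org/w/api.php"


def external_link_query(site, limit=50):
    """Build a query URL asking which pages link out to the given site."""
    params = {
        "action": "query",
        "list": "exturlusage",  # pages that contain a matching external URL
        "euquery": site,        # the site to look for
        "eulimit": limit,       # maximum results per request
        "format": "json",
    }
    return API + "?" + urlencode(params)


print(external_link_query("chroniclingamerica.loc.gov"))
```

Fetching that URL should return a JSON list of pages, which could then be matched against the referrers in the server logs.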

Speaking of the Chronicling America API (which really is just REST), if you are interested in learning more about it, check out the API Document that Dan Chudnov prepared. I also made my slides available. Hopefully the speaker notes provide a bit more context for what I talked about when showing images of various things.

Afterwards a bunch of us headed across the street to have a drink. I was really interested to hear from Sam Deng that (like the group I work in at LC) PBS are a big Python and Django shop. We’re going to try to get a little brown bag lunch going between PBS and LC to talk about their use of Django on Amazon EC2, as well as software like Celery for managing asynchronous task queues.

Also, after chatting with Glenn Clatworthy of PBS, I learned that he has been experimenting with making Linked Data views available for their programs. It was great to hear Glenn describe how assigning each program a URI, and leveraging the nature of the web would make a perfect fit for distributing data in the PBS enterprise. It makes me think that perhaps having a session on what the BBC are doing with Linked Data would be timely?