level 0 linked archival data


Depósito del Archivo de la Fundación Sierra-Pambley (the repository of the archive of the Fundación Sierra-Pambley)

TL;DR: let’s see if we can share structured archival data better by adding HTML <link> elements that point at our EAD XML files.

A few weeks ago I attended a small meeting of DC museums, archives and libraries that were discussing what Linked Data means for Archives. Hillel Arnold and I took collaborative notes in Pirate Pad. For a good part of the time we went around the room talking about how we describe archival collections with various workflows using Encoded Archival Description (EAD), and how this was mostly working (or not).

Some good work has already been done imagining how Linked Data can transform archival description by the LOCAH project (now Linking Lives) as well as the Social Networks and Archival Context project. I think tools like Editors’ Notes, CWRC Writer, and Google’s Research Pane could provide really useful models for how the work of an archivist could benefit from linking to external resources such as Wikipedia, dbpedia, VIAF, etc. But we really didn’t talk about that in too much detail. The focus instead was on various tools people used in their EAD workflows: Archivists’ Toolkit, Oxygen, ExistDB, Access databases, etc … and the hope that ArchivesSpace could possibly improve matters. We did touch briefly on what it means to make finding aids available on the Web, but not in a very satisfactory way.

I was really struck by how everyone was using EAD, even if their tools were different. I was also left with the lingering suspicion that not much of this EAD data was linked to from the HTML presentation of the finding aid. After some conversations it was also my understanding that even after 20 years of work on EAD, there is not a listing of websites that make EAD finding aids available. It seems particularly sad that institutions have invested a lot of time and effort in putting EAD into practice, and yet we still aren’t really sharing them very well with each other.

So in a bit of a fit of frustration I did some hacking to see if I could use Google and ArchiveGrid to identify websites that serve up finding aids either as HTML or as EAD XML. I wanted to:

  1. Get a list of websites that made HTML and EAD XML finding aids available. We can rely on Google to index the Web, but maybe we could index the archival web a bit better ourselves if we had a better understanding of where the EAD data was available. The idea is that this initial list could be used to bootstrap a list of websites making EAD finding aids available in the Wikipedia entry for EAD.
  2. See which websites have HTML representations that link to an EAD XML representation. The rationale here is to encourage a very simple best practice for linking to structured archival data when it is available. More on that below.

I was able to identify 201 hosts that served up finding aids either as HTML or XML. You should be able to see them here in this spreadsheet. I also collected URLs for finding aids (both HTML and XML) that I was able to locate, which can be seen in this JSON file.

With the URLs in hand I wrote a little script to examine which of the 156 hosts serving up HTML representations of finding aids had a link to an XML EAD document. I looked for a very simple kind of link that was popularized by the RSS and Atom syndication community for autodiscovery of blog feeds: a <link> tag with a rel attribute of alternate and a type attribute set to application/xml. Out of the 156 websites serving up HTML representations of finding aids I could only find two websites that used this link pattern: Princeton University and Emory University.

For example, if you view the HTML source for the Einstein Collection finding aid at Princeton, or the finding aid for the Salman Rushdie collection at Emory University, you will see a link of this kind in the head of the document.
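
It looks roughly like this (the href below is illustrative rather than either site’s actual URL):

  <link rel="alternate" type="application/xml" href="http://findingaids.example.edu/ead/einstein.xml" />
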
As the title of this blog post suggests, I’m calling this pattern level 0 linked data. Linked Data purists would probably say this isn’t Linked Data at all since it doesn’t involve an RDF serialization. And I guess they would be right. But it does express a graph of HTML and EAD data that is linked, and it serves a real need. If you are interested in Linked Data and archives I encourage you to add these links to your HTML finding aids today.
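
Checking a finding aid for the link is also trivial to automate. Here is a rough sketch of the sort of test my script ran against each host (this isn’t the actual script, and it assumes the requests and BeautifulSoup libraries):

  import sys
  from urlparse import urljoin

  import requests
  from bs4 import BeautifulSoup

  def find_ead_link(finding_aid_url):
      """Return the href of an alternate application/xml <link>, if there is one."""
      html = requests.get(finding_aid_url).text
      soup = BeautifulSoup(html, "html.parser")
      for link in soup.find_all("link"):
          rel = [r.lower() for r in link.get("rel", [])]
          if "alternate" in rel and link.get("type") == "application/xml":
              # resolve relative hrefs against the finding aid URL
              return urljoin(finding_aid_url, link.get("href"))
      return None

  if __name__ == "__main__":
      url = sys.argv[1]
      print find_ead_link(url) or "no EAD <link> found"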

So why are these links important?

The main reason is they are found in HTML documents, which are the representations that matter most on the Web. HTML documents are read by people. They are hypertext documents that link to and from other places on an archives website and elsewhere on the Web at large. They are well understood technically by the Web development community…if you hire a developer they might have strong feelings about using PHP or Ruby, but they will know HTML backwards and forwards. They are crawled and indexed by search engine bots so that researchers around the world can discover our collections. They are cited in social environments like Twitter, Facebook, blog posts, etc. We have a responsibility to create stable homes (URLs) for our archival descriptions that fit into the Web.

The other reason these links are important is that they make our investment in EAD visible on the Web for anyone who is looking. Nobody but ArchiveGrid actively crawls EAD XML data, and they are the only ones that can find it, because they have been told where it is. If we did a better job of advertising the availability of our EAD documents I think we would see more tools and services around them. ArchiveGrid is a good example of the sort of tool that could be built on top of a web of EAD data. But what about archival collections in your local area? Perhaps it would be useful to have a service that lets you look across the archival holdings of institutions in a consortium you belong to. Or perhaps you might want to create an alerting service that lets researchers know what new archival collections are being made available. Or maybe you need to collaborate with archives in a specific domain, and need tools that provide a custom experience for that distributed collection. I imagine there would be lots of ideas for apps if there were just a teensy bit more thought put into how finding aids (both the HTML and the XML) are put on the Web, and how we share information about their availability.

Going forward I think HTML5 microdata and RDFa present some excellent opportunities for Linked Data representations of finding aids, especially when you consider some of the vocabulary development being done around them, as well as some of the work being done by Tim Sherratt on using linked data to create new user experiences around archival data. But if your institution has already invested in creating EAD documents I think trying this link pattern with data you already have could be a good first step towards introducing linked data into your archive. I hope it is a first baby step that archives can take in merging some of the structured data found in the EAD XML document into the HTML they publish about their collections.

I’m planning on getting the list of EAD publishers into the Wikipedia article for EAD, and putting out a call for others to add their website if it is missing. I also think that a simple crawling and aggregation service that uses the links in some fashion could encourage more linking. A lot of this blog post has been mental preparation for my involvement in an IMLS-funded project run out of Tufts that will be looking at Linked Archival Metadata, which is about to be kicked off this winter. If you’ve read this far and have any thoughts or suggestions about this, I’d enjoy hearing them either here, on Twitter or via email.


who creates the LCNAF (part 2)

I ended my A Look at Who Creates the LCNAF post with a hunch that the Library of Congress Name Authority File is increasingly supported by participants in the Name Authority Cooperative (NACO) rather than by the Library of Congress itself. It didn’t occur to me until a few days later that I missed a pretty obvious opportunity to graph the number of records created by LC compared with all the other members of the collective. So, here it is:

It looks like this has been a trend since about 1996 or so. I think it validates the cooperative aspect of the PCC and NACO. Not that it needs any validating. It’s just nice to see libraries and librarians working together to build something. I guess the name Library of Congress Name Authority File is also increasingly ironic…

Update: thanks to Kevin Ford (who emailed me privately) it seems that LC has been quite aware of this trend, and highlighted the event in 1996 when NACO members began contributing more records than LC with a press release.


Always Already New

Always Already New: Media, History, and the Data of Culture by Lisa Gitelman
My rating: 3 of 5 stars

I enjoyed this book, mainly for the author’s technique of exploring what media means in our culture by using two examples, separated in time: the phonograph and the Internet. She admits that in some ways this amounts to comparing apples to oranges, and there is definitely a creative tension in the book. Gitelman’s emphasis is not that media technologies change society and culture, but that a technology is introduced and is in turn shaped by its particular social and historical context, which then reshapes society and culture.

I define media as socially realized structures of communication, where structures include both technological forms and their associated protocols, and where communication is a cultural practice, a ritualized collocation of different people on the same mental map, sharing or engaged with popular ontologies of representation. As such, media are unique and complicated historical subjects.

It’s tempting to talk about media technologies as if their ultimate use is somehow inevitable. For example, Gitelman discusses how the initial commercial placement of the phonograph centered largely around the idea that it would transform dictation and the office. Early demonstrations intended to increase sales of the device focused on recording and playback, rather than simply playback. They didn’t initially see the market for recorded music, which would so transform the device. To some extent we’ve cynically come to expect this out of marketing and “evangelism” about media technologies all the time. But this mode of thinking is also present in purely technical discussions, which don’t account for the placement of the technology in a particular social context.

Getting a sense of the social context you are in the middle of, as opposed to one you are historically removed from, presents some challenges. I think this difficulty is more evident in the second part of the book which focuses on the Internet and the World Wide Web against a backdrop of libraries and bibliography. Like many others I imagine, my knowledge of JCR Licklider’s influence on the development of ARPAnet and the Internet was largely culled from Where Wizards Stay Up Late. I had no idea, until reading Always Already New, that Licklider contracted with the Council on Library Resources (now Council on Library and Information Resources) to write a report Libraries of the Future on the topic of how computing would change libraries.

I enjoyed the discussion of the role that the Request for Comments (RFC) series played on the Internet, and how these documents, initially shared via the post, helped bootstrap the technologies that would create the Internet that allowed them to be shared as electronic documents or text. I didn’t know about the RFC-Online project that Jon Postel started right before his death, to recover the earliest RFCs that had already been lost. Gitelman’s study of linking, citation and “publishing” on the Web was also really enjoyable, mainly because of her orientation to these topics:

I will argue that far from making history impossible, the interpretive space of the World Wide Web can prompt history in exciting new ways.

All this being said, I finished the book with the sneaking feeling that I needed to reread it. Gitelman’s thesis was subtle enough that it was only when I got to the end that I felt like I understood it: the strange loop that thinking and media participate in, and how difficult (and yet fruitful) it is to talk about media and their social context. Maybe this was also partly the effect of reading it on a Kindle :-)

View all my reviews


learning from people that do

Anil Dash recently wrote a nice piece about the need for what he calls a Hi-Tech Vo-Tech in the technology sector. If you are not familiar with it already, Vo-Tech is shorthand in the US for a vocational-technical school, which provides focused training in specific areas, often on a part-time basis. The Vo-Tech experience is markedly different from the typical 4-year university experience, which tends to be focused more on theory than practice.

I totally agree.

But if you are looking to work as a software developer, and to help build this amazing information space we call the World Wide Web, you don’t need to wait for this dream of a better high school curriculum for computer programming, or Hi-Tech Vo-Techs to come to your town. I don’t want to minimize the effort involved in finding your way into the workplace…it’s hard, especially when there is competition from “qualified” candidates, and the skill sets seem to be constantly shifting. But here are some relatively simple steps you can take to get started.

Look at Job Ads

Go to the Craigslist for your area, look at what jobs are available under the internet engineers and software / qa / dba sections. I suggest Craigslist because of their local flavor, and the low cost to advertise, which typically means the jobs are at smaller companies who are less interested in finding someone with the right college degree, and more interested in finding someone who can get things done. Look for jobs that focus on what you can do rather than schooling. Don’t apply for any of the jobs just yet. Note down the tools they want people to know: computer languages, operating systems, web frameworks, etc. Research them on Wikipedia. Focus on tools that seem to pop up a lot, are opensource, and can be downloaded and used without cost. You don’t need to do anything with them just yet though.

Go To User Group Meetings

I say opensource because opensource tools often have open communities around them. You should be able to find user groups in your area where people present on how they use these tools at their place of work. You might have to drive a while, or take a long bus/train ride – but it’s worth it. To find the meetings do some searches by technology and location on Meetup. Alternatively you can Google for whatever the technology is + “user group” + your area (e.g. Philadelphia) and go through a few pages of results. At a user group meeting you will not only learn about the details of the technology, but you will meet actual, real people who are using it. There are often subtle differences in the cultures and communities of practice around software tools. Some user groups will feel more comfortable than others. Pay attention to your gut reactions–they are indicators of how much you would like a job working with the technology, and the people who like it. If you get a bad vibe, don’t take it personally, try another meeting. Finding a job is often a matter of who you know, not what you know … and user groups are a great place to get to know people working in the software development field. There’s no online substitute for meeting people in real life.

Use Social Networks

At user group meetings you meet people who you can learn from. See if they have a blog, are on Twitter or Facebook. Maybe they use a social bookmarking tool you can follow. Or perhaps there are email discussion lists you can subscribe to. It’s not stalking, these people are your mentors, learn from them. Take a dip into sites like Hacker News or Programming Reddit. Watch the trends, you aren’t being a fanboy/girl, you are learning about what people care about in the field. Don’t feel bad if it’s overwhelming (it’s overwhelming to “experts” too), focus on what seems interesting. Also, cultivate your own online identity by posting stuff that you are interested in, or have questions about. Stay positive, and try not to bash things: people (and potential employers) are watching you the same way you are watching them.

Read

Sometimes the speakers at User Group meetings will also be authors of books. You will see books reviewed on sites like Hacker News. People you follow may mention the books they read, or have accounts on sites like GoodReads. See if a library or a bookstore has them, and go skim them. Buy or borrow the ones you like. Take notes about them online, so people can see your interests. Get a Google Reader account and follow blogs related to tools you would like to use. Look for tools that have approachable/readable tutorials. Try out the examples, and get a feel for how well the theory of the tutorial translates into practice. If tools don’t install or seem to work the way they are described, don’t feel like you did something wrong…move on to tools that work more smoothly, and fit your brain better. The benefit to focusing on opensource projects is that you will find more content about them online. You can read code. Reading the source code for Ruby or GoLang is definitely not for the faint of heart, though it’s nice you can do it. It’s more important that you look at code that uses these tools. Go to GitHub and see what projects there are that use the tool. Browse the source online, or clone the repositories to your workstation. See if you can help out with some low hanging fruit tasks in their issue queue.

Find a Niche

You are probably interested in things other than programming. For example I like libraries and archives, and the cultural heritage sector. I’ve found a virtual community of software developers in this area called code4lib, which helps me learn more about new projects, tools in the field, and is a way to get to know people. You may be surprised to find a similar community around something you are interested in: be it astrophysics, cartoons, music, maps, real estate, etc. If you don’t find one, maybe think about starting one up–you might be surprised by how many people turn up. Sometimes there are collaborative projects that need your help, like Wikipedia or OpenStreetMap, where the ability to automate mundane tasks is needed. You might not get paid for this work, but it will broaden your circle of contacts, deepen your technical skills, build your self-confidence, and be something to put on your resume. The key thing that finding a niche can do is make your job search a bit easier, since technology skills cut across domains. You will also find that your niche has a particular set of tools that it likes to use. These typically aren’t hard and fast rules about using X instead of Y, but are norms. Pay attention to them, and learn about things that interest you.

Be Confident

I don’t mean to imply any of this is easy. It can be extremely difficult to get out of your comfort zone and explore things you don’t know. But you will be rewarded for your efforts, by learning from people who actually do things in the world. I’ve worked with some really excellent software developers that didn’t have a compsci degree, and some that I wasn’t even sure if they graduated high school. Sometimes I wonder if I even graduated from high school. So be confident in your ability to learn and do this thing we call software development. Show that you are humble about what you don’t know, and that you are hungry to learn it. Above all, don’t buy into the cult of the “real programmer” … she doesn’t exist. There are just people to learn from, and if you are doing it right, you never stop learning.


a look at who makes the LCNAF

As a follow-up to my last post about visualizing Library of Congress Name Authority File (LCNAF) records created by year, I decided to dig a little bit deeper to see how easy it would be to visualize how participating Name Authority Cooperative institutions have contributed to the LCNAF over time. This idea was mostly born out of spending the latter part of last week participating in a conversation about the need for a National Archival Authority Cooperative hosted at NARA. This blog post is one part nerdy technical notes on how I worked with the LCNAF Linked Data, and one part line charts showing who creates and modifies LCNAF records. It might’ve made more sense to start with the pretty charts, and then show you how I did it…but if the tech details don’t interest you, you can jump to the second half.

The Work

After a very helpful Twitter conversation with Kevin Ford I discovered that the Linked Data MADSRDF representation of the LCNAF includes assertions about the institution responsible for creating or revising a record. Here’s a snippet of Turtle RDF that describes who created and modified the LCNAF record for J. K. Rowling (if your eyes glaze over when you see RDF, don’t worry, keep reading; it’s not essential that you understand this):

@prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .
@prefix ri: <http://id.loc.gov/ontologies/RecordInfo#> .

<http://id.loc.gov/authorities/names/n97108433>
    madsrdf:adminMetadata [
        ri:recordChangeDate "1997-10-28T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ri:recordContentSource <http://id.loc.gov/vocabulary/organizations/dlc> ;
        ri:recordStatus "new"^^<http://www.w3.org/2001/XMLSchema#string> ;
        a ri:RecordInfo
    ],
    [
        ri:recordChangeDate "2011-08-25T06:29:06"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ri:recordContentSource <http://id.loc.gov/vocabulary/organizations/dlc> ;
        ri:recordStatus "revised"^^<http://www.w3.org/2001/XMLSchema#string> ;
        a ri:RecordInfo
    ] .

So I picked up an EC2 m1.large spot instance (7.5G of RAM, 2 virtual cores, 850G of storage) for a miserly $0.026/hour, installed 4store (which is a triplestore I’d heard good things about), and loaded the data.

% wget http://id.loc.gov/static/data/authoritiesnames.nt.madsrdf.gz
% gunzip authoritiesnames.nt.madsrdf.gz
% sudo apt-get install 4store
% sudo mkdir /mnt/4store
% sudo chown fourstore:fourstore /mnt/4store
% sudo ln -s /mnt/4store /var/lib/4store
% sudo -u fourstore 4s-backend-setup lcnaf --segments 4
% sudo -u fourstore 4s-backend lcnaf
% sudo -u fourstore 4s-import --verbose lcnaf authoritiesnames.nt.madsrdf

I used 4 segments as a best guess to match the 4 EC2 compute units available to an m1.large. The only trouble was that after loading 90M of the 226M assertions it began to slow to a crawl as the memory was about used up.

I thought briefly about upgrading to a larger instance…but it occurred to me that I actually didn’t need all the triples. I just needed the ones related to the record changes, and the organizations that made them. So I filtered the dump down to just the assertions I needed. By the way, this is a really nice property of the ntriples data format, which is very easy to munge with line-oriented Unix utilities and scripting tools:

zcat authoritiesnames.nt.madsrdf.gz | egrep '(recordChangeDate)|(recordContentSource)|(recordStatus)'  > updates.nt

This left me with 50,313,810 triples which loaded in about 20 minutes! With the database populated I was then able to execute the following query to fetch all the create dates with their institution code using 4s-query:

PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>

SELECT ?date ?source WHERE { 
  ?s ri:recordChangeDate ?date . 
  ?s ri:recordContentSource ?source . 
  ?s ri:recordStatus "new"^^<http://www.w3.org/2001/XMLSchema#string> . 
}

This returned a tab delimited file that looked something like:

"1991-08-16T00:00:00"^^>http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/dlc>
"1995-01-07T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/djbf>
"2004-03-04T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>      <http://id.loc.gov/vocabulary/organizations/nic>

I then wrote a simplistic python program to read in the TSV file and output a table of data where each row represented a year and the columns were the institution codes.
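
The program (the stats.py mentioned below) is really just a pivot of that output. Here is a sketch of it, assuming the 4s-query results were saved to a file called results.tsv (not necessarily the exact code I ran):

  import csv
  import re
  import sys
  from collections import defaultdict

  counts = defaultdict(lambda: defaultdict(int))   # year -> org code -> count
  orgs = set()

  for line in open("results.tsv"):
      # e.g. "1995-01-07T00:00:00"^^<...#dateTime>  <http://id.loc.gov/vocabulary/organizations/djbf>
      m = re.search(r'"(\d{4})-.*organizations/([^>]+)>', line)
      if not m:
          continue
      year, org = m.groups()
      counts[year][org] += 1
      orgs.add(org)

  orgs = sorted(orgs)
  out = csv.writer(sys.stdout)
  out.writerow(["year"] + orgs)
  for year in sorted(counts):
      out.writerow([year] + [counts[year][org] for org in orgs])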

The Result

If you’d like to see the table you can check it out as a Google Fusion Table. If you are interested, you should be able to easily pull the data out into your own table, modify it, and visualize it. Google Fusion Tables can be rendered really easily in a variety of ways, including as a line graph, which I’ve embedded here, displaying just the top 25 contributors:

While I didn’t quite expect to see LC tapering off the way it is, I did expect it to dominate the graph. Removing LC from the mix makes the graph a little bit more interesting. For example you can see the steady climb of the British Library, and the strong role that Princeton University plays:

Out of curiosity I then executed a SPARQL query for record updates (or revisions), repeated the step with stats.py, uploaded to Google Fusion Tables, and removed LC to better see trends in who is updating records:

PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>

SELECT ?date ?source WHERE { 
  ?s ri:recordChangeDate ?date . 
  ?s ri:recordContentSource ?source . 
  ?s ri:recordStatus "revised"^^<http://www.w3.org/2001/XMLSchema#string> . 
}

I definitely never understood what Twin Peaks was about, and I similarly don’t really know what the twin peaks in this graph signify (2000 and 2008). I guess these were years where there were a lot of coordinated edits? Perhaps some NACO folks who have been around for a few years may know the answer. You can also see in this graph that Princeton University plays a strong role in updating records as well as creating them.

So I’m not sure I understand the how/when/why of an NAAC any better, but I did learn:

  • EC2 is a big win for quick data munging projects like this. I spent $0.98 with the instance up and running for 3 days.
  • Filtering ntriples files down to what you actually need prior to loading into a triplestore can save time and money.
  • Working with ntriples is still pretty esoteric, and the options out there for processing a dump of ntriples (or rdf/xml) of LCNAF’s size are truly slim. If I’m wrong about this I would like to be corrected.
  • Google Fusion Tables are a nice way to share data and charts.
  • It seems like while more LCNAF records are being created per year, they are being created by a broader base of institutions instead of just LC (who appear to be in decline). I think this is a good sign for NAAC.
  • Open Data, and Open Data Curators (thanks Kevin) are essential to open, collaborative enterprises.

Now I could’ve made some hideous mistakes here, so in the unlikely event you have the time and inclination I would be interested to hear if you can reproduce these results. If the results confirm or disagree with other views of LCNAF participation I would be interested to see them.


lcnaf unix hack

I was in a meeting today listening to a presentation about the Library of Congress Name Authority File and I got it into my head to see if I could quickly graph record creation by year. Part of this might’ve been prompted by sitting next to Kevin Ford, who was multi-tasking, by the looks of it loading some MARC data into id.loc.gov. I imagine this isn’t perfect, but I thought it was a fun hack that demonstrates what you can get away with on the command line with some open data:

  curl http://id.loc.gov/static/data/authoritiesnames.nt.skos.gz \
    | zcat - \
    | perl -ne '/terms\/created> "(\d{4})-\d{2}-\d{2}/; print "$1\n" if $1;' \
    | sort \
    | uniq -c \
    | perl -ne 'chomp; @cols = split / +/; print "$cols[2]\t$cols[1]\n";' \
    > lcnaf-years.tsv

Which yields a tab delimited file where column 1 is the year and column 2 is the number of records created in that year. The key part is the perl one-liner on line 3 which looks for assertions like this in the ntriples rdf, and pulls out the year:

<http://id.loc.gov/authorities/names/n90608287> <http://purl.org/dc/terms/created> "1990-02-05T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

The use of sort and uniq -c together is a handy trick my old boss Fred Lindberg taught me, for quickly generating aggregate counts from a stream of values. It works surprisingly well with quite large sets of values, because of all the work that has gone into making sort efficient.
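
If you’d rather do the same aggregation in Python than shell, it is only a few lines. This is a sketch that produces the same year/count pairs, not something I actually ran:

  import gzip
  import re
  from collections import Counter

  counts = Counter()
  for line in gzip.open("authoritiesnames.nt.skos.gz"):
      m = re.search(r'terms/created> "(\d{4})-\d{2}-\d{2}', line)
      if m:
          counts[m.group(1)] += 1

  for year in sorted(counts):
      print "%s\t%s" % (year, counts[year])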

With the tsv in hand I trimmed the pre-1980 values, since I think there are lots of records attributed to 1980 because that’s when the OPAC came online, and I wasn’t sure what the dribs and drabs prior to 1980 represented. Then I dropped the data into ye olde chart maker (in this case Google Docs) and voilà:

It would be more interesting to see the results broken out by contributing NACO institution, but I don’t think that data is in the various RDF representations. I don’t even know if the records contributed by other NACO institutions are included in the LCNAF. I imagine a similar graph is available somewhere else, but it was neat that the availability of the LCNAF data meant I could get a rough answer to this passing question fairly quickly.

The numbers add up to ~7.8 million, which seems within the realm of possible correctness. But if you notice something profoundly wrong with this display please let me know!


data dumps

As usual, the following comments are the reflections of a software developer working at the Library of Congress and are not an official statement of my employer.

One of the challenges that we’ve had at the National Digital Newspaper Program’s website Chronicling America has been access to data. At the surface level Chronicling America is a conventional web application that provides access to millions of pages of historic newspapers. Here “access” means a researcher’s ability to browse to each newspaper, issue and page, as well as search across the OCR text for each page.

Digging a bit deeper, “access” also means programmatic access via a Web API. Chronicling America’s API enables custom software to issue queries using the popular OpenSearch protocol, and it also makes URL addressable data available using principles of Linked Data. In addition, the website makes the so-called “batch” data that each NDNP awardee sends to the Library of Congress available on the Web. The advantage of making the batch data available is that 3rd parties are then able to build their own custom search indexes on top of the data, so their own products and services don’t have a runtime dependency on our Web API. Researchers can also choose to index things differently, perform text mining operations, or conduct other experiments. Each batch contains JPEG 2000, PDF, OCR XML and METS XML data for all the newspaper content; and it is in fact the very same data that the Chronicling America web application ingests. The batch data views make it possible for interested parties to crawl the content using wget or some similar tool that talks HTTP, and fetch a lot of newspaper data.

But partly because of NDNP’s participation in the NEH’s Digging Into Data program, as well as interest from other individuals and organizations, we’ve recently started making data dumps of the OCR content available. This same OCR data is available as part of the batch data mentioned above, but the dumps provide two new things:

  1. The ability to download a small set of large compressed files with checksums to verify their transfer, as opposed to having to issue HTTP GETs for millions of uncompressed files with no verification.
  2. The ability to easily map each of the OCR files to their corresponding URL on the web. While it is theoretically possible to extract the right bits from the METS XML data in the batch data, the best expression of how to do this is encapsulated in the Chronicling America ingest code, and is non-trivial.

So when you download, decompress and untar one of the files you will end up with a directory structure like this:

sn86063381/
|-- 1908
|   |-- 01
|   |   |-- 01
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml
|   |   |-- 02
|   |   |   `-- ed-1
|   |   |       |-- seq-1
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-2
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       |-- seq-3
|   |   |       |   |-- ocr.txt
|   |   |       |   `-- ocr.xml
|   |   |       `-- seq-4
|   |   |           |-- ocr.txt
|   |   |           `-- ocr.xml

...

The pattern here is:

{lccn}/{year}/{month}/{day}/{edition}/{sequence}/

If you don’t work in a library, an lccn is a Library of Congress Control Number, which is a unique ID for each newspaper title. Each archive file will lay out in a similar way, such that you can process each .tar.bz2 file and will end up with a complete snapshot of the OCR data on your filesystem. The pattern maps pretty easily to URLs of the format:

http://chroniclingamerica.loc.gov/lccn/{lccn}/{year}-{month}-{day}/{edition}/{sequence}/

This is an obvious use case for a pattern like PairTree, but there was some perceived elegance to using paths that were a bit more human readable, and easier on the filesystem, which stands a good chance of not being ZFS.
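
To make the mapping concrete, here is a small sketch of it in Python, following the patterns above (this is not code that ships with the dumps):

  def ocr_url(path):
      """Map a dump path like sn86063381/1908/01/01/ed-1/seq-1/ocr.txt to the
      corresponding page URL on chroniclingamerica.loc.gov."""
      lccn, year, month, day, edition, sequence = path.strip("/").split("/")[:6]
      return "http://chroniclingamerica.loc.gov/lccn/%s/%s-%s-%s/%s/%s/" % (
          lccn, year, month, day, edition, sequence)

  # ocr_url("sn86063381/1908/01/01/ed-1/seq-1/ocr.txt")
  # => "http://chroniclingamerica.loc.gov/lccn/sn86063381/1908-01-01/ed-1/seq-1/"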

Another side effect of having a discrete set of files to download is that each dump file can be referenced in an Atom feed, so that you can keep your snapshot up to date with a little bit of automation. Here’s a snippet of the feed:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Chronicling America OCR Data Feed</title>
    <id>info:lc/ndnp/ocr</id>
    <author>
        <name>Library of Congress</name>
        <uri>http://loc.gov</uri>
    </author>
    <updated>2012-09-20T10:34:02-04:00</updated>

    <entry>
        <title>part-000292.tar.bz2</title>
        <link rel="enclosure" length="..." hash="sha1:..." href="http://chroniclingamerica.loc.gov/data/ocr/part-000292.tar.bz2"/>
        <id>info:lc/ndnp/dump/ocr/part-000292.tar.bz2</id>
        <updated>2012-09-20T10:34:02-04:00</updated>
        <summary>OCR dump file part-000292.tar.bz2 with size 620.1 MB generated Sept. 20, 2012, 10:34 a.m.</summary>
    </entry>
...

As you can see it’s a pretty vanilla Atom feed that should play nicely with whatever feed reader or library you are using. You may notice the <link> element has some attributes that you might not be used to seeing. The enclosure rel value and the length attribute come directly from RFC 4287, and give clients an idea that the referenced resource might be on the large side. The hash attribute is a generally useful attribute from James Snell’s Atom Link Extensions IETF draft.

If parsing XML is against your religion, there’s also a JSON flavored feed that looks like:

{
  "ocr": [
    {
      "url": "http://chroniclingamerica.loc.gov/data/ocr/part-000337.tar.bz2",
      "sha1": "fd73d8e1df33015e06739c897bd9c08a48294f82",
      "size": 283454353,
      "name": "part-000337.tar.bz2",
      "created": "2012-09-21T06:56:35-04:00"
    },
    ...
  ]
}
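
Here is a sketch of the kind of automation this enables: fetch the feed, download any parts you don’t have yet, and verify the checksums. The feed URL below is my assumption about where the JSON feed lives, so adjust as needed, and run it from the directory where you keep the dumps:

  import hashlib
  import json
  import os
  import urllib

  # assumed location of the JSON feed; adjust if it lives elsewhere
  FEED = "http://chroniclingamerica.loc.gov/ocr.json"

  feed = json.load(urllib.urlopen(FEED))
  for part in feed["ocr"]:
      if os.path.exists(part["name"]):
          continue
      print "downloading %s" % part["url"]
      urllib.urlretrieve(part["url"], part["name"])

      # verify the download against the published sha1 checksum
      sha1 = hashlib.sha1()
      with open(part["name"], "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), ""):
              sha1.update(chunk)
      if sha1.hexdigest() != part["sha1"]:
          print "checksum mismatch for %s" % part["name"]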

Again, I guess we could’ve kicked the tires on the emerging ResourceSync specification to similar effect. But ResourceSync is definitely still in development, and, well, Atom is a pretty nice Internet standard for publishing changes. Syndication technologies like RSS and Atom have already been used by folks like Wikipedia for publishing the availability of data dumps. ResourceSync seems intent on using Zip for compressing dump files, and bzip is common enough, and enough better than zip, that it’s worth diverging. In some ways this blog post has turned into a when-to-eschew-digital-library-standards post in favor of more mainstream or straightforward patterns. I didn’t actually plan that, but those of you that know me probably are not surprised.

If you plan to use the OCR dumps I, and others on the NDNP team, would love to hear from you. One of the big problems with them so far is that there is no explicit statement about how the data is in the public domain, which it is. I’m hopeful this can be rectified soon. If you have feedback on the use of Atom here I would be interested in that too. But the nice thing about using it is really how uncontroversial it is, so I doubt I’ll hear much feedback on that front.


archiving wikitweets

Earlier this year I created a little toy webapp called wikitweets that uses the Twitter streaming API to identify tweets that reference Wikipedia, which it then displays realtime in your browser. It was basically a fun experiment to kick the tires on NodeJS and SocketIO using a free, single process Heroku instance.

At the time I announced the app on the wiki-research-l discussion list to see if anyone was interested in it. Among the responses I received were ones from Emilio Rodríguez-Posada and Taha Yasseri, who asked whether the tweets were archived as they streamed by. This struck a chord with me, since I’m a software developer working in the field of “digital preservation”. You know that feeling when you suddenly see one of your huge gaping blindspots? Yeah.

Anyway, some 6 months or so later I finally got around to adding an archive function to wikitweets, and I thought it might be worth writing about very quickly. Wikitweets uses the S3 API at the Internet Archive to store the tweets in batches of 1,000. So you can visit this page at the Internet Archive and download the tweets. Now I don’t know how long the Internet Archive is going to be around, but I bet it will be longer than inkdroid.org, so it seemed like a logical (and free) safe harbor for the data.

In addition to being able to share the files, the Internet Archive also makes a BitTorrent seed available, so the data can easily be distributed around the Internet. For example you could open wikitweets_archive.torrent in your BitTorrent client and download a copy of the entire dataset, while providing a redundant copy. I don’t really expect this to happen much with the wikitweets collection, but it seems to be a practical offering in the Lots of Copies Keeps Stuff Safe category.

I tried to coerce several of the seemingly excellent s3 libraries for NodeJS to talk to the Internet Archive, but ended up writing my own very small library that works specifically with Internet Archive. ia.js is bundled as part of wikitweets, but I guess I could put it on npm if anyone is really interested. It gets used by wikitweets like this:

  var name = "20120919030946.json";

  var c = ia.createClient({
    accessKey: config.ia_access_key,
    secretKey: config.ia_secret_key,
    bucket: config.ia_bucket
  });

  c.addObject({name: name, value: tweets}, function() {
    console.log("archived " + name);
  });

The nice thing is that you can use s3 libraries that have support for the Internet Archive, like boto, to programmatically pull down the data. For example, here is a Python program that goes through each file and prints out the Wikipedia article title referenced by each tweet:

  import json
  import boto

  ia = boto.connect_ia()
  bucket = ia.get_bucket("wikitweets")

  for keyfile in bucket:
      content = keyfile.get_contents_as_string()
      for tweet in json.loads(content):
          print tweet['article']['title']

The archiving has only been running for the last 24 hours or so, so I imagine there will be tweaks that need to be made. I’m considering compression of the tweets as one of them. Also it might be nice to put the files in subdirectories, but it seemed that the Internet Archive’s API wanted to URL-encode object names that have slashes in them.

If you have any suggestions I’d love to hear them.


finding soundcloud users with lastfm

I stumbled upon the lovely Soundcloud API this weekend, and before I knew it I was hacking together something that would use the LastFM API to look up artists that I listen to, and then see if they are on Soundcloud. If you haven’t seen it before, Soundcloud is a social networking site for musicians and audiophiles to share tracks. Sometimes artists will share works in progress, which is really fascinating.

It’s kind of amazing what you can accomplish in just HTML and JavaScript these days. It sure makes it easy to deploy, which I did at http://inkdroid.org/lastcloud/. If you want to give it a try enter your LastFM username, or the username of someone you know, like mine: inkdroid. As you can see the hack sorta worked. I say sorta because there seem to be a fair amount of users who are squatting on names of musicians. There also seem to be accounts that are run by fans, pretending to be the artist. Below is a list of seemingly legit Soundcloud accounts I found, and have followed. If you have any ideas for improving the hack, I put the code up on GitHub.
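
Stripped to its essentials, the hack looks something like the sketch below (in Python rather than the JavaScript the app actually uses). The Last.fm method name and the Soundcloud search endpoint are as I remember them from the time and may well have changed since, and the API keys are placeholders:

  import json
  import urllib

  LASTFM_KEY = "YOUR_LASTFM_API_KEY"            # placeholder
  SOUNDCLOUD_ID = "YOUR_SOUNDCLOUD_CLIENT_ID"   # placeholder

  def top_artists(user):
      # Last.fm user.getTopArtists; response shape from memory
      url = ("http://ws.audioscrobbler.com/2.0/?method=user.gettopartists"
             "&user=%s&api_key=%s&format=json" % (user, LASTFM_KEY))
      data = json.load(urllib.urlopen(url))
      return [a["name"] for a in data["topartists"]["artist"]]

  def soundcloud_matches(artist):
      # naive exact-username match, which is why squatters and fan accounts slip through
      q = urllib.quote(artist.encode("utf-8"))
      url = "https://api.soundcloud.com/users.json?q=%s&client_id=%s" % (q, SOUNDCLOUD_ID)
      users = json.load(urllib.urlopen(url))
      return [u["permalink_url"] for u in users
              if u["username"].lower() == artist.lower()]

  for artist in top_artists("inkdroid"):
      for url in soundcloud_matches(artist):
          print "%s: %s" % (artist, url)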


fido test suite

I work in a digital preservation group at the Library of Congress where we do a significant amount of work in Python. Lately, I’ve been spending some time with OpenPlanets’ FIDO utility, mainly to see if I could refactor it so that it’s a bit easier to use as a Python module in other Python applications. At the moment FIDO is designed to be used from the command line. This work involved more than a little bit of refactoring, and the more I looked at the code, the more it became clear that a test suite would be useful to have as a safety net.

Conveniently, I also happened to have been reading a recent report from the National Library of Australia on File Characterization Tools, which, in addition to talking about FIDO, pointed me at the govdocs1 dataset. Govdocs1 is a dataset of 1 million files harvested from the .gov domain by the NSF-funded Digital Corpora project. The data was collected to serve as a public domain corpus for forensics tools to use as a test bed. I thought it might be useful to survey the filenames in the dataset, and cherry-pick out formats of particular types for use in my FIDO test suite.

So I wrote a little script that crawled all the filenames, and kept track of file extensions used. Here are the results:

extension count
pdf 232791
html 191409
jpg 109281
txt 84091
doc 80648
xls 66599
ppt 50257
xml 41994
gif 36301
ps 22129
csv 18396
gz 13870
log 10241
eps 5465
png 4125
swf 3691
pps 1629
kml 995
kmz 949
hlp 660
sql 632
dwf 474
java 323
pptx 219
tmp 196
docx 169
ttf 104
js 92
pub 76
bmp 75
xbm 51
xlsx 46
jar 34
zip 27
wp 17
sys 8
dll 7
exported 5
exe 5
tif 3
chp 2
pst 1
squeak 1
data 1
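
For what it’s worth, the tallying itself is trivial once you have the filenames. Here is a sketch, where filenames.txt is a stand-in for however you obtain the govdocs1 file listing (not quite what my crawler did):

  from collections import Counter

  counts = Counter()
  # filenames.txt: one govdocs1 filename per line, e.g. 000/000123.pdf
  for line in open("filenames.txt"):
      name = line.strip()
      if "." in name:
          counts[name.rsplit(".", 1)[-1].lower()] += 1

  for ext, count in counts.most_common():
      print "%s %s" % (ext, count)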

With this list in hand, I downloaded an example of each file extension, ran it through the current release of FIDO, and used the output to generate a test suite for my new refactored version. Interestingly, two tests fail:

======================================================================
FAIL: test_pst (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 244, in test_pst
    self.assertEqual(i.puid, "x-fmt/249")
AssertionError: 'x-fmt/248' != 'x-fmt/249'

======================================================================
FAIL: test_pub (test.FidoTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ed/Projects/fido/test.py", line 260, in test_pub
    self.assertEqual(i.puid, "x-fmt/257")
AssertionError: 'x-fmt/252' != 'x-fmt/257'

I’ll need to dig in to see what could be different between the two versions that would confuse x-fmt/248 with x-fmt/249 and x-fmt/252 with x-fmt/257. Perhaps it is related to Dave Tarrant’s recent post about how FIDO’s identification patterns have flip flopped in the past.

You may have noticed that I’m linking the PUIDs to Andy Jackson’s PRONOM Prototype Registry (built in 6 days with Drupal) instead of the official PRONOM registry. I did this because a Google search for the PRONOM identifier (PUID) pulled up a nice detail page for the format in Andy’s prototype, and it doesn’t seem possible (at least in the 5 minutes I tried) to link directly to a file format record in the official PRONOM registry. I briefly tried the Linked Data prototype, but it proved difficult to search for a given PUID (server errors, the unforgiving glare of SPARQL query textareas, etc).

I hope OpenPlanets and/or the National Archives give Andy’s Drupal experiment a fair shake. Getting a functional PRONOM registry running in 6 days with an opensource toolkit like Drupal definitely seems more future-proof than spending years with a contractor only to get closed source code. The Linked Data prototype looks promising, but as the recent final report on the Unified Digital Format Registry project highlights, choosing to build on a semantic web stack has its risks compared with more mainstream web publishing frameworks or content management systems like Drupal. PRONOM just needs an easy way for digital preservation practitioners to be able to collaboratively update the registry, and for each format to have a unique URL that uses the PUID. My only complaint is that Andy’s prototype seemed to advertise RDF/XML in the HTML, but it seemed to return an empty RDF document: for example, the HTML at http://beta.domd.info/pronom/x-fmt/248 has a <link> that points at http://beta.domd.info/node/1303/rdf.

I admit I am a fan of linked data, and of being able to get machine readable data back (RDFa, Microdata, JSON, RDF/XML, XML, etc.) from Cool URLs. But triplestores and SPARQL don’t seem to be terribly important things for PRONOM to have at this point. And if they are there under the covers, there’s no need to confront the digital preservation practitioner with them. My guess is that they want an application that lets them work with their peers to document file formats, not learn a new query or ontology language. Perhaps Jason Scott’s Just Solve the Problem effort in October will be a good kick in the pants to mobilize grassroots community work around digital formats.

Meanwhile, I’ve finished up the FIDO API changes and the test suite enough to have submitted a pull request to OpenPlanets. My fork of the OpenPlanets repository is similarly on GitHub. I’m not really holding my breath waiting for it to be accepted, as it represents a significant change, and they have their own published roadmap of work to do. But I am hopeful that they will recognize the value in having a test suite as a safety net as they change and refactor FIDO going forward. Otherwise I guess it could be the beginnings of a fido2, but I would like to avoid that particular future.

Update: after posting this Ross Spencer tweeted me some instructions for linking to PRONOM

https://twitter.com/beet_keeper/status/242515266146287616

Maybe I missed it, but PRONOM could use a page that describes this.