code4lib 2015 is about to kick off in Portland this morning. Unfortunately I couldn’t make it this year, but I’m looking forward to watching the livestream over the next few days. Thanks so much to the conference organizers for setting up the livestream. The schedule has the details about who is speaking when.

As a little gift to real and virtual conference goers (mostly myself) I quickly created a little web app that will watch the Twitter stream for #c4l15 tweets, and keep track of which URLs people are talking about. You can see it running here, at least while the conference is going on.

I’ve done this sort of thing in an ad hoc way with twarc and some scripts–mostly after (rather than during) an event. For example here’s a report of URLs mentioned during #dlfforum. But I wanted something a bit more dynamic. As usual the somewhat unkempt code is up on GitHub as a project named earls, in case you have ideas you’d like to try out.

#c4l15 urls, or earls

earls is a node app that listens to Twitter’s filter stream API for tweets mentioning #c4l15. When it finds one, it looks for one or more links in the tweet. Each link is fetched (which also unshortens it), any HTML that comes back is parsed (thanks cheerio) to find a page title, and these details are stashed in redis along with the tweet.
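The link and title steps can be sketched roughly as follows. This is not the earls code (which is a node app using cheerio); it is a minimal Python illustration using only the standard library. The names extract_urls, TitleParser and page_title are made up for this sketch, and the fetching, unshortening and redis steps are left out.

```python
import re
from html.parser import HTMLParser

# Assumption: a simple pattern is good enough to spot links in tweet text.
URL_PATTERN = re.compile(r"https?://[^\s]+")


def extract_urls(tweet_text):
    """Return the links found in a tweet's text, minus trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(tweet_text)]


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    """Return the page title, or None when the page has no usable title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() or None
```

In a real run the HTML passed to page_title would come from fetching each link and following redirects, which is also what resolves shortened URLs.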

When you load the page it will show you the latest counts for all URLs it has found so far. Unfortunately, at the moment you need to reload the page to get an update. If I have time I will work on making it update live in the page with socket.io. earls could be used for other conferences, and ought to run pretty easily on Heroku for free.
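The counting itself is simple. Here is a hedged Python sketch of the tally behind a page like that, with an in-memory Counter standing in for redis. The function name count_urls and the choice to count a URL once per tweet are assumptions made for the example, not a description of how earls does it.

```python
from collections import Counter


def count_urls(tweets):
    """Tally how many tweets mention each URL, most mentioned first.

    tweets is an iterable of (tweet_id, urls) pairs. A URL repeated
    inside a single tweet is counted once for that tweet (an assumption).
    """
    counts = Counter()
    for _tweet_id, urls in tweets:
        counts.update(set(urls))
    return counts.most_common()
```

Calling count_urls again after new tweets arrive gives the refreshed list, which is roughly what reloading the page does.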

Oh, and you can see the JSON data here in case you have other ideas of things you’d like to do with the data.

Have a superb conference you crazy dreamers and doers!


Wikipedia’s 10th Birthday Party at the National Archives in Washington DC on Saturday was a lot of fun. Far and away, the most astonishing moment for me came early in the opening remarks by David Ferriero, the Archivist of the United States, when he stated (in no uncertain terms) that he was a big fan of Wikipedia, and that it was often his first go-to for information. Not only that, but when discussion about a bid for a DC Wikimania (the Wikipedia Annual Conference) came up later in the morning, Ferriero suggested that the National Archives would be willing to host it if it came to pass. I’m not sure if anything actually came of this later in the day–a Wikimania in DC would be incredible. It was just amazing to hear the Archivist of the United States be supportive of Wikipedia as a reference source…especially as stories of schools, colleges and universities rejecting Wikipedia as a source are still common. Ferriero’s point was even more poignant with several high schoolers in attendance. Now we all can say:

If Wikipedia is good enough for the Archivist of the United States, maybe it should be good enough for you.

Another highlight for me was meeting Phoebe Ayers, who is a reference librarian at UC Davis, member of the Wikimedia Foundation Board of Trustees, and author of How Wikipedia Works. I strong-armed Phoebe into signing my copy (I bought this copy on Amazon after it was de-accessioned from Cuyahoga County Public Library in Parma, Ohio). Phoebe has some exciting ideas for creating collaborations between libraries and Wikipedia, which I think fit quite well into the Galleries, Libraries, Archives and Museums (GLAM) effort within Wikipedia. I think she is still working on how to organize the effort.

Later in the day we heard how the National Archives is thinking of following the lead of the British Museum and establishing a Wikipedian in Residence. Liam Wyatt, the first Wikipedian in Residence, put a human face on Wikipedia for the British Museum, and familiarized museum staff with editing Wikipedia through activities like the Hoxne Challenge. Having a Wikipedian in Residence at the National Archives (and who knows, maybe the Smithsonian and the Library of Congress) would be extremely useful, I think.

In a similar vein, Sage Ross spoke at length about the Wikipedia Ambassador Program. The Ambassador Program is a formal way for folks to represent Wikipedia in academic settings (universities, high schools, etc). Ambassadors can get training in how to engage with Wikipedia (editing, etc) and can help professors and teachers who want to integrate Wikipedia into their curriculum and scholarly activities.

I got to meet Peter Benjamin Meyer of the Bureau of Labor Statistics, who has some interesting ideas for aggregating statistical information from federal statistical sources, and writing some bots that will update article info-boxes for places in the United States. The impending release of the 2010 US Census Data has the Wikipedia community discussing the best way to update the information that was added by a bot for the 2000 census. It seemed like Peter might be able to piggyback some of his efforts on the work that is going on at Wikipedia for the 2010 Census.

Jyothis Edthoot, an Oracle employee and Wikipedia Steward, gave me a behind the scenes look at the tools he and others in the Counter Vandalism Unit use to keep Wikipedia open for edits from anyone in the world. I also got to meet Harihar Shankar from Herbert van de Sompel’s team at Los Alamos National Lab, and to learn more about the latest developments with Memento, which he gave a lightning talk about. I also ran into Jeanne Kramer-Smyth of the World Bank, and got to hear about their efforts to use their metadata to give web crawlers meaningful access to their document collections.

I did end up giving a lightning talk about Linkypedia (slides on the left). I was kind of rushed, and I wasn’t sure that this was exactly the right audience for the talk (being mainly Wikipedians instead of folks from the GLAM sector). But it helped me think through some of the challenges in expressing what Linkypedia is about, and who it is for. All in all it was a really fun day, with a lot of friendly folks interested in the Wikipedia community. There must’ve been at least 70 people there on a very cold Saturday–a promising sign of good things to come for collaborations between Wikipedia and the DC area.

ipres, iipc, pasig roundup/braindump

I spent last week in San Francisco attending 3 back-to-back conferences: the International Conference on Preservation of Digital Objects (iPRES), International Internet Preservation Consortium (IIPC), and the Sun Preservation and Archiving Special Interest Group (PASIG)…thanks to the Library of Congress and to Kesa Summers for letting me go. Also, thanks to the 3 conferences for deciding to co-locate in San Francisco at the same time, which made this sort of tag-team-digital-preservation-event-week possible. I hadn’t been to any of iPRES, IIPC or PASIG before, so it was a lot of fun being able to take them all in at once…especially since, given the nature of my group at the Library of Congress, these are my kind of people.

Each event had a different flavor, but the topic under discussion at each was digital preservation. iPRES focused generally on digital preservation, particularly from a research angle. IIPC also had a bit of a research flavor, but focused more specifically on the practicalities of archiving web content. And PASIG was less research oriented, and much more oriented around building/maintaining large scale storage systems. There was so much good content at these events, that it’s kind of impossible to summarize it here. But I thought I would at least attempt to blurrily characterize some of the ideas from the three events that I’m taking back with me.


Long term digital preservation has many hard problems–so many that I think it is rational to feel somewhat overwhelmed and to some extent even paralyzed. It was important to see other people recognize the big problems of emulation, format characterization/migration, and compression, but continue working on pragmatic solutions for today. Martha Anderson made the case several times for thinking of digital preservation in terms of 5-10 year windows, instead of forever. The phrase “to get to forever you have to get to 5 years first” got mentioned a few times, but I don’t know who said it first. John Kunze brought up the notion of preservation as a “relay”, where bits are passed along at short intervals–and how digital curation needs to enable these hand-offs to happen easily. It came to my attention later that this relay idea is something that Chris Rusbridge wrote about back in 2006.


On a similar note, Martha Anderson indicated that making bits useful today is a key factor that the National Digital Information Infrastructure and Preservation Program (NDIIPP) weighs when making funding decisions. Brewster Kahle in his keynote for IIPC struck a similar note that “preservation is driven by access”. Gary Wright gave an interesting presentation about how the Church of Latter Day Saints had to adjust the Reference Model for Open Archival Information System (OAIS) to give thousands of concurrent users access to their archive of 3.1 billion genealogical image records. Jennifer Waxman was kind enough to give me a pointer to some work Paul Conway has done on this topic of access driven preservation. The topic of access in digital preservation is important to me, because I work in a digital preservation group at the Library of Congress, working primarily on access applications. We’ve had a series of pretty intense debates about the role of access in digital preservation … so it was good to hear the topic come up in San Francisco. In a world where Lots of Copies Keeps Stuff Safe, access to copies is pretty important.

Less is More (More or Less)

Over the week I got several opportunities to hear details from John Kunze, Stephen Abrams, and Margaret Low about the California Digital Library’s notion of curation micro-services, and how they enable digital preservation efforts at CDL. Several folks in my group at LC have been taking a close look at the CDL specifications recently, so getting to hear about the specs, and even see some implementation demos from Margaret was really quite awesome. The specs are interesting to me because they seem to be oriented around the fact that our digital objects ultimately reside on some sort of hierarchical filesystem. Filesystem APIs are fairly ubiquitous. In fact, as David Rosenthal has pointed out, some filesystems are even designed to resist change. As Kunze said at PASIG in his talk Permanent Objects, Evolving Services, and Disposable Systems: An Emergent Approach to Digital Curation Infrastructure:

What is the thinnest smear of functionality that we can add to the filesystem so that it can act as an object storage system?

Approaches to building digital repository software thus far have been primarily aimed at software stacks (dspace, fedora, eprints) which offer particular services, or service frameworks. But the reality is that these systems come and go, and we are left with the bits. Why don’t we try to get the bits in shape so that they can be handed off easily in the relay from application to application, filesystem to filesystem? What is nice about the micro-services approach is that:

  • The services are compose-able, allowing digital curation activities to be emergent, rather than imposed by a pre-defined software architecture. Since I’ve been on a bit of a functional programming kick lately, I see compose-ability as a pretty big win.
  • The services are defined by short specifications, not software–so they are ideas instead of implementations. The specifications are clearly guided by ease of implementation, but ultimately they could be implemented in a variety of languages, and tools. Having a 2-3 page spec that defines a piece of functionality, and can be read by a variety of people, and implemented by different groups seems to be an ideal situation to strive for.
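To make the “thin smear” idea a bit more concrete, here is a small Python sketch of two independent functions that share nothing but a directory convention: one stores a file with a checksum manifest, the other re-checks the bits later. This is not an implementation of any of the CDL specifications. The directory layout, the manifest.json file, and the names object_dir, put and verify are all assumptions made up for this illustration.

```python
import hashlib
import json
from pathlib import Path


def object_dir(root, identifier):
    """Map an identifier to a directory (hypothetical layout, not a CDL spec)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return Path(root) / digest[:2] / digest[2:4] / digest


def put(root, identifier, filename, data):
    """Write the bytes for one file and record its checksum in a manifest."""
    directory = object_dir(root, identifier)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / filename).write_bytes(data)
    manifest_path = directory / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"id": identifier, "files": {}}
    manifest["files"][filename] = hashlib.sha256(data).hexdigest()
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return directory


def verify(root, identifier):
    """Recompute each file's checksum and compare it with the manifest."""
    directory = object_dir(root, identifier)
    manifest = json.loads((directory / "manifest.json").read_text())
    return {
        name: hashlib.sha256((directory / name).read_bytes()).hexdigest() == expected
        for name, expected in manifest["files"].items()
    }
```

Because everything lives in ordinary files and directories, the object can be copied to another filesystem with any tool, and verify (or a reimplementation of it in another language) will still work against the copy.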

Everything Else Is Miscellaneous

Like I said, there was a ton of good content over the week…and it seems somewhat foolhardy to try to summarize it all in a single blog post. I tried to summarize the main themes I took home with me on the plane back to DC…but there were also lots of nuggets of ideas that came up in conversation, and in presentations that I want to at least jot down:

  • While archival storage may not be best served by HDFS, jobs like virus scanning huge web crawls are well suited to distributed computing environments like Hadoop. We need to be able to operate at this scale at loc.gov.
  • In Cliff Lynch’s summary wrap up for PASIG he indicated that people don’t talk so much about what we do when the inevitable happens, and bits are lost. The digital preservation community needs to share more statistics on bit loss, system failure modes, and software design patterns that let us build more sustainable storage systems.
  • Dave Tarrant’s presentation on Where the Semantic Web and Web 2.0 meet format risk management: P2 registry was a welcome revelation about the intersection of my interest in linked data and digital preservation. His presentation of the PRONOM format registry as linked data, and Kevin De Vorsey’s talk about Obsolescence, Risk Management, and Preservation Planning at the National Library of New Zealand made me think that it might be interesting to explore how the LC’s Digital Formats website could be delivered as linked data, and linked to something like PRONOM. David Pearson also suggested that collaborative wiki-spaces could be used by digital format specialists to collect information…which got me thinking of how a Semantic MediaWiki instance could be used in conjunction with Tarrant’s ideas. How easy would it be to use the web to build a distributed network of preservation information, as opposed to some p2p solution?
  • I want to learn more about the (w)arc data format, and perhaps contribute to some of the existing code bases for working w/ (w)arc. I’m particularly interested in using harvesting tools and WARC to preserve linked data…which I believe some of the Sindice folks have worked on for their bot.
  • It’s long past time I understood how LOCKSS works as a technology. It was mentioned as the backbone of several projects during the week. I even overheard some talk about establishing rogue LOCKSS networks, which of course piqued my interest even more.
  • It would be fun to put a jython or jruby web front end on DROID for format identification, but it seems that Carol Chou of the Florida Center for Library Automation has already done something similar. Still, it would be neat to at least try it out, and perhaps have it conneg to Dave’s P2 registry or PRONOM.

Ok, braindump complete. Thanks for reading this far!