RealAudio, AAC and Archivy

A few months ago I happened to read a Pitchfork interview with David Grubbs about his book Records Ruin the Landscape. In the interview Grubbs mentioned how his book was influenced by a 2004 Kenny Goldsmith interview with Henry Flynt…and Pitchfork usefully linked to the interview in the WFMU archive.

You know, books linking to interviews linking to interviews linking to archives, the wondrous beauty and utility of hypertext.

I started listening to the interview on my Mac with Chrome and the latest RealAudio plugin but after a few minutes it went into a feedback loop of some kind, and became full of echoes and loops, and was completely unlistenable. This is WFMU so I thought maybe this was part of the show, but it went on for a while, which seemed a little bit odd. I tried reloading thinking it might be some artifact of the stream, but the exact thing happened again. I noticed a prominent Get Help link right next to the link for listening to the content. I clicked on it and filled out a brief form, not really expecting to hear back.

As you can see the WFMU archive view for the interview is sparse but eminently useful.

Unexpectedly, just a few hours later I received an email from Jeff Moore who wrote that playback of Real Audio had been reported to be a problem before on some items in the archive, and that they were in the process of migrating them to AAC. My report had pushed this particular episode up in the queue, and I could now reload the page and listen to an AAC stream via their Flash player. I guess now that it’s AAC there is probably something that could be done with the audio HTML element to avoid the Flash bit. But now I could listen to the interview (which, incidentally, is awesome) so I was happy.

I asked Jeff about how they were converting the RealAudio, because we have a fair bit of RealAudio lying around at my place of work. He wrote back with some useful notes that I thought I would publish on the Web for others googling for how to do it at this particular point in time. I’d be curious to know if you regard RealAudio as a preservation risk, and a good example of a format we ought to be migrating. The playback options seem quite limited, and precarious, but perhaps that’s just my own limited experience.

The whole interaction with WFMU, from discovery, to access, to preservation, to interaction seemed like such a perfect illustration of what the Web can do for archives, and vice-versa.

Jeff’s Notes

The text below is from Jeff’s email to me. Jeff, if you are reading this and don’t really want me quoting you this way, just let me know.

I’m still fine-tuning the process, which is why the whole bulk transcode isn’t done yet. I’m trying to find the sweet spot where I use enough space / bandwidth for the resulting files so that I don’t hear any obvious degradation from the (actually pretty terrible-sounding) Real files, but don’t just burn extra resources with nothing gained.

Our Real files are mostly mono sampled at 22.05 kHz, using a codec current decoders often identify as “Cook”.

I’ve found that ffmpeg does a good job of extracting a WAV file from the Real originals - oh, and since there are two warring projects which each provide a program called ffmpeg, I mean this one:

http://ffmpeg.org/

We’ve been doing our AAC encoding with the Linux version of the Nero AAC Encoder released a few years ago:

http://www.nero.com/enu/company/about-nero/nero-aac-codec.php

…although I’m still investigating alternatives.

One interesting thing I’ve encountered is that a straight AAC re-encoding from the Real file (mono, 22.05k) plays fine as a file on disk, but hasn’t played correctly for me (in the same VLC version) when streamed from Amazon S3. If I convert the mono archive to stereo and AAC-encode that with the Nero encoder, it’s been streaming fine.

Oh, and if you want to transfer tags from the old Real files to any new files, and your transcoding pipeline doesn’t automatically copy tags, note that ffprobe (also from the ffmpeg package) can extract tags from Real files, which you can then stuff back in (with neroAacTag or the tagger of your choice).
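Stepping outside Jeff’s email for a moment: the steps he describes could be sketched roughly as the Python below. This is not his actual pipeline. The function names and file paths are hypothetical, the stereo WAV intermediate is my reading of his streaming note, and the encoder is left on its default settings because he didn’t say which ones he uses. Check the flags against your own ffmpeg and Nero builds before relying on them.

```python
import json
import subprocess


def wav_command(real_path, wav_path):
    """ffmpeg command that decodes a Real file to a WAV file."""
    # -ac 2 asks ffmpeg for two output channels, so a mono source
    # becomes stereo (an assumption based on the S3 streaming note).
    return ["ffmpeg", "-i", real_path, "-ac", "2", wav_path]


def aac_command(wav_path, aac_path):
    """Nero encoder command; no quality flag, so its defaults apply."""
    return ["neroAacEnc", "-if", wav_path, "-of", aac_path]


def probe_command(real_path):
    """ffprobe command that reports container metadata as JSON."""
    return ["ffprobe", "-v", "quiet", "-print_format", "json",
            "-show_format", real_path]


def tags_from_probe(probe_json):
    """Pull the tag dictionary out of ffprobe's JSON output."""
    return json.loads(probe_json).get("format", {}).get("tags", {})


def transcode(real_path, wav_path, aac_path):
    """Run the steps in order and return the tags found in the original."""
    probe = subprocess.run(probe_command(real_path), check=True,
                           capture_output=True, text=True)
    subprocess.run(wav_command(real_path, wav_path), check=True)
    subprocess.run(aac_command(wav_path, aac_path), check=True)
    return tags_from_probe(probe.stdout)
```

The dictionary returned by transcode is what you would hand to neroAacTag or the tagger of your choice.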

Afterword

Here is Googlebot coming to get the content a few minutes after I published this post.

54.241.82.166 - - [23/May/2014:10:36:22 +0000] "GET http://inkdroid.org/journal/2014/05/23/realaudio-aac-and-archivy/ HTTP/1.1" 200 20752 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

So someone searching for how to convert RealAudio to AAC might stumble across it. This decentralized Web thing is kinda neat. We need to take care of it.


Broken World

You know that tingly feeling you get when you read something, look at a picture, or hear a song that subtly and effortlessly changes the way you think?

I don’t know about you, but for me thoughts, ideas and emotions can often feel like puzzles that stubbornly demand a solution, until something or someone helps make the problem evaporate or dissolve. Suddenly I can zoom in, out or around the problem, and it is utterly transformed. As that philosophical trickster Ludwig Wittgenstein wrote:

It is not surprising that the deepest problems are in fact not problems at all.

A few months ago, a tweet from Matt Kirschenbaum had this effect on me.

It wasn’t the tweet itself, so much as what the tweet led to: Steven Jackson’s Rethinking Repair, which recently appeared in the heady sounding Media Technologies: Essays on Communication, Materiality, and Society.

I’ve since read the paper three or four times, taken notes, underlined stuff to follow up on, etc. I’ve been meaning to write about it here, but I couldn’t start…I suppose for reasons that are (or will become) self evident. The paper is like a rhizome that brings together many strands of thought and practice that are directly relevant to my personal and professional life.

I’ve spent the last decade or so working as a software developer in the field of digital preservation and archives. On good days this seems like a surprisingly difficult thing, and on the bad days it seems more like an oxymoron…or an elaborate joke.

At home I’ve spent the last year helping my wife start a business to change our culture one set of hands at a time, while trying to raise children in a society terminally obsessed with waste, violence and greed…and a city addicted to power, or at least the illusion of power.

In short, how do you live and move forward amidst such profound brokenness? Today Quinn Norton’s impassioned Everything is Broken reminded me of Jackson’s broken world thinking, and what a useful hack (literally) it is…especially if you are working with information technology today.

Writing software, especially for the Web, is still fun, even after almost 20 years. It keeps changing, spreading into unexpected places, and the tools just keep evolving, getting better, and more varied. But this same dizzying rate of change and immediacy poses some real problems if you are concerned about stuff staying around so people can look at it tomorrow.

When I was invited to the National Digital Forum I secretly fretted for months, trying to figure out if I had anything of substance to say to that unique blend of folks interested in the cross section of the Web and the cultural heritage. The thing I eventually landed on was taking a look at the Web as a preservation medium, or rather the Web as a process, which has a preservation component to it. In the wrapup I learned that the topic of “web preservation” had already been covered a few years earlier, so there wasn’t much new there … but there was some evidence that the talk connected with a few folks.

If I could do it all again I would totally (as Aaron would say) look at the Web and preservation through Jackson’s prism of broken world thinking.

The bit where I talked about how Mark Pilgrim and Why’s online presences were brought back from virtual suicide using Git repositories, the Internet Archive and a lot of TLC was totally broken world thinking. Verne Harris’ notion that the archive is always just a sliver of a sliver of a sliver of a window into process, and that as such it is extremely, extremely valuable is broken world thinking. Or the evolution of permalinks and cool URIs in the face of swathes of linkrot is, at its heart, broken world thinking.

The key idea in Jackson’s article (for me) is that there are very good reasons to remain hopeful and constructive while at the same time being very conscious of the problems we find ourselves in today. The ethics of care that he outlines, with roots in feminist theory, is a deeply transformative idea. I’ve got lots of lines of future reading to follow, in particular in the area of sustainability studies, which seems very relevant to the work of digital preservation.

But most of all there is Jackson’s insight that innovation doesn’t happen in lightbulb moments (the mythology of the Silicon Valley origin story) or in the latest tech trend, but in the recognition of brokenness, and the willingness to work together with others to repair it. He positions repair as an ongoing process that fuels innovation:

… broken world thinking asserts that breakdown, dissolution, and change, rather than innovation, development, or design as conventionally practiced and thought about are the key themes and problems facing new media.

I should probably stop there. I know I will return to this topic again, because I feel like a lot of my previous writing here has centered on the importance of repair, without me knowing it. I just wanted to stop for a moment, and give a shout out to some thinking that I suspect will guide me for the next twenty years.


linking spoken quotes of quotes

An ancient buddha said, “If you do not wish to incur the cause for Unceasing Hell, do not slander the true dharma wheel of the Tathagata. You should carve these words on your skin, flesh, bones and marrow; on your body, mind and environment; on emptiness and on form. They are already carved on trees and rocks, on fields and villages.”

From Gary Snyder’s reading of The Teachings of Zen Master Dogen (about 1:26:00 in).

His delivery is just a delight to listen to. The puzzling strangeness of the text is made whole in the precision, earthiness and humor of his words.


The Archive as Data Platform

Yesterday Wikileaks announced the availability of a new collection, the Carter Cables, which are a new addition to the Public Library of US Diplomacy (PlusD). One thing in particular in the announcement caught my attention:

The Carter Cables were obtained by WikiLeaks through the process described here after formal declassification by the US National Archives and Records Administration earlier this year.

If you follow the link you can see that this content was obtained in a similar manner to the Kissinger Files, which were released just over a year ago. Perhaps this has already been noted, but I didn’t notice before that the Kissinger Files (the largest Wikileaks release to date) were not leaked to Wikileaks, but were legitimately obtained directly from NARA’s website:

Most of the records were reviewed by the United States Department of State’s systematic 25-year declassification process. At review, the records were assessed and either declassified or kept classified with some or all of the metadata records declassified. Both sets of records were then subject to an additional review by the National Archives and Records Administration (NARA). Once believed to be releasable, they were placed as individual PDFs at the National Archives as part of their Central Foreign Policy Files collection.

The Central Foreign Policy Files are a series from the General Records of the Department of State record group. Anyone with a web browser can view these documents on NARA’s Access to Archival Databases website. If you try to access them you’ll notice that the series is broken up into 15 separate files. Each file is a set of documents that can be searched individually. There’s no way to browse the contents of a file, series or the entire group: you must do a search and click through each of the results (more on this in a moment).

The form in which these documents were held at NARA was as 1.7 million individual PDFs. To prepare these documents for integration into the PlusD collection, WikiLeaks obtained and reverse-engineered all 1.7 million PDFs and performed a detailed analysis of individual fields, developed sophisticated technical systems to deal with the complex and voluminous data and corrected a great many errors introduced by NARA, the State Department or its diplomats, for example harmonizing the many different ways in which departments, capitals and people’s names were spelt.

It would be super to hear more details about their process for doing this work. I think archives could potentially learn a lot about how to enhance their own workflows for doing this kind of work at scale.

And yet I think there is another lesson here in this story. It’s actually important to look at this PlusD work as a success story for NARA…and one that can potentially be improved upon. I mentioned above that it doesn’t appear to be possible to browse a list of documents and that you must do a search. If you do a search and click on one of the documents you’ll notice you get a URL like this:

http://aad.archives.gov/aad/createpdf?rid=99311&dt=2472&dl=1345

And if you browse to another you’ll see something like:

http://aad.archives.gov/aad/createpdf?rid=841&dt=2472&dl=1345

Do you see the pattern? Yup, the rid appears to be a record number, and it’s an integer that you can simply start at 1 and keep going until you’ve got to the last one for that file, in this case 155278.

It turns out the other dt and dl parameters change for each file, but they are easily determined by looking at the overview page for the series. Here they are if you are curious:

  • http://aad.archives.gov/aad/createpdf?rid=&dt=2472&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2473&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2474&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2475&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2492&dl=1346
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2493&dl=1347
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2476&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2477&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2494&dl=1346
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2495&dl=1347
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2082&dl=1345
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2083&dl=1348
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2084&dl=1346
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2085&dl=1347
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2532&dl=1629
  • http://aad.archives.gov/aad/createpdf?rid=&dt=2533&dl=1630

Of course it would be trivial to write a harvesting script to pull down the ~380 gigabytes of PDFs by creating a loop with a counter and using one of the many many HTTP libraries. Maybe even with a bit of sleeping in between requests to be nice to the NARA website. I suspect that this is how Wikileaks were able to obtain the documents.
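To make the idea concrete, here is a minimal sketch of such a loop in Python, using only the standard library. It is my guess at the general shape, not a description of what Wikileaks actually ran. The function names are made up for illustration, there is no handling of missing records or failed requests, and the one second pause is an arbitrary choice.

```python
import time
import urllib.request

BASE = "http://aad.archives.gov/aad/createpdf"


def record_url(rid, dt, dl):
    """Build the PDF URL for one record in one file of the series."""
    return f"{BASE}?rid={rid}&dt={dt}&dl={dl}"


def harvest(dt, dl, last_rid, fetch=None, pause=1.0, first_rid=1):
    """Yield (rid, pdf_bytes) for every record number in the range.

    fetch can be swapped out for testing; by default it does a plain
    HTTP GET. pause is the number of seconds to sleep between requests.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    for rid in range(first_rid, last_rid + 1):
        yield rid, fetch(record_url(rid, dt, dl))
        time.sleep(pause)
```

For the first file you would iterate over harvest(2472, 1345, 155278) and write each PDF to disk as it arrives.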

But, in an ideal world, this sort of URL inspection shouldn’t be necessary, right? Also, perhaps it could be done in such a way that the burden of distributing the data doesn’t fall on NARA alone? It feels like a bit of an accident that it’s possible to download the data in bulk from NARA’s website this way. But it’s an accident that’s good for access.

What if, instead of trying to build the ultimate user experience for archival content, archives focused first and foremost on providing simple access to the underlying data? I’m thinking of the sort of work Carl Malamud has been doing for years at public.resource.org. With a solid data foundation like that, and simple mechanisms for monitoring the archive for new accessions it would then be possible to layer other applications on top within the enterprise and (hopefully) at places external to the archive, that provide views into the holdings.

I imagine this might sound like ceding the responsibility of the archive to some. It may also sound a bit dangerous to those that are concerned about connecting up public data that is currently unconnected. I’m certainly not suggesting that user experience and privacy aren’t important. But I think Cassie is right:

I imagine there are some that feel that associating this idea of the archive as data platform with the Wikileaks project might be counterproductive to an otherwise good idea. I certainly paused before hitting publish on this blog post, given the continued sensitivity around the issue of Wikileaks. But as other archivists have noted there is a great deal to be learned from the phenomenon that is Wikileaks. Open and respectful conversations about what is happening are important, right?

Most of all I think it’s important that we don’t look at this bulk access and distant reading of archival material as a threat to the archive. Researchers should feel that downloading data from the archive is a legitimate activity. Where possible they should be given easy and efficient ways to do it. Archives need environments like OpenGov NSW (thanks Cassie) and the Government Printing Office’s Bulk Data website (see this press release about the Federal Register) where this activity can take place, and where a dialogue can happen around it.

Update: May 8, 2014

Alexa O’Brien’s interview on May 6th with Sarah Harrison of Wikileaks at re:publica14 touched on lots of issues related to Wikileaks the archive. In particular the discussion of redaction, accessibility and Wikileaks’ role in publishing declassified information for others (including journalists) was quite relevant to the topic of this blog post.


Flickr Commons LAMs

After the last post Seb got me wondering if there were any differences between libraries, archives and museums when looking at upload and comment activity in Flickr Commons in Aaron’s snapshot of the Flickr Commons metadata.

First I had to get a list of Flickr Commons organizations and classify them as either a library, museum or archive. It wasn’t always easy to pick, but you can see the result here. I lumped galleries in with museums. I also lumped historical societies in with archives. Then I wrote a script that walked around in the Redis database I already had from loading Aaron’s data.

In doing this I noticed there were some Flickr Commons organizations that were missing from Aaron’s snapshot:

  • Tasmanian Archive and Heritage Office Commons
  • The Royal Library, Denmark
  • The Finnish Museum of Photography
  • Musée McCord Museum

Update: Aaron quickly fixed this.

I didn’t do any research to see if these organizations had significant activity. Also, since there were close to a million files, I didn’t load the British Library activity yet. If there’s interest in adding them into the mix I’ll splurge for the larger ec2 instance.

Anyhow, below are the results. You can find the spreadsheet for these graphs up in Google Docs.

This was all done rather quickly, so if you notice anything odd or that looks amiss please let me know. Initially it seemed a bit strange to me that libraries, archives and museums trended so similarly in each graph, even if the volume was different.


Where Brooklyn At?

As a follow up to my last post I added a script to my fork of Aaron’s py-flarchive that will load up a Redis instance with comments, notes, tags and sets for Flickr images that were uploaded by Brooklyn Museum. The script assumes you’ve got a snapshot of the archived metadata, which I downloaded as a tarball. It took several hours to unpack the tarball on a medium ec2 instance; so if you want to play around and just want the redis database let me know and I’ll get it to you.

Once I loaded up Redis I was able to generate some high level stats:

  • images: 5,697
  • authors: 4,617
  • tags: 6,132
  • machine tags: 933
  • comments: 7,353
  • notes: 963
  • sets: 141

Given how many images there were, that is an astonishing number of authors: unique people who added tags, comments or notes. If you are curious I generated a list of the tags and saved them as a Google Doc. The machine tags were particularly interesting to me. The majority (849) of them look like Brooklyn Museum IDs of some kind, for example:

bm:unique=S10_08_Thebes/9928

But there were also 51 geotags, and what looks like 23 links to items in Pleiades, for example:

tag:pleiades:depicts=721417202

If I had to guess I’d say this particular machine tag indicated that the Brooklyn Museum image depicted Abu Simbel. Now there weren’t tons of these machine tags but it’s important to remember that other people use Flickr as a scratch space for annotating images this way.
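If you want to tally machine tags like these yourself, something along the following lines would do it. This is a rough sketch and not the script I actually used. It assumes the usual namespace:predicate=value shape, and it treats everything before the last colon as the namespace, which is just one way of coping with a tag that has an extra prefix like the Pleiades example above. The function names are made up for illustration.

```python
from collections import Counter


def parse_machine_tag(tag):
    """Split a tag into (namespace, predicate, value), or return None."""
    left, sep, value = tag.partition("=")
    if not sep:
        return None  # no "=" means this is an ordinary tag
    # Assumption: the predicate is whatever follows the last colon.
    namespace, colon, predicate = left.rpartition(":")
    if not colon or not namespace or not predicate:
        return None
    return namespace, predicate, value


def count_namespaces(tags):
    """Count how many machine tags fall under each namespace."""
    parsed = (parse_machine_tag(t) for t in tags)
    return Counter(p[0] for p in parsed if p)
```

Passing the full tag list to count_namespaces gives a quick view of which namespaces dominate.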

If you aren’t familiar with them, Flickr notes are annotations of an image, where the user has attached a textual note to a region in the image. Just eyeballing the list, it appears that there is quite a bit of diversity in them, ranging from the whimsical:

  • cool! they look soo surreal
  • teehee somebody wrote some graffiti in greek
  • Lol are these painted?
  • Steaks are ready!

to the seemingly useful:

  • Hunter’s Island
  • Ramesses III Temple
  • Lapland Village
  • Lake Michigan
  • Montuemhat Crypt
  • Napoleon’s troops are often accused of destroying the nose, but they are not the culprits. The nose was already gone during the 18th century.

Similarly the general comments run the gamut from:

  • very nostalgic…
  • always wanted to visit Egypt

to:

  • Just a few points. This is not ‘East Jordan’ it is in the Hauran region of southern Syria. Second it is not Qarawat (I guess you meant Qanawat) but Suweida. Third there is no mention that the house is enveloped by the colonnade of a Roman peripteral temple.
  • The fire that destroyed the buildings was almost certainly arson. it occurred at the height of the Pullman strike and at the time, rightly or wrongly, the strikers were blamed.
  • You can see in the background, the TROCADERO with two towers .. This “medieval city” was built on the right bank where are now buildings in modern art style erected for the exposition of 1937.

Brooklyn Museum pulled over 48 tags from Flickr before they deleted the account. That’s just 0.7% of the tags that were there. None of the comments or notes were moved over.

In the data that Aaron archived there was one indicator of user engagement: the datetime included with comments. Combined with the upload time for the images it was possible to create a spreadsheet that correlates the number of comments with the number of uploads per month:

Brooklyn Museum Flickr Activity

I’m guessing the drop off in December of 2013 is due to that being the last time Aaron archived Brooklyn Museum’s metadata. You can see that there was a decline in user engagement: the peak in late 2008 / early 2009 was never matched again. I was half expecting to see that user engagement fell off when Brooklyn Museum’s interest in the platform (uploads) fell off. But you can see that they continued to push content to Flickr, without seeing much of a reward, at least in the shape of comments. It’s impossible now to tell if tagging, notes or sets trended differently.
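For anyone wanting to reproduce the monthly counts, here is a small sketch of the bucketing step. It assumes the upload and comment times are Unix timestamps (as integers or strings) and that months are counted in UTC. Both of those are my assumptions about the data rather than something documented here, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def month_of(timestamp):
    """Turn a Unix timestamp into a YYYY-MM string (UTC)."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime("%Y-%m")


def monthly_activity(upload_times, comment_times):
    """Return sorted (month, uploads, comments) rows for a spreadsheet."""
    uploads = Counter(month_of(t) for t in upload_times)
    comments = Counter(month_of(t) for t in comment_times)
    months = sorted(set(uploads) | set(comments))
    # A Counter returns 0 for months where one side had no activity.
    return [(m, uploads[m], comments[m]) for m in months]
```

Each row can be written straight out as a line of CSV.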

Since Flickr includes the number of times each image was viewed, it’s possible to add up the views across all the images. The answer?

9,193,331

Not a bad run for 5,697 images. I don’t know if Brooklyn Museum downloaded their metadata prior to removing their account. But luckily Aaron did.


Glass Houses

You may have noticed Brooklyn Museum’s recent announcement that they have pulled out of Flickr Commons. Apparently they’ve seen a “steady decline in engagement level” on Flickr, and decided to remove their content from that platform, so they can focus on their own website as well as Wikimedia Commons.

Brooklyn Museum announced three years ago that they would be cross-posting their content to Internet Archive and Wikimedia Commons. Perhaps I’m not seeing their current bot, but they appear to have two, neither of which has done an upload since March of 2011, based on their user activity. It’s kind of ironic that content like this was uploaded to Wikimedia Commons by Flickr Uploader Bot and not by one of their own bots.

The announcement stirred up a fair bit of discussion about how an institution devoted to the preservation and curation of cultural heritage material could delete all the curation that has happened at Flickr. The theory being that all the comments, tagging and annotation that has happened on Flickr has not been migrated to Wikimedia Commons. I’m not even sure if there’s a place where this structured data could live at Wikimedia Commons. Perhaps some sort of template could be created, or it could live in Wikidata?

Fortunately, Aaron Straup-Cope has a backup copy of Flickr Commons metadata, which includes a snapshot of the Brooklyn Museum’s content. He’s been harvesting this metadata out of concern for Flickr’s future, but surprise, surprise – it was an organization devoted to preservation of cultural heritage material that removed it. It would be interesting to see how many comments there were. I’m currently unpacking a tarball of Aaron’s metadata on an ec2 instance just to see if it’s easy to summarize.

But:

I’m pretty sure I’m living in one of those.

I agree with Ben:

It would help if we had a bit more method to the madness of our own Web presence. Too often the Web is treated as a marketing platform instead of our culture’s predominant content delivery mechanism. Brooklyn Museum deserves a lot of credit for talking about this issue openly. Most organizations just sweep it under the carpet and hope nobody notices.

What do you think? Is it acceptable that Brooklyn Museum discarded the user contributions that happened on Flickr, and that all the people who happened to be pointing at said content from elsewhere now have broken links? Could Brooklyn Museum instead have decided to leave the content there, with a banner of some kind indicating that it is no longer actively maintained? Don’t lots of copies keep stuff safe?

Or perhaps having too many copies detracts from the perceived value of the currently endorsed places of finding the content? Curators have too many places to look, which aren’t synchronized, which adds confusion and duplication. Maybe it’s better to have one place where people can focus their attention?

Perhaps these two positions aren’t at odds, and what’s actually at issue is a framework for thinking about how to migrate Web content between platforms. And different expectations about content that is self hosted, and content that is hosted elsewhere?


dump truck

I had a strange dream last night.

I was working on a consulting project with an archivist friend, and a group of others I didn’t know as well. I knew that the work was politically sensitive, and that it was important for some reason that escapes me now.

Before the work could start I was required to sign a document. I trusted my friend implicitly, and didn’t really read it closely. Afterwards my friend let me know that I had signed a document stating that I had committed suicide, and that I no longer legally existed. Apparently, this would provide a framework for the work to happen more easily.

I remember being concerned about my family. I walked outside to smoke a cigarette (something I don’t do anymore). A few of the other people joined me, and we got in a dump truck.


I don’t know why, but I woke up from the dream feeling strangely relaxed. I looked up a few things in our cheesy dreamer’s dictionary.

Archives. Anything to do with archives in a dream is a forerunner of unexpected legal entanglements.

Document. Business or legal documents in a dream are usually a warning against speculations, unless the documents were in an indigenous place such as a notary or lawyer’s office, in which case they portend a coming increase, possibly for inheritance.

Suicide. This dream is a signal that you need a change of scene or more mental relaxation. Try sharing your troubles with a trusted friend or adviser, but in any event stop brooding.

Cigar (or Cigarette) Whether you were offering them, smoking yourself, or observing someone else with them, this form of tobacco in a dream is a lucky omen pertaining to prosperity.

Unfortunately, there wasn’t an entry for dump truck.


To Mr. Hazard

Lots of Copies Keeps Stuff Safe, 1791-Style.

Time and accident are committing daily havoc on the originals deposited in our public offices. The late war has done the work of centuries in this business. The last cannot be recovered, but let us save what remains; not by vaults and locks, which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies as shall place them beyond the reach of accident.

– Thomas Jefferson

GoogleBooks


Incompleteness

Zen spirit has come to mean not only peace and understanding, but devotion to art and to work, the rich unfoldment of contentment, opening the door to insight, the expression of innate beauty, the intangible charm of incompleteness. Zen carries many meanings, none of them entirely definable. If they are defined, they are not Zen.

Zen Flesh, Zen Bones, p. 18