
day of digital archives psa

Today is the Day of Digital Archives, and I had this semi-thoughtful post written up about BagIt and how it’s a brain-dead simple format for packaging up your files so that you’ll know whether you still have them 5 minutes, 5 hours, 5 days, 5 years, maybe even 5 decades from now, if the notion of directories and files persists that long.

But I deleted that…you’re welcome…

I was also going to write about how, in a fit of web performance art, Mark Pilgrim recently deleted his online presence, including various extremely useful open source tools and several popular online books, only to see them re-materialize on the Web at new locations.

But I deleted most of that too…you’re welcome again!

Here’s a public service announcement instead. If you happen to use Franco Lazzarino’s Ruby BagIt Library to create bags that contain largish files (> 500MB), you might have accidentally created bad SHA1 manifests. I added a test, fixed the bug with help from Mark Matienzo and Michael Klein, and sent a pull request. It hasn’t been applied yet, so here’s hoping it will be.

At $mpow we’ve been getting terabytes of data from a social media company that has been bagging its data using this Ruby library. Many of the files are multi-gigabyte gzip-compressed files, and many of the bags now have bad SHA1 manifests. The social media company wasn’t sure what the problem was, and told us just to ignore the SHA1 manifests. Which is easy enough to do.
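If you’d rather check than ignore, here is a minimal Python sketch of how one might recompute SHA1s and compare them against a bag’s manifest. It is not the Ruby library’s code, nor the fix from the pull request. The function names (sha1_of_file, find_bad_entries) and the 1 MB chunk size are my own assumptions for illustration, and the sketch assumes the usual manifest-sha1.txt layout of one "checksum path" pair per line. Reading in chunks keeps memory flat even for multi-gigabyte files.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so memory use stays flat for large files."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_entries(bag_dir, manifest_name="manifest-sha1.txt"):
    """Return the manifest paths whose recomputed SHA1 differs from the recorded one."""
    bag_dir = Path(bag_dir)
    bad = []
    with open(bag_dir / manifest_name, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            # Assumed layout: checksum, whitespace, then the path relative to the bag root
            expected, relative_path = line.split(None, 1)
            if sha1_of_file(bag_dir / relative_path) != expected.lower():
                bad.append(relative_path)
    return bad
```

Calling find_bad_entries on the path to a bag returns a list of the payload paths that don’t match, so an empty list means the manifest checks out.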

It seems like no matter how simple the spec, it’s easy to create bugs. If you create bags, throw Bag-Software-Agent into your bag-info.txt…you never know who might find it useful.
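For what it’s worth, here is a rough Python sketch of recording the agent, assuming bag-info.txt is a plain text file of "Label: value" lines at the top of the bag. The helper name record_software_agent and the agent string in the usage note are made up for illustration, not part of any real tool.

```python
from pathlib import Path


def record_software_agent(bag_dir, agent):
    """Append a Bag-Software-Agent line to bag-info.txt unless one is already present.

    Returns True if a line was written, False if the label already existed.
    """
    info_path = Path(bag_dir) / "bag-info.txt"
    existing = info_path.read_text(encoding="utf-8") if info_path.exists() else ""
    for line in existing.splitlines():
        # Compare the label case-insensitively so an existing entry is never duplicated
        if line.lower().startswith("bag-software-agent:"):
            return False
    # Make sure the new line starts on its own line if the file lacks a trailing newline
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with open(info_path, "a", encoding="utf-8") as handle:
        handle.write(f"{separator}Bag-Software-Agent: {agent}\n")
    return True
```

Something like record_software_agent("my-bag", "example-bagger 0.1") would do it, where both the directory and the agent name are placeholders. If the bag also carries a tag manifest, its entry for bag-info.txt would need regenerating after the edit.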