Over the weekend you probably saw the announcements that Google Books has released more than 1 million public domain ebooks on the web as epubs. This is great news: epub is a web-friendly, open format, and having all this content available as epub is important.
Now I might be greedy, but when I saw that 1 million epubs were available my mind immediately jumped to getting them, indexing them and whatnot. Then I guiltily justified my greedy thoughts by pondering the conventional digital preservation wisdom that Lots of Copies Keeps Stuff Safe (LOCKSS). The books are in the public domain, so ... why not?
Google Books has a really nice API, which lets you get back search results as Atom, with lots of links to things like thumbnails, annotations, item views, etc. You also get a nice amount of Dublin Core metadata. And you can limit your search to books published before 1923. For example here’s a search for pre-1923 books that mention “Stevenson” (disclaimer: I don’t think the 1923 limit is actually working):
curl 'http://books.google.com/books/feeds/volumes?tbs=cd_max:Jan%2001_2%201923&q=Stevenson' | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
http://www.google.com/books/feeds/volumes 2010-08-30T20:37:27.000Z Search results for Stevenson Google Books Search http://www.google.com Google Book Search data API 206 1 10

http://www.google.com/books/feeds/volumes/ENMWAAAAYAAJ 2010-08-30T20:37:27.000Z Kidnapped Robert Louis Stevenson 1909 308 pages book ENMWAAAAYAAJ HARVARD:HN1JZ9 Kidnapped being memoirs of the adventures of David Balfour in the year 1851 ...

http://www.google.com/books/feeds/volumes/WZ0vAAAAMAAJ 2010-08-30T20:37:27.000Z Treasure Island Robert Louis Stevenson George Edmund Varian 1918 CHAPTER I THE OLD SEA DOG AT THE "ADMIRAL BENBOW" SQUIRE Trelawney, Dr. Livesey, and the rest of these gentlemen having asked me to write down the whole ... 306 pages book WZ0vAAAAMAAJ NYPL:33433075793830 Fiction Treasure Island

http://www.google.com/books/feeds/volumes/REUrAQAAIAAJ 2010-08-30T20:37:27.000Z Stevenson Adlai Ewing Stevenson Grace Darling David Darling 1977-10 127 pages book REUrAQAAIAAJ STANFORD:36105037014342 McGraw-Hill/Contemporary Biography & Autobiography Stevenson

http://www.google.com/books/feeds/volumes/3ibdGgAACAAJ 2010-08-30T20:37:27.000Z Stevenson Robert Louis Stevenson 2007-01-17 This scarce antiquarian book is included in our special Legacy Reprint Series. 128 pages book 3ibdGgAACAAJ ISBN:1430495375 ISBN:9781430495376 Kessinger Pub Co Poetry Stevenson Day by Day

http://www.google.com/books/feeds/volumes/3QI-AAAAYAAJ 2010-08-30T20:37:27.000Z A child's garden of verses Robert Louis Stevenson 1914 IN winter I get up at night And dress by yellow candle-light. In summer, quite the other way, I have to go to bed by day. I have to go to bed and see The ... 136 pages book 3QI-AAAAYAAJ CORNELL:31924052752262 Children's poetry, Scottish A child's garden of verses by Robert Louis Stevenson; illustrated by Charles Robinson

http://www.google.com/books/feeds/volumes/Gmk-AAAAYAAJ 2010-08-30T20:37:27.000Z Travels with a donkey in the Cevennes Robert Louis Stevenson 1916 THE DONKEY, THE PACK, AND THE PACK - SADDLE IN a little place called Le Monastier, in a pleasant highland valley fifteen miles from Le Puy, I spent about a ... 287 pages book Gmk-AAAAYAAJ HARVARD:HWP541 Cévennes Mountains (France) Travels with a donkey in the Cevennes An inland voyage

http://www.google.com/books/feeds/volumes/f3A-AAAAYAAJ 2010-08-30T20:37:27.000Z St. Ives Robert Louis Stevenson 1906 IVES CHAPTER IA TALE OF A LION RAMPANT IT was in the month of May,, that I was so unlucky as to fall at last into the hands of the enemy. ... 528 pages book f3A-AAAAYAAJ HARVARD:HWP61W St. Ives being the adventures of a French prisoner in England

http://www.google.com/books/feeds/volumes/4mb8LuKKwocC 2010-08-30T20:37:27.000Z Cruising with Robert Louis Stevenson Oliver S. Buckton 2007 Cruising with Robert Louis Stevenson: Travel, Narrative, and the Colonial Body is the first book-length study about the influence of travel on Robert Louis ... 344 pages book 4mb8LuKKwocC ISBN:0821417568 ISBN:9780821417560 Ohio Univ Pr Literary Criticism Cruising with Robert Louis Stevenson travel, narrative, and the colonial body

http://www.google.com/books/feeds/volumes/4yo9AAAAYAAJ 2010-08-30T20:37:27.000Z New Arabian nights Robert Louis Stevenson 1922 THE SUICIDE CLUB STORY OF THE YOUNG MAN WITH THE CREAM TARTS DURING his residence in London, the accomplished Prince Florizel of Bohemia gained the ... 386 pages book 4yo9AAAAYAAJ HARVARD:HWP51H Fiction New Arabian nights

http://www.google.com/books/feeds/volumes/z2Yf1FX02EkC 2010-08-30T20:37:27.000Z Robert Louis Stevenson Richard Ambrosini Richard Dury 2006 As the editors point out in their Introduction, Stevenson reinvented the “personal essay” and the “walking tour essay,” in texts of ironic stylistic ... 377 pages book z2Yf1FX02EkC ISBN:0299212246 ISBN:9780299212247 Univ of Wisconsin Pr Literary Criticism Robert Louis Stevenson writer of boundaries
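If you would rather script this than use curl, the same query URL can be built in Python and the volume IDs pulled out of the Atom response. This is only a rough sketch: `build_search_url` and `parse_entries` are names I made up, and `SAMPLE_FEED` is a hand-written stand-in that carries just the Atom `id` and `title` elements for two of the results above, not the real response.

```python
from urllib.parse import urlencode, quote
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
BASE = "http://books.google.com/books/feeds/volumes"


def build_search_url(query, max_date="Jan 01_2 1923"):
    """Rebuild the search URL used in the curl command above."""
    params = {"tbs": "cd_max:" + max_date, "q": query}
    # Keep ':' literal and encode spaces as %20, matching the curl example.
    return BASE + "?" + urlencode(params, safe=":", quote_via=quote)


def parse_entries(feed_xml):
    """Return (volume_id, title) pairs for each Atom entry in the feed."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id", default="")
        title = entry.findtext(ATOM + "title", default="")
        # The volume ID is the last path segment of the entry's Atom id.
        results.append((entry_id.rsplit("/", 1)[-1], title))
    return results


# Hand-written sample for illustration only; the real feed has more elements.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://www.google.com/books/feeds/volumes/ENMWAAAAYAAJ</id>
    <title>Kidnapped</title>
  </entry>
  <entry>
    <id>http://www.google.com/books/feeds/volumes/WZ0vAAAAMAAJ</id>
    <title>Treasure Island</title>
  </entry>
</feed>"""

print(build_search_url("Stevenson"))
for volume_id, title in parse_entries(SAMPLE_FEED):
    print(volume_id, title)
```

Fetching the URL (with `urllib.request.urlopen`, say) is left out on purpose so the sketch runs without touching the network.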
Now it would be nice if the Atom included <link> elements for the epubs themselves. Perhaps the feed could even use the recently released “acquisition” link relation defined by OPDS v1.0, with each entry carrying a link that points at its epub.
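To make the idea concrete, here is a small sketch of what adding such a link to an entry could look like. The `rel` value is the OPDS acquisition relation and `application/epub+zip` is the epub media type; the `href` is a made-up example.org URL (I don't know the real epub URL pattern), and `add_acquisition_link` is just a name I picked.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
OPDS_ACQUISITION = "http://opds-spec.org/acquisition"
EPUB_TYPE = "application/epub+zip"


def add_acquisition_link(entry, epub_url):
    """Append an OPDS acquisition <link> for an epub to an Atom entry."""
    link = ET.SubElement(entry, "{%s}link" % ATOM_NS)
    link.set("rel", OPDS_ACQUISITION)
    link.set("type", EPUB_TYPE)
    link.set("href", epub_url)
    return link


# Serialize Atom elements without an "ns0:" prefix.
ET.register_namespace("", ATOM_NS)

entry = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom"><title>Kidnapped</title></entry>'
)
# Hypothetical epub location, for illustration only.
add_acquisition_link(entry, "http://example.org/epubs/ENMWAAAAYAAJ.epub")
print(ET.tostring(entry, encoding="unicode"))
```

A client that understands OPDS could then find the epub by looking for the link with that `rel`, without having to guess at URL patterns.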
Theoretically it should be possible to construct the appropriate link for the epub from the data that is available in the Atom. But publishing the epub URLs explicitly, in a machine-readable way, would make the epubs much easier to use. Unfortunately we would still be limited to dipping into the full dataset with a query, instead of being able to crawl the entire archive with something like a paged Atom feed. From a conversation over on get-theinfo it appears that this approach might not be as easy as it sounds. Also, it turns out that, as if by magic, many of the books have already been uploaded to the Internet Archive: 902,188 of them, in fact.
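For what it's worth, crawling a paged Atom feed is simple on the client side: you keep following each page's `rel="next"` link until there isn't one. The sketch below assumes a feed that exposes such links, which as far as I can tell Google Books does not offer for the whole archive. The pages and example.org URLs are invented, and `fetch` is a stand-in for whatever actually retrieves a URL.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def crawl_paged_feed(start_url, fetch):
    """Yield every entry id, following rel="next" links from page to page."""
    url = start_url
    seen = set()
    while url and url not in seen:  # guard against a feed that loops
        seen.add(url)
        root = ET.fromstring(fetch(url))
        for entry in root.findall(ATOM + "entry"):
            yield entry.findtext(ATOM + "id", default="")
        # Look for the link to the next page, if there is one.
        url = None
        for link in root.findall(ATOM + "link"):
            if link.get("rel") == "next":
                url = link.get("href")
                break


# Two invented pages standing in for a real paged feed.
PAGES = {
    "http://example.org/feed/page1": """<feed xmlns="http://www.w3.org/2005/Atom">
      <link rel="next" href="http://example.org/feed/page2"/>
      <entry><id>urn:example:book-1</id></entry>
      <entry><id>urn:example:book-2</id></entry>
    </feed>""",
    "http://example.org/feed/page2": """<feed xmlns="http://www.w3.org/2005/Atom">
      <entry><id>urn:example:book-3</id></entry>
    </feed>""",
}

for entry_id in crawl_paged_feed("http://example.org/feed/page1", PAGES.__getitem__):
    print(entry_id)
```

Passing `fetch` in as a function keeps the crawl logic separate from the HTTP details, so it can be tried against canned pages like these before pointing it at a real server.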
So maybe not that much work needs to be done. But presumably more public domain content will become available from Google Books, and it would be nice to be able to say there was at least one other copy of it elsewhere, for digital preservation purposes. It would be great to see Google step up and do some good, by making their API usable for folks wanting to replicate the public domain content. Still, at least they haven’t done evil by locking it away completely. Dan Brickley had an interesting suggestion to possibly collaborate on this work.