Neil Gaiman wrote recently in the Guardian about reading, writing and libraries. I suspect you’ve seen it already, but it sat in my to-read pile (on my Kindle) till now. I think I’m going to read it to my kids tonight.
One of the nice things about reading on my Kindle is I’m much more likely to highlight and annotate. Really, I would’ve highlighted the whole thing if I could have. But anyway, here is what I highlighted:
There were noises made briefly, a few years ago, about the idea that we were living in a post-literate world, in which the ability to make sense out of written words was somehow redundant, but those days are gone: words are more important than they ever were: we navigate the world with words, and as the world slips onto the web, we need to follow, to communicate and to comprehend what we are reading. People who cannot understand each other cannot exchange ideas, cannot communicate, and translation programs only go so far.
I do not believe that all books will or should migrate onto screens: as Douglas Adams once pointed out to me, more than 20 years before the Kindle turned up, a physical book is like a shark. Sharks are old: there were sharks in the ocean before the dinosaurs. And the reason there are still sharks around is that sharks are better at being sharks than anything else is. Physical books are tough, hard to destroy, bath-resistant, solar-operated, feel good in your hand: they are good at being books, and there will always be a place for them. They belong in libraries, just as libraries have already become places you can go to get access to ebooks, and audiobooks and DVDs and web content.
The (Public) Library
A library is a place that is a repository of information and gives every citizen equal access to it. That includes health information. And mental health information. It’s a community space. It’s a place of safety, a haven from the world. It’s a place with librarians in it. What the libraries of the future will be like is something we should be imagining now … If you do not value libraries then you do not value information or culture or wisdom. You are silencing the voices of the past and you are damaging the future.
It’s hard to read Yves Raimond and Tristan Ferne’s paper The BBC World Service Archive Prototype and not imagine a possible future for radio archives, archives on the Web, and archival description in general.
Actually, it’s not just the future, it’s also the present, as embodied in the BBC World Service Archive prototype itself, where you can search and listen to 45 years of radio, and pitch in by helping describe it if you want.
As their paper describes, Raimond and Ferne came up with some automated techniques to connect text about the programs (derived directly from the audio, or indirectly through supplied metadata) to Wikipedia and DBpedia. This resulted in some 20 million RDF assertions, which form the database that the (very polished) web application sits on top of. Registered users can then help augment and correct these assertions. I can only hope that some of these users are actually BBC archivists, who can also help monitor and tune the descriptions provided by the general public.
Their story is full of win, so it’s understandable why the paper won the 2013 Semantic Web Challenge:
They used WikipediaMiner to take a first pass at entity extraction of the text they were able to collect for each program. The MapHub project uses WikipediaMiner for the same purpose of adding structure to otherwise unstructured text.
They used Amazon Web Services (aka the cloud) to do in the space of 2 weeks what would otherwise have taken them 4 years, for a fixed, one-time cost.
They used ElasticSearch for search, instead of trying to squeeze that functionality and scalability out of a triple store.
They wanted to encourage curation of the content, so they put an emphasis on usability and design that is often absent from Linked Data prototypes.
They have written in more detail about the algorithms that they used to connect up their text to Wikipedia/DBpedia.
Their github account reflects the nuts and bolts of how they did this work. Specifically, their rdfsim Python project, which vectorizes a SKOS hierarchy for determining the distance between concepts, seems like a really useful approach to disambiguating terms in text.
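I haven't used rdfsim myself, but the core idea, as I understand it, can be sketched: give each concept a sparse vector containing itself and its broader ancestors with decaying weights, then compare concepts with cosine similarity. The toy hierarchy and decay factor below are my own inventions for illustration, not rdfsim's actual data or parameters:

```python
import math

# A toy SKOS-ish hierarchy: concept -> broader (parent) concept.
# Purely illustrative; rdfsim works with real SKOS data.
BROADER = {
    "ambient": "electronic",
    "techno": "electronic",
    "electronic": "music",
    "baroque": "classical",
    "classical": "music",
}

def vectorize(concept, decay=0.5):
    """Sparse vector: the concept plus its ancestors, with decaying weights."""
    v, w = {}, 1.0
    while concept is not None:
        v[concept] = w
        w *= decay
        concept = BROADER.get(concept)
    return v

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))
```

With vectors like these, "ambient" scores closer to "techno" (they share the "electronic" ancestor) than to "baroque", which is the kind of signal that helps pick the right sense of an ambiguous term.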
But it is the (implied) role of the archivist, as the professional responsible for working with developers to tune these algorithms, evaluating user contributions, and helping describe the content themselves, that excites me the most about this work. It's the future role of the archive that is at stake too. In another paper, Raimond, Smethurst, McParland and Lowis describe how having this archival data allows them to augment live BBC News subtitles with links to the audio archive, where people can follow their nose (or ears in this case) to explore the context around news stories.
The fact that it’s RDF and Linked Data isn’t terribly important in all this. But the importance of using globally curated, openly licensed entities derived from Wikipedia cannot be overstated. It’s the conceptual glue that allows connections to be made. As Wikidata grows in importance at Wikipedia it will be interesting to see if it supplants the role that DBpedia has been playing to date.
And of course, it’s exciting because it’s not just anyone doing this, it’s the BBC.
My only nit is that it would be nice to see some of the structured data they’ve collected expressed more in their HTML. For example, they have minted a URI for Brian Eno which lists radio programs related to him. Why not display his bio, and perhaps a picture? Why not link to radio programs for people he is associated with, like David Byrne or David Bowie? Why not express some of this semantic metadata as microdata or RDFa in the page, to enable search engine optimization and reuse?
Luckily, it sounds like they have invested in the platform and data they would need to add these sorts of features.
PS. Apologies to the Mighty Boosh for the title of this post. “The future’s dead … Everyone’s looking back, not forwards.”
Internet Archive recently announced their new Availability API for checking if a representation for a given URL is in their archive with a simple HTTP call. In addition to the API they highlighted a few IA related projects, including a Wordpress plugin called Broken Link Checker which will check the links in your Wordpress site, and offer to fix any broken ones using an Internet Archive URL, if it is available based on a call to the Availability API.
I installed the plugin here and let it run for a bit. It detected 3898 unique URLs in 4910 links, of which 482 are broken. This amounts to 12% link rot … but there were also 1038 redirects that resulted in a 200 OK, so there may be a fair bit of reference rot lurking there. The plugin itself doesn’t provide a summary of HTTP status codes for the broken URLs, but they are listed one by one in the broken link report. Since I could see the HTTP status codes in the table, I figured you could easily log into your Wordpress database and run a query like this to get a summary:
mysql> select http_code, count(*) from wp_blc_links where broken is true group by http_code;
+-----------+----------+
| http_code | count(*) |
+-----------+----------+
|         0 |      113 |
|       400 |        9 |
|       403 |       15 |
|       404 |      333 |
|       410 |        5 |
|       416 |        1 |
|       500 |        3 |
|       502 |        1 |
|       503 |        2 |
+-----------+----------+
I’m assuming the 113 (23% of the broken links) are DNS lookup or connection failures. Once the broken links are identified, you have to manually curate each link to decide whether you want to link out to the Internet Archive, based on whether it’s possible, and whether the most recent snapshot is appropriate or not. This can take some time, but it proved useful: I uncovered a number of fat-fingered URLs that looked like they never worked, which I was able to fix. Of the legitimately broken URLs, 136 (~28%) were available at the Internet Archive. Once you’ve decided to use an IA URL, the plugin can quickly rewrite the original content without requiring you to go in and tweak it yourself.
One thing that would be nice is the ability to query the API for a representation of the resource based on when the post was authored. For example, my Snakes and Rubies post had a broken link to http://snakesandrubies.com and the plugin correctly found that it was available at the Internet Archive with an API query like:
% curl --silent 'http://archive.org/wayback/available?url=snakesandrubies.com' | python -mjson.tool
When requesting that URL you get this hit in the archive: http://web.archive.org/web/20130131030609/http://snakesandrubies.com/ but that’s an archive of a cybersquatted version of the page:
If the timestamp of the blogpost were used perhaps a better representation like this could’ve been found automatically, or at least it could have been offered first?
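As it happens, the Availability API does accept an optional timestamp parameter (YYYYMMDDhhmmss, or a prefix like YYYYMMDD) that returns the capture closest to that moment, so a plugin could in principle pass along the post's publication date. A minimal sketch in Python (the date below is just an example, not the actual date of my post):

```python
import json
import urllib.parse
import urllib.request

def availability_query(url, timestamp=None):
    # Build an Availability API request; the optional timestamp asks
    # for the capture closest to that moment in time.
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "http://archive.org/wayback/available?" + urllib.parse.urlencode(params)

def closest_snapshot(response):
    # The API responds with {"archived_snapshots": {"closest": {...}}}
    # when a capture exists, and an empty archived_snapshots otherwise.
    return response.get("archived_snapshots", {}).get("closest")

# Usage (makes a network call), e.g. for a post written in December 2005:
# with urllib.request.urlopen(availability_query("snakesandrubies.com", "20051204")) as resp:
#     print(closest_snapshot(json.load(resp)))
```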
Based on the svn log for the plugin, it appears to have been first released on 2007-10-08, and it has been downloaded 2,099,072 times since then. When people gripe about the Web being broken by design, I think it’s good to remember that tools like this exist to help make it better, one website and link at a time.
You probably already saw the news about NYPL’s Building Inspector that was released yesterday. If you haven’t, definitely check it out…it’s a beautiful app. I hope Building Inspector represents the shape of things to come for engagement with the Web by cultural heritage institutions.
I think you’ll find that it is strangely addictive. This is partly because you get to zoom in on random closeups of historic NYC maps, which is like candy if you are a map junkie, or have spent any time in the city. More importantly, you get the feeling that you are helping NYPL build and enrich a dataset for further use. I guess you could say it’s gamification, but it feels more substantial than that.
Building Inspector hits a sweet spot for a few reasons:
It has a great name. Building Inspector describes the work you will be doing, and contextualizes the activity with a profession you may already be familiar with.
It opens with some playful yet thoughtfully composed instructions that describe how to do the inspection. The instructions aren’t optional, but can easily be dismissed. They are fun while still communicating essential facts about what you are going to be doing.
There is an easy way to review the work you’ve done so far by clicking on the View Progress link. You use your Twitter, Facebook or Google account to log in. It would be cool to see a global progress view: everyone’s edits, perhaps in realtime.
The app is very responsive, displaying new parts of the map with sub-second response times.
The webapp looks and works great as a mobile app. I’d love to hear more about how they did this, since they don’t appear to be using anything like Twitter Bootstrap to help. The mobile experience might be improved a little bit if you could zoom and pan with touch gestures.
It uses LeafletJS. I’ve done some very simplistic work with Leaflet in the past, so it is good to see that it can be customized this much.
NYPL is embracing the cloud. Building Inspector is deployed on Heroku, with map tiles on Amazon’s CloudFront. This isn’t a big deal for lots of .com properties, but for libraries (even big research libraries like NYPL) I reckon it is a bigger deal than you might suspect.
The truly hard part of recognizing the outlines of buildings with OpenCV and other tools has been made available by NYPL on Github for other people to play around with.
Another really fun thing about the way this app was put together was its release, apparently coordinated with an article at Wired, and the subsequent follow up on the nypl_labs Twitter account.
Quite a first day! It would be interesting to know what portion of the work this represents. Also, I’d be curious to see if NYPL is able to sustain this level of engagement to get the work done.
Day 2 Update
If I’m doing the math right (double check me if you really care), between those two data points there were 6,499 inspections over 63,000 seconds – an average of about 0.10 inspections/second. Whereas between points 3 and 4 of yesterday it looks like they averaged 1.91 inspections/second.
Just a quick note for my future self, that Verne Harris’ notion of the “archival sliver” seems like a great sanity inducing antidote to the notion of total archives.
The archival record is best understood as a sliver of a sliver of a sliver of a window into process. It is a fragile thing, an enchanted thing, defined not by its connection to “reality”, but by its open-ended layerings of construction and reconstruction.
The Archival Sliver: Power, Memory and Archives in South Africa.
Today is Ada Lovelace Day and I wanted to join libtechwomen in celebrating the contribution of Suzanne Briet. Briet’s thinking helped found the field of Information Science, or Documentation Science as it was known then. Documentation was a field of study started by Paul Otlet and Henri La Fontaine which focused on fixed forms of documents (e.g. books, newspapers). Briet’s contribution expanded the purview of the study of documents to include the social context in which documents are situated. Or as Ronald Day says:
Briet’s writings stressed the importance of cultural forms and social situations and networks in creating and responding to information needs, rather than seeing information needs as inner psychological events. She challenges our common assumptions about the role and activities of information professionals and about the form and nature of documents. She speaks to our age of digital libraries, with their multi-documentary forms, but she also challenges the very conceptual assumptions about the form and the organization of knowledge in such digital libraries. Readers of What Is Documentation? will find themselves returning to Briet’s book, again and again, coming upon ever new insights into current problems and ever new challenges to still current assumptions about documents and libraries and about the origins, designs and uses of information management and its systems.
As you may know from previous blog posts here, I’m kind of fascinated with the idea of how the Document is presented in Web Architecture, and how it influences technologies like Linked Data. I spent some time trying to organize my thoughts about this intersection of Libraries, Archives, Information and the Web in a paper Linking Things on the Web: A Pragmatic Examination of Linked Data for Libraries, Archives and Museums. I was lucky to have Dorothea Salo read an early draft of the paper. Among her many useful comments was one which encouraged me to be a bit more precise in my attribution of the term document in information science. I wasn’t even mentioning Briet’s contribution and instead just named Otlet and La Fontaine, with a citation to Michael Buckland. I cited Buckland’s What Is A Document, which funnily enough is partly responsible for raising awareness about Briet’s contribution. Dorothea rightly encouraged me to dig a bit deeper, and to change this paragraph:
The terminology of documents situates Linked Data amidst an even older discourse concerning the nature of documents (Buckland, 1997), or documentation science more generally. Documentation science is a field of inquiry established by Paul Otlet and Henri La Fontaine in the 1930s, which was renamed as information science in the 1960s.

to:
The terminology of documents situates Linked Data amidst an even older discourse concerning the nature of documents (Buckland, 1997), or documentation science more generally. Documentation science, or the study of documents is an entire field of study established by Otlet (1934), continued by Briet (1951), and more recently Levy (2001). As the use of computing technology spread in the 1960s documentation science was largely subsumed by the field of information science. In particular, Briet’s contributions expanded the notion of what is commonly understood to be a document, by reorienting the discussion to be in terms of objects that function as organized physical evidence (e.g. an antelope in the zoo, as opposed to an antelope grazing on the African savanna). The evidentiary nature of documents is a theme that is particularly important in archival studies.
So thanks Dorothea, and thank you Suzanne Briet for grounding what I was finding confounding in Web Architecture. Previously my only exposure to Briet’s thinking was revival literature about her, so I decided to take this opportunity to buy a copy of What Is Documentation to have for my bookshelf. It’s also available online on the Web, which seems fitting, right?
At the moment, I don’t have a job. The government has been shut down, and with it my job at the Library of Congress. I’ve had the good fortune to be able to pick up some part time work here and there with a few friends, to help make ends meet. I know I shouldn’t say it, but it has actually been kind of rejuvenating to scramble and brainstorm outside of the “permanent” job mentality that I supposedly have. It’s sounding pretty unlikely that I will be paid when the Federal Government re-opens, and it’s not really even clear at this point when it will re-open. Meanwhile there is a mortgage to pay, mouths to feed, and not a whole lot of wiggle room in our budget, or savings to speak of. But we’ll scrape by, like everyone else in the same boat.
But this post isn’t about the shutdown, and it’s not really about me. It’s about a startup, and it’s about my wife Kesa.
Kesa and I met at a startup in New York City in 2000. It was a magical time. We were helping start a business from the ground up, living in a truly amazing city, in a tiny one room apartment that barely fit us and our bookcase…and we were falling in love. We lived in Brooklyn, but our office was in downtown Manhattan, just off Wall Street, and a few blocks from the World Trade Center.
9/11 was an explosive, searing light that annihilated and destroyed…but somehow it also briefly illuminated delicate, evanescent, and commonplace things, making them easier to see. Most of all, the events of 9/11 made me acutely conscious of how important every day I had with Kesa was. One evening later that year I made Italian Wedding Soup for dinner, and asked her if she wanted to get married. She said yes. I think she liked the soup too.
Around that same time Kesa also decided to return to teaching. She had applied for a job in the Brooklyn Public School system and heard back the morning of September the 11th. I guess the day crystallized some things for her too. She remembered her experience teaching K-3rd grade kids how to read in New Orleans. She remembered what it felt like to help make the world a better place, one student at a time, instead of working her butt off to make some software better, that would (maybe) give some big corporations a competitive edge over some other big corporations, so they could sell more widgets. She inspired me in a way that I needed to be inspired, as our country slipped into pointless retaliation, and war.
Over the last 13 years, Kesa has largely been doing just that: teaching 5th grade in Brooklyn, Chicago and here in Washington DC. She took some time off to be with our kids when they were born, but she went back each time. Her philosophy as a teacher has always been to understand each student for their uniqueness. Don’t get me wrong, she is big into the academics; but at the end of the day, it was about connecting with the kids, and seeing them happy and thriving together. The times I went to her class I got to see the evidence of that first hand.
When Maeve (#3!) was born Kesa decided to give something else a try. She started tutoring kids in the neighborhood to see if she could help make ends meet that way. Somewhere along the way the math and reading transitioned to sewing and other crafts. She had caught the makers bug like millions of other people around the country, who are trying to reconnect our culture. She got talking to people like David and Lina Brunton who are trying to bootstrap a farm outside Annapolis, MD. The kids she taught got a real kick out of learning to make their own pajamas, bags and pillows…unwittingly they encouraged her to do more, and to think a bit bigger. She felt like she was onto something.
So in May of this year Kesa went to Baltimore to register the business Freehands Craft Studio. She found an inexpensive space to rent on the 2nd floor of a strip mall near our house in Silver Spring. The photo to the right was taken when she was painting the walls in the new space. I like it because it captures how earnest she was (and still is) about getting Freehands off the ground.
I watched as she networked on neighborhood discussion lists, talked to friends, and friends of friends, and somehow pulled together a small group of teachers with specialties in knitting, paper making, quilting, sewing, jewelry making and collage. Freehands had a few exploratory classes over the summer to figure out logistics, and this fall the classes started in full swing. Last weekend they were at the Silver Spring Mini Maker’s Faire where they demonstrated how to quickly make reusable lunch sacks, and answered questions for 5 hours from tons of people who were interested in what Freehands was doing.
So Kesa is working at a startup again. But this time it’s her startup. As the politicians fight in Congress about how to do their job, it means so much to me to see her building Freehands Craft Studio with her friends. It is a lot of work. I’m having to look after the kids a lot more when she is off teaching a class, or doing outreach of some kind. The startup expenses have set us back a bit more than we expected, and at an awkward time. There’s still a lot more to do to get the business rolling, to build momentum, and let folks outside of our little corner of Silver Spring know about it. But I can tell it’s what Kesa loves doing, because she is smiling when she’s doing it, she gets energy from doing it, the work illuminates her life, and our little family.
So I wrote this post for two people.
It’s for you Kesa. To let you know that even when I grumble about having to rush home to look after the kids, and scrape together a meal and clean up our house so that it doesn’t look like a tornado hit it– in my heart of hearts I’m so proud of you. Your Freehands experiment gives me hope and purpose. You make me happy, just like back when I made that Italian Wedding Soup.
And this post is also for you. There’s no better time to start things up than when other people are shutting things down, right? Take some time to consider or remember what you want to start up. It can just be a side project for now. Who knows what it will grow into?
Oh, and if you want to help Kesa and Freehands Craft Studio please consider donating to their Indiegogo campaign, or sharing information about it with others using your social-media-platform-of-choice. There’s only about a day left, and they could really use your help. You can get a little mug or a reusable lunch sack or handmade card as a thank you … and you will become part of this little dream too.
Earlier this morning Martin Malmsten of the National Library of Sweden asked an interesting question on Twitter:
Martin was asking about the Linked Open Data that the Library of Congress publishes, and how the potential shutdown of the US Federal Government could result in this data being unavailable. If you are interested, click through to the tweet and take a minute to scan the entire discussion.
Truth be told, I’m sure that many could live without the Library of Congress Subject Headings or Name Authority File for a day or two…or honestly even a month or two. It’s not like this data’s currency is essential to the functioning of society, like financial, weather or space data, etc. But Martin’s point is that it raises an interesting general question about the longevity of Linked Open Data, and how it could be made more persistent.
In case you are new to it, a key feature of Linked Data is that it uses the URL to allow a distributed database to grow organically on the Web. So, in practice, if you are building a database about books, and you need to describe the novel Moby Dick, your description doesn’t need to include everything about Herman Melville. Instead it can assert that the book was authored by an entity identified by the URL
When you resolve that URL you can get back data about Herman Melville. For pragmatic reasons you may want to store some of that data locally in your database. But you don’t need to store all of it. If you suspect it has been updated, or need to pull down more data you simply fetch the URL again. But what if the website that minted that URL is no longer available? Or what if the website is still available but the original DNS registration expired, and someone is cybersquatting on it?
Admittedly some work has happened at the Dublin Core Metadata Initiative around the preservation of Linked Data vocabularies. The DCMI is taking a largely social approach to this problem, where vocabulary owners and memory institutions interact within the context of a particular trust framework centered on DNS. But the preservation of vocabularies (which are also identified with URLs) is really just a subset of the much larger problem of Web preservation more generally. Does web preservation have anything to offer for the preservation of Linked Data?
When reading the conversation Martin started, I was reminded of a demo my colleague Chris Adams gave that showed how World Digital Library item metadata can be retrieved from the Internet Archive. WDL embeds item metadata as microdata in their HTML, and since the Internet Archive archives that HTML, you can get the metadata back from the Internet Archive.
So take this page from WDL:
It turns out this particular page has been archived by the Internet Archive 27 times. So with a few lines of Python you can use Internet Archive as a metadata service:
import urllib
import microdata
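I don't have Chris's exact script, but the gist might look something like the sketch below, which assumes the third-party Python microdata package and its get_items call; the timestamp and URL are placeholders, not the actual WDL page discussed above:

```python
import urllib.request

def wayback_snapshot_url(timestamp, url):
    # Address a specific Wayback Machine capture of a page.
    return "http://web.archive.org/web/%s/%s" % (timestamp, url)

def items_from_archive(timestamp, url):
    # Fetch the archived HTML and parse the microdata items out of it.
    # Assumes the microdata package (pip install microdata); makes a
    # network call, so it's commented out of the tests below.
    import microdata
    html = urllib.request.urlopen(wayback_snapshot_url(timestamp, url))
    return microdata.get_items(html)
```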
In case you missed it, an interesting study by Jonathan Zittrain and Kendra Albert was written up in the New York Times with the provocative title In Supreme Court Opinions, Web Links to Nowhere. In addition to the article, the study itself is worth reading for its compact review of the study of link rot on the Web, and its stunning finding that 49% of the links in US Supreme Court Opinions are broken.
This 49% is in contrast with a similar, recent study by Raizel Liebler and June Liebert of the same links, which found a much lower rate of 29%. The primary reason for this discrepancy was that Zittrain and Albert looked at reference rot in addition to link rot.
The term reference rot was coined by Rob Sanderson, Mark Phillips and Herbert Van de Sompel in their paper Analyzing the Persistence of Referenced Web Resources with Memento. The distinction is subtle but important. Link rot typically refers to when a URL returns an HTTP error of some kind that prevents a browser from rendering the referenced content. This error can be the result of the page disappearing, or the webserver being offline. Reference rot refers to when the URL itself seems to work (returning either a 200 OK or redirect of some kind), but the content that comes back is no longer the content that was being referenced.
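The distinction can be put in (simplified) code. This is my own shorthand, not Sanderson, Phillips and Van de Sompel's formal definition; note that detecting reference rot requires comparing the live content against what was originally cited, which is exactly why archives matter:

```python
def looks_like_link_rot(status_code):
    # Link rot: the request fails outright -- a connection or DNS
    # failure (no HTTP response at all) or an HTTP error status.
    return status_code == 0 or status_code >= 400

def looks_like_reference_rot(status_code, content, cited_content):
    # Reference rot: the URL still "works" (2xx, or a redirect that
    # resolves), but what comes back is no longer what was cited.
    return not looks_like_link_rot(status_code) and content != cited_content
```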
The New York Times article includes a great example of reference rot. The website http://ssnat.com/ which was referenced in a Supreme Court Opinion by Justice Alito.
The DNS registration expired, and was picked up by someone who knew its significance and turned it into an opportunity to educate people about links in legal documents. The NY Times article calls this nameless person a “prankster” but it is a wonderful hack.
One thing the NY Times article didn’t mention is that the website has been captured 140 times by the Internet Archive, and the original as referenced by Justice Alito is still available. It seemed like a missed opportunity to highlight the incredibly important work that Brewster Kahle and his merry band of Web archivists are doing. It would be interesting to see how many of the 555 extracted links are available in the Internet Archive, but I couldn’t find the list in, or linked from, the article.
Zittrain and Albert on the other hand do mention the Internet Archive’s work in the context of perma.cc which is their proposed solution to the problem of broken links.
… the Internet Archive is dedicated to comprehensively archiving web content, and thus only passes through a given corner of the Internet occasionally, meaning there is no guarantee that a given page or set of content would be archived to reflect what an author or editor saw at the moment of citation. Moreover, the IA is only one organization, and there are long-term concerns around placing all of the Internet archiving eggs into one basket. A system of distributed, redundant storage and ownership might be a better long-term solution.
This seems like a legitimate concern, that there should be some ability to archive a website at a particular point in time. There are 27 founding members of perma.cc. There is a strong legal flavor to some of the participants, but perma.cc doesn’t appear to be only for legal authors; the website states:
perma.cc helps authors and journals create permanent archived citations in their published work. perma.cc will be free and open to all soon.
It’s good to see Internet Archive as one of the founding members. It remains to be seen what perma.cc’s approach to a distributed, redundant storage will be. For the system to actually be distributed there has to be more to it than listing 27 organizations that agree that it’s a good idea. It’s not like Internet Archive operates on its own, since they work closely with the International Internet Preservation Consortium which has 44 organizational members, many of whom are national libraries. I didn’t see the IIPC on the list of founding members for perma.cc.
If perma.cc were to take off I wonder what it would mean for publishers’ web analytics. If lots of publishers start putting perma.cc URLs in their publications what would this mean for the publishers of the referenced content, and their web analytics? Would it be possible for publishers to see how often their content is being used on perma.cc, and a rough approximation of who they are, what browsers they are using, etc?
Nit-picking aside, it’s awesome to see another player in the Web archiving space, especially one from Web veterans who understand how the Web works, and its significance for society.
Update: Leigh Dodds has an excellent post about perma.cc’s terms of service.
Where the Heart Beats: John Cage, Zen Buddhism, and the Inner Life of Artists by Kay Larson My rating: 4 of 5 stars
I’m no expert on John Cage or Zen Buddhism, so I’m not a good person to speak to the accuracy of the material in this book. But Kay Larson provides a very accessible and inspired look at the life of an artist, who found peace and inspiration in the teachings of DT Suzuki, and how he went on to be a formative influence on postmodern art. The story of Cage’s relationship with Merce Cunningham and their inner circle of friends and artists was lovingly told. One of my favorite parts of the book was Larson’s discovery of a set of cards that were typed up for each meeting of “The Club”, a gathering of artists and thinkers in Greenwich Village. She used these cards to piece together the chronology of Cage’s development around the time of his Lecture on Something and Lecture on Nothing. There are so many great Cage quotes scattered throughout the book too. I wish I had read the book on my Kindle so I could have highlighted more, and included some of them here. I’ve had a copy of Silence for years, and I think I’m going to reread some of it now that I know so much more about the context of John Cage’s life. If you’ve ever spent time living in New York City, this book is bound to make you miss it just a little bit.