Yes, GettyImages have decided to encourage people to embed their images. Despite opinions to the contrary I think this is A Good Thing. So what happens when you embed a Getty image into your HTML? To get something like this in your page:
you need to include a little snippet of HTML in your pages:
<iframe src="//embed.gettyimages.com/embed/81901686?et=4td6Xm2f0k6pMgQVX7pNFA&sig=fhRom4eoepnZbyWjZ0_2N3SdVG1dxQTC2GUAK4XrPjg=" width="462" height="440" frameborder="0" scrolling="no"></iframe>
which in turn embeds this HTML into your page:
<base target="_parent" />
<title>20 - 30 year old female worker pulls box off of warehouse shelf [Getty Images]</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<!--[if lt IE 10]>
<link rel="stylesheet" type="text/css" href="//embed.gettyimages.com/css/style.css" />
<section id="embed-body" data-asset-id="81901686" data-collection-id="41">
<a href="http://gty.im/81901686" target="_blank"><img src="http://d2v0gs5b86mjil.cloudfront.net/xc/81901686.jpg?v=1&c=IWSAsset&k=2&d=F5B5107058D53DF50D8BA2399504758256BF753C679B89B417A38C0E9F1FBB9F&Expires=1394499600&Key-Pair-Id=APKAJZZHJ4LGWQENK3OQ&Signature=UC1YXxhGwSAY0BduwMZqnFQ7fcAQTdCksDvYu4WVmNWlTou7NktH7rZ8uk7BLbupJ4sp0ijiDaA93Yi2XijnC-TtcUO1Kylcew4nZpM~Al9jD0OSfx5yNe7jcIalweGpLGOdMLTXn0wRs6XfEh3~1fc~csMrAesHJkUayhBqNxo6Xja-35XQLx98d5fg6UXazOsCRT-UzebWA4dFURz~BSxXgq0RtU~LhKVKRZvkUTvl2RrsqBcN4bW3i~dbNMwHKn~7s9dMy5CxH-7k4ELyJaBClWEO2Jgr5WV9cXy~WGBQnNd-5Lb7CMcZclzn88-LbmDnFcO~BVLgtSU5x-KTpw__" /></a>
<li class="gi-logo icon icon-logo"></li>
<li>Bob O'Connor / Stone</li>
<a href="//twitter.com/share" title="Share on Twitter" class="twitter-share-button" data-lang="en" data-count="none" data-url="http://gty.im/81901686"></a>
<a class="icon-tumblr" target="_self" title="Share on Tumblr" href="//www.tumblr.com/share/video?embed=%3Ciframe%20src%3D%22%2f%2fembed.gettyimages.com%2fembed%2f81901686%3fet%3d4td6Xm2f0k6pMgQVX7pNFA%26sig%3dfhRom4eoepnZbyWjZ0_2N3SdVG1dxQTC2GUAK4XrPjg%3d%22%20width%3D%22462%22%20height%3D%22440%22%20frameborder%3D%220%22%20%3E%3C%2Fiframe%3E"></a>
<aside class='modal embed-modal' style='display: none;'>
<a class="icon modal-close icon-close" href="#close" title="Close"></a>
<h3>Embed this image</h3>
<p>Copy this code to your website or blog. <a href="http://www.gettyimages.com/helpcenter" target="_blank" id="learn-more">Learn more</a></p>
<p>Note: Embedded images may not be used for commercial purposes.</p>
By embedding this image, you agree to Getty Images
You can see Amazon’s CloudFront is being used as a CDN for the images, and that Getty are using CloudFront’s Signed URLs to expire the images…it looks like after 24 hours? This isn’t a problem because Getty are serving the page up, but anyone that’s tried to snag the image URL for reuse (Google Images?) will end up getting a 400 error.
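Just for illustration, here's how you might pull the Expires timestamp out of one of these signed URLs and check whether it has lapsed. The URL below is a shortened, hypothetical stand-in for the real CloudFront URL above:

```python
# Sketch: extract the Expires parameter from a CloudFront signed URL
# and check whether it has lapsed. The URL is a shortened, made-up example.
from urllib.parse import urlparse, parse_qs

def signed_url_expires(url):
    """Return the Unix timestamp in the URL's Expires parameter, or None."""
    qs = parse_qs(urlparse(url).query)
    expires = qs.get("Expires")
    return int(expires[0]) if expires else None

def is_expired(url, now):
    expires = signed_url_expires(url)
    return expires is not None and now >= expires

url = "http://example.cloudfront.net/xc/81901686.jpg?Expires=1394499600&Signature=abc"
print(signed_url_expires(url))          # 1394499600
print(is_expired(url, 1394499600 + 1))  # True
```

Since Getty serve the iframe themselves, they can mint a fresh signature on every request; it's only a hot-linked copy of the raw image URL that goes stale.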
I thought it was interesting that the embedded iframe gives you not only the image, author and collection, but also links to re-share the image on Twitter and Tumblr. I guess this is Viral Marketing 101, but it’s smart I think, since it encourages reuse, and the recycling of content on the Web. Conspicuously absent from the reshare buttons is Facebook – maybe there’s a story there? Also, as we’ll see in a second, the description of the image is missing from the embedded view:
20 - 30 year old female worker pulls box off of warehouse shelf
Of course the other big thing the iframe does is give Getty an idea of where their content is being used. Anyone who uses this one line embed iframe will trigger an HTTP request to an embed.gettyimages.com URL (hosted on Amazon EC2, incidentally). These requests, and their referral information, can be stashed away and analyzed, so that Getty can get a picture of who is using their content, and how. Embedded images and the Twitter and Tumblr reshares are automatically linked to Getty’s specific short URLs, such as:
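As a rough sketch of how that analysis might work (Getty haven't published how they actually do it, so this is purely an illustration), here's a tally of Referer headers pulled from web server logs in Apache's "combined" format:

```python
# Illustration only: count Referer values from Apache combined-format logs
# to see which pages are embedding a given image.
import re
from collections import Counter

LOG_PATTERN = re.compile(r'"[^"]*" \d+ \S+ "(?P<referer>[^"]*)"')

def top_referrers(lines):
    counts = Counter()
    for line in lines:
        m = LOG_PATTERN.search(line)
        if m and m.group("referer") not in ("", "-"):
            counts[m.group("referer")] += 1
    return counts.most_common()

logs = [
    '1.2.3.4 - - [10/Mar/2014:00:00:00 +0000] "GET /embed/81901686 HTTP/1.1" 200 1234 "http://example-blog.org/post/1" "Mozilla/5.0"',
    '1.2.3.5 - - [10/Mar/2014:00:00:01 +0000] "GET /embed/81901686 HTTP/1.1" 200 1234 "http://example-blog.org/post/1" "Mozilla/5.0"',
    '1.2.3.6 - - [10/Mar/2014:00:00:02 +0000] "GET /embed/81901686 HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
]
print(top_referrers(logs))  # [('http://example-blog.org/post/1', 2)]
```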
The number used in the short URL is also used in the expanded URL:
But the title text is just there for SEO, it can be changed to anything:
Ordinarily I’d be down on the use of a short URL, but in this case its role is more that of a permalink. Of course these short URLs have the same problem as Handles and PURLs in that people won’t ordinarily bookmark them. But, Que Sera Sera. As the Verge pointed out, these embedded iframes could end up depriving Web content of lead images if GettyImages decide to pull the plug on the embeds and they suddenly 404. But their credibility would suffer quite a bit from a decision like that. I think it’s important that they are encouraging the Web to rely on these URLs, and that they are putting their reputation on the line.
Of course lots of inbound links to those pages should do wonders for their PageRank. Plus, following that link allows you to purchase the image, explore other images by the photographer, related images in the GettyImages collection, as well as see some additional metadata about the photo: item number, rights, license type, original file dimensions, size, dots-per-inch. Some of this metadata is even expressed using RDFa (Facebook’s OpenGraph metadata) … which makes the lack of a Facebook share button even more interesting. In addition there is also some minimal use of schema.org HTML microdata for the search engines to nibble on. If you are curious, Google’s Structured Data Testing Tool provides a view on this metadata.
It seems like there’s an opportunity to express more information in RDFa or microdata, specifically the details about the original, as well as licensing/rights metadata. Oddly the RDFa doesn’t even mark up the author of the image, I suppose because Facebook’s OpenGraph doesn’t give a way of expressing it. They could start by marking up the author of the image, but what if Getty established photographer pages, so instead of Bob O’Connor linking to:
What if it linked to a vanity URL like:
This would be a perfect place to share links to author’s other social media accounts, a bio, their photographer friends, etc. I’m thinking of the sort of work that National Geographic are doing with their YourShot application, for example this Profile page for Bahareh Mohamadian.
The licensing restrictions and iframes around these images would have ordinarily turned me off. But given Getty’s market position in this space it’s completely understandable, and seems like a useful compromise for now. These landing pages are a perfect place to make more structured metadata available that could be used by integrating applications. Getty should invest in this real estate, not only for the Web, but also for data reuse across their enterprise. The landing pages are an example of just how influential Facebook and Google have been in promoting the use of metadata on the Web. Without them, I think it is safe to assume we wouldn’t have seen any structured metadata on these pages at all.
The news about OCLC’s Linked Data service circulated widely on Twitter yesterday. I’ve never been a big OCLC cheerleader, but the news really hit home for me. I’ve been writing in my rambling way about Linked Data here for about 6 years. Of course there are many others who’ve been at it much longer than I have … and in a way I think librarians and archivists feel a kinship with the effort because it is cooked into the DNA of how we think about the Web as an information space.
This new OCLC service struck me as an excellent development for the library Web community for a few reasons, that I thought I would quickly jot down:
it’s evolutionary: OCLC didn’t let the perfect be the enemy of the good. It’s great to hear links to VIAF, FAST, LCSH, etc. are planned. But you have to start somewhere, and there is already significant value in expressing the FRBR workset data they have as Linked Data on the Web for others to use. Also, the domain experiment.worldcat.org clearly reflects that this is an experiment … but they didn’t let anxiety about changing URLs prevent them from publishing what they can now. The future is longer than the past.
it’s snappy: I don’t know if they’ve written about the technical architecture they are using, but the views are quite responsive. Of course I have no idea what kind of load it is under, but so far so good. Update: Ron Buckley of OCLC let me know the service is built on top of a shared Apache HBase Hadoop cluster.
schema.org: OCLC has the brains and the market position to create their own vocabulary for bibliographic data. But they worked hard at engaging openly with the Web community to help clarify and adapt the Schema.org vocabulary so that it can be used by our community. There is lots of thrashing going on in this space at the moment, and OCLC is being a great model in trying to work with the Web we have, and iterating to make it better, instead of trying to take a quantum leap forward.
json-ld: JSON-LD has been cooking for a while, but it’s a brand new W3C standard for representing RDF as idiomatic JSON. RDF has been somewhat plagued in the past by esoteric and/or hard to understand representations. JSON-LD really seems to have hit a sweet-spot between the expressivity of RDF and the usability of the Web. It’s refreshing to see OCLC kicking JSON-LD’s tires.
Rubber, Meet Road
So how do you discover these Work URIs? Richard’s post led me to believe I could get them directly from the xID service using an ISBN. But I found it to be a two step process: first get any OCLC Number associated with an ISBN from xID, and then use the OCLC Number to get the Work Identifier from the xID service:
So for example, to discover the Work URI for Tim Berners-Lee’s Weaving the Web you first look up the ISBN:
which should yield:
"author": "Tim Berners-Lee with Mark Fischetti.",
"city": "San Francisco",
"ed": "1st ed.",
"title": "Weaving the Web : the original design and ultimate destiny of the World Wide Web by its inventor",
Then pick one of the OCLC Numbers (oclcnum) at random and use it to do an xID call:
Which should return:
You can then dig out the Work Identifier (owi), trim off the owi prefix, and put it on the end of a URL like:
or, if you want the JSON-LD without doing content negotiation:
This returns a chunk of JSON data that I won’t reproduce here, but do check it out.
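The owi-trimming step can be sketched in a few lines of Python. I'm working from sample response data rather than live HTTP calls here, and the owi value, OCLC number, and exact endpoint URLs in the comments are illustrative assumptions, not verified output:

```python
# Sketch of the two-step xID flow described above, operating on a
# hypothetical response. The real calls would look something like:
#   step 1: http://xisbn.worldcat.org/webservices/xid/isbn/{isbn}?method=getMetadata&format=json
#   step 2: http://xisbn.worldcat.org/webservices/xid/oclcnum/{oclcnum}?method=getMetadata&format=json

def work_uri(xid_oclcnum_response):
    """Dig the owi out of an xID oclcnum response and build a Work URI."""
    record = xid_oclcnum_response["list"][0]
    owi = record["owi"][0]        # e.g. "owi12345"
    work_id = owi[len("owi"):]    # trim off the "owi" prefix
    return "http://experiment.worldcat.org/entity/work/data/%s" % work_id

# Hypothetical response shaped like what xID returned at the time:
response = {"stat": "ok", "list": [{"oclcnum": ["41238513"], "owi": ["owi12345"]}]}
uri = work_uri(response)
print(uri)              # http://experiment.worldcat.org/entity/work/data/12345
print(uri + ".jsonld")  # append .jsonld to skip content negotiation
```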
Update: After hitting publish on this blog post I’ve corresponded a bit with Stephan Schindehette at OCLC and Alf Eaton about some inconsistencies in my blog post (which I’ve fixed), and uncertainty about what the xID API should be returning. Hopefully xID can be updated to return the OCLC Work Identifier when you lookup by ISBN. I’ll update this blog post if I am notified of a change.
One bit of advice that I was given by Dave Longley on the #json-ld IRC channel, which I will pass along to OCLC, is that it might be better to use CURIE-less properties, e.g.
name instead of schema:name, but I think it might make sense to also reference an external context document and cut down on the size of the JSON-LD document even more.
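To illustrate what Longley's suggestion looks like in practice (the identifiers below are made up, not OCLC's actual data), here's the same statement with a CURIE key versus a CURIE-less key plus a context mapping:

```python
# Rough illustration of CURIE vs. CURIE-less JSON-LD keys.
# Both documents expand to the same triple:
#   <http://www.worldcat.org/oclc/41238513> <http://schema.org/name> "Weaving the Web"
import json

with_curie = {
    "@context": {"schema": "http://schema.org/"},
    "@id": "http://www.worldcat.org/oclc/41238513",
    "schema:name": "Weaving the Web",
}

# CURIE-less: the bare term is defined in the context. Pointing @context
# at an external context document instead would shrink the payload further.
curie_less = {
    "@context": {"name": "http://schema.org/name"},
    "@id": "http://www.worldcat.org/oclc/41238513",
    "name": "Weaving the Web",
}

print(json.dumps(curie_less, indent=2))
```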
It’s wonderful to see that the data is being licensed ODC-BY, but maybe assertions to that effect should be there in the data as well? I think schema.org have steered clear of licensing properties, but cc:license seems like a reasonable property to use, assuming it’s used with the right subject URI.
And one last tiny suggestion I have is that it would be nice to see the service mainstreamed into other parts of OCLC’s website. But I understand all too well the divides between R&D and production … and how challenging it can be to integrate them sometimes, even in the simplest of ways.
I recently ran across Digital Storage: Losing Our Virtual Heritage over on the (surprisingly interesting) SAA Archives & Archivists discussion list. Strangely, the editorial struck me as both emblematic of a problem in the archival community, and a guidepost for how archives need to move forward.
The key point Bromberg makes is that archives will no longer be able to function if our collections (specifically photographic collections) become digital:
Once I do basic work to care for my collections, I can put them on the shelf and pretty much not have to put any more money into their care. You cannot keep a digital file, however, without continually having to put money into it for the constant migration to new forms.
You have to buy new software and equipment, and pay for the labour to migrate them to be able to continue to get access to your images. Right now, I can just walk to the shelf and open a box to get access to my photographs.
This high cost of caring for digital files means that archives and museums which hold much of the world’s recorded history will most likely not be able to afford to care for them. We already have small budgets to care for our materials and that is unlikely to change.
Fear of format obsolescence is real and justified. But as David Rosenthal has been pointing out for a while, the shared information space of the Web, and its open-source viewers (browsers), have mitigated some of these concerns. We have yet to see evidence that prospective format migration actually helps preserve content. But our continued obsession with format migration, and describing resources so they can be migrated, is making the task of archiving digital content (like photographs) cost prohibitive, especially for smaller archives. Do we really think that billions of JPEGs are going to become unreadable overnight?
Bromberg’s piece contains a useful example:
I can get the Smith family photographs that Grandmother Smith put into a shoebox 50 years ago and forgot about it until her family cleaned out the house. I have packed up photo collections from families, businesses and organisations that contained images well over 100 years old that are perfectly fine. But if Grandmother Smith sticks some photo disks in her shoebox, by the time an archive gets them, they will be long gone.
Is it really useful for us to put our collective head in the sand and say that digital photography is going away? Or would we be better off helping photographers take care of their digital collections, so that when it comes time to donate them, they have a digital equivalent of a box of photographs to hand over? I’m reminded of the rich literature about the post-custodial archive where there is an emphasis on helping content owners manage their content, which in turn makes it easier to eventually transfer to an archive if desired. Personal Digital Archiving Day and DPLA’s Community program are good examples of this sort of effort.
I’m not suggesting that there isn’t work to do on this front. Bromberg is right. In our quest for the holy grail of digital preservation our content management systems have raised the bar way too high, for everyday people, and small libraries and archives to continue to do for digital content what they have for physical content, like photographs. To succeed I think archives and libraries need simple solutions that let them easily collect digital content, manage it, and let it feed into larger collections like DPLA.
By simple solutions I mean mostly a process that content owners and archivists can keep in their heads, that involves very little software, and mostly represents an investment in digital storage and backup systems in the same way that they have invested in physical space, containers, etc. I suspect many individuals and small archives already have storage solutions in operation for their business data, so this won’t be as big a leap as they imagine. But as long as we keep promulgating things like Fedora, DSpace etc as pre-requisites for doing real digital preservation Bromberg will be right.
To put it another way, we need a digital equivalent to the More Product, Less Process manifesto. This is the spirit that BagIt was created in at the Library of Congress. We needed to start processing an influx of digital content from NDIIPP partners, and we didn’t have the time, resources, or collective will to describe everything with METS, PREMIS, MODS and put it into Fedora or iRods, or whatever.
You can think of a Bag as a digital analog for a physical container. It’s just a directory with files in it, that includes a manifest, and some (optional) high level, human readable metadata. Certainly PREMIS, METS, etc can be layered on top of this, and we’ve done just that at LC with some of our internal systems … but BagIt helps with the absolute basics of bundling up data so that it can be moved through space and time.
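To give a sense of just how basic those basics are, here's a sketch of a minimal bag: a bagit.txt declaration, a data/ directory of payload files, and a checksum manifest. This is an illustration of the idea, not a replacement for a proper BagIt tool:

```python
# Minimal BagIt-style bag: bagit.txt, data/ payload, manifest-md5.txt.
# Illustrative sketch only -- a real tool handles tag files, validation, etc.
import hashlib, os, tempfile

def make_bag(bag_dir, files):
    """files: dict of relative path -> bytes payload."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    manifest = []
    for rel_path, payload in files.items():
        path = os.path.join(data_dir, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(payload)
        digest = hashlib.md5(payload).hexdigest()
        manifest.append("%s  data/%s" % (digest, rel_path))
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w") as f:
        f.write("\n".join(manifest) + "\n")

bag = tempfile.mkdtemp()
make_bag(bag, {"photo1.jpg": b"...jpeg bytes..."})
print(sorted(os.listdir(bag)))  # ['bagit.txt', 'data', 'manifest-md5.txt']
```

The whole convention fits in your head: copy the directory anywhere, re-run the checksums, and you know whether the payload survived the trip.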
So I’m not suggesting that people should start using BagIt to manage their digital photographs. I actually think BagIt could be simplified even more (look for more on that later). My point is rather that we need super simple solutions, like BagIt, that involve very little working software, and that everyday people and archives can use. We need to educate, and move forward. A new army of archivist computer scientists isn’t going to solve this problem. In a lot of ways, the computer savvy (but not Luddite) archivists we have are perfect for the job of educating artists, business people, and Grandma Smith since the solutions we give them need to fit inside their heads too. We are doing a disservice to them if we just say we can’t handle the new medium, and pine for the good old days. Or as my colleague Bill Lefurgy said in response to this post:
… archivists at smaller institutions also need to push beyond digital fear and build capacity, even if slowly
This is the text of a talk I gave at the (wonderful) National Digital Forum in Wellington, New Zealand on November 27th, 2013. You can also find my slides here, and the video here. If you do happen to watch the video, you’ll probably notice I spent more time thinking about the text than I did practicing my talk.
Hi there. Thanks for inviting me to NDF 2013, it is a real treat and honor to be here. I’d like to dedicate this talk to Aaron Swartz. Aaron cared deeply about the Web. In a heartbreaking way I think he may have cared more than he was able to. I’m not going to talk much about Aaron specifically, but his work and spirit underlie pretty much everything I’m going to talk about today. If there is one message that I would like you to get from my talk today it’s that we need to work together as professionals to care for the Web in the same way Aaron cared for it.
Next year it will be 25 years since Tim Berners-Lee wrote his proposal to build the World Wide Web. I’ve spent almost half of my life working with the technology of the Web. The Web has been good to me. I imagine it has been good to you as well. I highly doubt I would be standing here talking to you today if it wasn’t for the Web. Perhaps the National Digital Forum would not exist, if it was not for the Web. Sometimes I wonder if we need the Web to continue to survive as a species. It’s certainly hard for my kids to imagine a world without the Web. In a way it’s even hard for me to remember it. This is the way of media, slipping into the very fabric of experience. Today I’d like to talk to you about what it means to think about the Web as a preservation medium.
Medium and preservation are some pretty fuzzy, heavy words, and I’m not going to try to pin them down too much. We know from Marshall McLuhan that the medium is the message. I like this definition because it disorients more than it defines. McLuhan reminds us of how we are shaped by our media, just as we shape new forms of media. In her book Always Already New, Lisa Gitelman offers up a definition of media that gives us a bit more to chew on:
I define media as socially realized structures of communication, where structures include both technological forms and their associated protocols, and where communication is a cultural practice, a ritualized collocation of different people on the same mental map, sharing or engaged with popular ontologies of representation.
I like Gitelman’s definition because it emphasizes how important the social dimension is to our understanding of media. The affordances of media, how media are used by people to do things, and how media does things to us, are just as important as the technical qualities of media. In the spirit of Latour she casts media as a fully fledged actor, not as some innocent bystander or tool to be used by the real and only actors, namely people.
When Matthew Oliver wrote to invite me to speak here he said that in recent years NDF had focused on the museum, and that there was some revival of interest in libraries. The spread of the Web has unified the cultural heritage sector, showing how much libraries, archives and museums have in common, despite their use of subtly different words to describe what they do. I think preservation is a similar unifying concept. We all share an interest in keeping the stuff (paintings, sculptures, books, manuscripts, etc) around for another day, so that someone will be able to see it, use it, cite it, re-interpret it.
Unlike the traditional media we care for, the Web confounds us all equally. We’ve traditionally thought of preservation and access as different activities, that often were at odds with each other. Matthew Kirschenbaum dispels this notion:
… the preservation of digital objects is logically inseparable from the act of their creation – the lag between creation and preservation collapses completely, since a digital object may only ever be said to be preserved if it is accessible, and each individual access creates the object anew. The .txtual Condition
Or, as my colleague David Brunton has said, in a McLuhan-esque way:
Digital preservation is access…in the future.
The underlying implication here is that if you are not providing meaningful access in the present to digital content, then you are not preserving it.
In light of these loose definitions I’m going to spend the rest of the time exploring what the Web means as a preservation medium by telling some stories. I’m hoping that they will help illuminate what preservation means in the context of the Web. By the end I hope to convince you of two things: the Web needs us to care for it, and more importantly, we need the Web to do our jobs effectively. For those of you who don’t need convincing about either of these points, I hope to give you a slightly different lens for looking at preservation and the Web. It’s a hopeful and humanistic lens, that is informed by thinking about the Web as an archive. But more on that later.
Everything is Broken
Even the casual user of the Web has run across the problem of the 404 Not Found. In a recent survey of Web citations found in Thomson Reuters’ Web of Science, Hennessey and Ge found that only 69% of the URLs were still available, and the median lifetime for a URL was 9.3 years. The Internet Archive had archived 62% of these URLs. In a similar study of URLs found in recent papers in the popular arXiv pre-print repository Sanderson, Phillips and Van de Sompel found that of the 144,087 unique URLs referenced in papers, only 70% were still available and of these, 45% were not archived in the Internet Archive, Web Citation, the Library of Congress or the UK National Archive.
Bear in mind, this isn’t the World Wild Web of dot com bubbles, failed business plans, and pivots we’re talking about. These URLs were found in a small pocket of the Web for academic research, a body of literature that is built on a foundation of citation, and written by practitioners whose very livelihood is dependent on how they are cited by others.
A few months ago the 404 made mainstream news in the US when Adam Liptak’s story In Supreme Court Opinions, Web Links to Nowhere broke in the New York Times. Liptak’s story spotlighted a recent study by Zittrain and Albert which found that 50% of links in United States Supreme Court opinions were broken. As its name suggests, the Supreme Court is the highest federal court in the United States…it is the final interpreter of our Constitution. These opinions in turn document the decisions of the Supreme Court, and have increasingly referenced content on the Web for context, which becomes important later for interpretation. 50% of the URLs found in the opinions suffered from what the authors call reference rot. Reference rot includes situations of link rot (404 Not Found and other HTTP level errors), but it also includes when the URL appears to technically work, but the content that was cited is no longer available. The point was dramatically and humorously illustrated by the New York Times since someone had bought one of the lapsed domains and put up a message for Justice Alito:
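The distinction between link rot and the broader reference rot can be sketched as a simple classification: a link rots when the HTTP request fails outright, while a reference rots when the URL still resolves but the content has drifted from what was cited. Comparing byte-level hashes, as below, is a deliberately crude proxy for "drift" (real detection has to tolerate trivial page changes), but it shows the shape of the problem:

```python
# Sketch: classify a re-fetched citation as ok, link rot, or reference rot.
# Byte-exact hashing is a crude stand-in for real content-drift detection.
import hashlib

def classify(status_code, cited_digest, current_body):
    if status_code >= 400:
        return "link rot"        # 404s and other HTTP-level failures
    if hashlib.sha256(current_body).hexdigest() != cited_digest:
        return "reference rot"   # URL works, but the cited content is gone
    return "ok"

cited = hashlib.sha256(b"the page as originally cited").hexdigest()
print(classify(404, cited, b""))                              # link rot
print(classify(200, cited, b"a domain squatter's message"))   # reference rot
print(classify(200, cited, b"the page as originally cited"))  # ok
```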
Zittrain and Albert propose a new web archiving project called perma.cc, which relies on libraries to select web pages and websites that need to be archived. As proposed perma.cc would be similar in principle to WebCite, which is built around submission of URLs by scholars. But WebCite’s future is uncertain due to a fund drive to raise money to support its operation. perma.cc also has the potential to offer a governance structure similar to how cultural heritage organizations support the Internet Archive in their crawls of the Web.
Internet Archive was started by Brewster Kahle in 1996. It now contains 366 billion web pages or captures (not unique URLs). In 2008 Google Engineers reported that their index contained 1 trillion unique URLs. That’s 5 years ago now. If we assume it hasn’t grown since then, and overlook the fact that there are often multiple captures of a given URL over time, Internet Archive contains about 37% of the Web. This is overly generous since the Web has almost certainly grown in the past 5 years, and we’re comparing apples and oranges, web captures to unique URLs.
Of course, it’s not really fair (or prudent) to put the weight of preserving the Web on one institution. So thankfully, the Internet Archive isn’t alone. The International Internet Preservation Consortium is a member organization made up of national libraries, universities, and other organizations that do Web archiving. The National Library of New Zealand is a member, and has its own Web archive. According to the Wikipedia article listing Web archiving initiatives, the archive comprises 346 million URLs. Perhaps someone in the audience has a rough idea of how big this is relative to the size of the Kiwi Web. It’s a bit of a technical problem even to identify national boundaries on the Web. Since the National Library of New Zealand Act of 2003, the National Library has been authorized to crawl the New Zealand portion of the Web. In this regard, New Zealand is light years ahead of the United States, which is still required by law to ask for permission to collect selected, non-governmental websites.
Protocols and tools for sharing the size and makeup of these IIPC collections are still lacking, but the Memento project spurred on some possible approaches out of necessity. For the Memento prototype to work they needed to collect the URL/timestamp combinations for all archived webpages. This turned out to be difficult both for the archives to share, and to aggregate in one place efficiently–and the moment it was done it was already out of date. David Rosenthal has some interesting ideas for aggregators to collect summary data from web archives, which can instead be used to provide hints about where a given URL may be archived. Hopefully we’ll see some development in this area, as it’s increasingly important that Web archives coordinate collection development more closely, to encourage diversity of approaches, and ensure that no one archive is a single point of failure.
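One way an archive could publish compact hints of its holdings is a Bloom filter over its archived URLs. To be clear, this is my own illustration of the summary-data idea, not necessarily the mechanism Rosenthal proposes; the appeal is that false positives are possible but misses are definitive, which is fine for hints:

```python
# Illustrative Bloom filter over archived URLs: an aggregator can check
# "might this archive hold a capture of URL X?" without the full index.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, hashes=4):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

hints = BloomFilter()
hints.add("http://www.bbc.co.uk/")
print(hints.might_contain("http://www.bbc.co.uk/"))    # True
print(hints.might_contain("http://example.org/gone"))  # almost certainly False
```

A few kilobytes of filter can summarize millions of URLs, which makes it cheap to ship and re-ship as the collection grows.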
Even when you consider the work of the International Internet Preservation Consortium, which adds roughly 75 billion URLs (also not unique) we still are only seeing 44% of the Web being archived. And of course this is a very generous guesstimate, since the 366 billion Internet Archive captures are not unique URLs: e.g. a given URL like the BBC homepage has been fetched 13,863 times between December 21, 1996 and November 14, 2013. And there is almost certainly overlap between the various IIPC web archives and the Internet Archive.
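The back-of-the-envelope arithmetic behind those percentages, spelled out (remembering that we're comparing captures to unique URLs, so these are generous upper bounds):

```python
# The rough coverage estimates above, in billions.
ia_captures = 366    # Internet Archive captures (not unique URLs)
iipc_captures = 75   # rough IIPC total (also not unique)
web_urls = 1000      # Google's 2008 count of unique URLs (1 trillion)

print(round(100 * ia_captures / web_urls))                    # 37
print(round(100 * (ia_captures + iipc_captures) / web_urls))  # 44
```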
The Archival Sliver
I am citing these statistics not to say the job of archiving the Web is impossible, or a waste of resources. Much to the contrary. I raise it here to introduce one of the archival lenses I want to encourage you to look at Web preservation through: Verne Harris’ notion of the archival sliver. Harris is a South African archivist, writer and director of the Archive at the Nelson Mandela Centre of Memory. He participated in the transformation of South Africa’s apartheid public records system, and got to see up close how the contents of archives are shaped by the power structures in which they are embedded. Harris’ ideas have a distinctly post-modern flavor, and contrast with positivist theories of the archive that assert that the archive’s goal is to reflect reality.
Even if archivists in a particular country were to preserve every record generated throughout the land, they would still have only a sliver of a window into that country’s experience. But of course in practice, this record universum is substantially reduced through deliberate and inadvertent destruction by records creators and managers, leaving a sliver of a sliver from which archivists select what they will preserve. And they do not preserve much.
I like Harris’ notion of the archival sliver, because he doesn’t see it as a cause for despair, but rather as a reason to celebrate the role that this archival sliver has in the process of social memory, and the archivist who tends to it.
The archival record … is best understood as a sliver of a sliver of a sliver of a window into process. It is a fragile thing, an enchanted thing, defined not by its connections to “reality,” but by its open-ended layerings of construction and reconstruction. Far from constituting the solid structure around which imagination can play, it is itself the stuff of imagination.
The First URL
So instead of considering the preservation of billions of URLs, let’s change tack a bit and take a look at the preservation of one, namely the first URL.
On April 30th, 1993 CERN made (in hindsight) the momentous decision to freely release the Web technology software that Tim Berners-Lee, Ari Luotonen and Henrik Nielsen created for making the first website. But 20 years later, that website was no longer available. To celebrate the 20th anniversary of the software release Dan Noyes from CERN led a project to bring the original website back online, at the same address using a completely different software stack. The original content was collected from a variety of places: some from the W3C, some from a 1999 backup of Tim Berners-Lee’s NeXT. While the content is how it looked then, the resurrected website isn’t running the original Web server software, it’s running a modern version of Apache.
In a lot of ways I think this work illustrates James Governor’s adage:
Applications are like fish, data is like wine. Only one improves with age.
As any old school LISP programmer will tell you, sometimes code is data and data is code. But it is remarkable that this 20 year old HTML still renders just fine in a modern Web browser. This is no accident, but is the result of thoughtful, just-in-time design that encouraged the evolvability, extensibility and customizability of the Web. I think we as a community still have lots to learn from the Web’s example, and lots more to import into our practices. More about HTML in a bit.
Now obviously this sort of attention can’t be paid to all broken URLs on the Web. But it seems like an interesting example of how an archival sliver of the Web was cared for, respected and valued. Despite popular opinion, the care for URLs is not something foreign to the Web. For example, let’s take a look at the idea of the permalink that was popularized by the blogging community. As you know, a blog is typically a stream of content. In 2000 Paul Bausch at Blogger came up with a way to assign URLs to individual posts in the stream. This practice is so ubiquitous now it’s difficult to see what an innovation it was at the time. As its name implies, the idea of the permalink is that it is stable over time, so that the content can be persistently referenced. Apart from longevity, permalinks have beneficial SEO characteristics: the more that people link to the page over time, the higher its PageRank, and the more people who will find it in search results.
A couple of years before the blogging community started talking about permalinks, Tim Berners-Lee wrote a short W3C design note entitled Cool URIs Don’t Change. In it he provides some (often humorously snarky) advice for thinking about URLs and namespaces with an eye to the future. One of Berners-Lee’s great insights was to allow any HTML document to link to any other HTML document, without permission. This decision allowed the Web to grow in a decentralized fashion. It also means that links can break when pages drift apart, move to new locations, or disappear. But just because a link can break doesn’t mean that it must break.
The idea of Cool URIs saw new life in 2006 when Leo Sauermann and Richard Cyganiak began work on Cool URIs for the Semantic Web, which became a seminal document for the Linked Data movement. Their key insight was that identity (URLs) matters on the Web, especially when you are trying to create a distributed database like the Semantic Web.
Call them permalinks or Cool URIs, the idea is the same. Well-managed websites will be rewarded with more links to their content, improved SEO, and ultimately more users. But most of all they will be rewarded with a better understanding of what they are putting on the Web. Organizations, particularly cultural heritage organizations, should take note – especially their “architects”. Libraries, archives and museums need to become regions of stability on the Web, where URLs don’t capriciously fail because some exhibit is over, or some content management system is swapped out for another. This doesn’t mean content can’t change, move or even be deleted. It just means we need to know when we are doing it, and say where something has moved, or say that what was once there is now gone. If we can’t do it, the websites that do will become the new libraries and archives of the Web.
Clearly there is a space between large-scale projects to archive the entire Web and efforts to curate a particular website. Consider the work of ArchiveTeam, a volunteer organization formed in 2009 that keeps an eye on websites that are in danger of, or actually are, closing their doors and shutting down. Using their wiki, IRC chatrooms, and software tools they have built up a community of practice around archiving websites, which has covered some 60 sites, such as Geocities and Friendster. They maintain a page called the Death Watch where they list sites that are dying (pining for the fjords) or in danger of dying (pre-emptive alarm bells). These activist archivists run something called the Warrior, a virtual appliance you can install on a workstation, which gets instructions from the ArchiveTeam tracker about which URLs to download and coordinates the collection. The tracker then collects statistics that allow participants to see how much they have contributed relative to others. The collected data is packed up as WARC files and delivered to the Internet Archive, where it is reviewed by an administrator and added to their Web collection.
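The WARC files that Warriors deliver are conceptually simple: a sequence of records, each with a small header block followed by a payload. Here is a toy, stdlib-only sketch of a single record, just to show the shape of the format (real crawls write “response” records with full HTTP headers, using libraries like warcio; the function name here is my own):

```python
import uuid
from datetime import datetime, timezone

def warc_resource_record(target_uri, payload, content_type="text/html"):
    """Build a minimal WARC 1.0 'resource' record as bytes.

    Toy illustration only: real Web crawls capture 'response' records
    with the original HTTP headers intact.
    """
    body = payload.encode("utf-8") if isinstance(payload, str) else payload
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body)}",
    ])
    # a record is its headers, a blank line, the payload, then two CRLFs
    return headers.encode("ascii") + b"\r\n\r\n" + body + b"\r\n\r\n"

record = warc_resource_record("http://geocities.com/example", "<html>hello</html>")
```

A WARC file is just many such records concatenated (usually gzip-compressed), which is part of why the format has worked so well for moving crawls between volunteers and the Internet Archive.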
ArchiveTeam is a labor of love for its creator Jason Scott Sadofsky (aka Jason Scott), who is himself an accomplished documenter of computing history, with films such as BBS: The Documentary (early bulletin board systems), Get Lamp (interactive fiction) and DEFCON: The Documentary. Apart from mobilizing action, his regular talks have raised awareness about impermanence on the Web, and have connected with other like-minded Web archivists in a way that traditional digital preservation projects have struggled to. I suspect that this self-described “collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage” is the shape of things to come for the profession. ArchiveTeam isn’t the only group of activists archiving parts of the Web; let’s take a look at a few more examples.
Earlier this year Google announced that they were pulling Google Reader offline. Much grief and anger was vented from the blogging community…but it spurred one person into action. Mihai Parparita was an engineer who helped create Google Reader at Google; he no longer worked there, but wanted to do something to help people retain both their data and the experience of Google Reader. Because he felt that the data snapshots Google provided weren’t complete, he quickly put together the ReaderIsDead project, which is available on GitHub. ReaderIsDead is actually a collection of tools: one for pulling down your personal Reader data from Google while the Google Reader servers were still alive, and a simple web application called ZombieReader that serves that data back up once the Google Reader servers actually went dead. Mihai put his knowledge of how Google Reader talked to its backend Web service APIs to work in building ZombieReader.
Of course .com failures aren’t the only reason why content disappears from the Web. People intentionally remove content from the Web all the time for a variety of reasons. Let’s consider the strange, yet fascinating cases of Mark Pilgrim and Jonathan Gillette. Both were highly prolific software developers, bloggers, authors and well known spokespeople for open source and the Web commons.
Among other things, Mark Pilgrim was very active in the area of feed syndication technology (RSS, Atom). He wrote the feed validator and Universal Feed Parser, which makes working with syndicated feeds much easier. He also pushed the boundaries of technical writing with Dive Into HTML5 and Dive Into Python 3, which were published traditionally as books but also made available on the Web under a CC-BY license. Pilgrim also worked at Google, where he helped promote and evolve the Web through his involvement in the Web Hypertext Application Technology Working Group (WHATWG).
Jonathan Gillette, also known as Why the Lucky Stiff or _why, was a prolific writer, cartoonist, artist, and computer programmer who helped popularize the Ruby programming language. His online book Why’s (Poignant) Guide to Ruby introduced people of all ages to the practice of programming with wit and humor that will literally make you laugh out loud as you learn. Projects such as Try Ruby and Hackety Hack! lowered the barriers to getting a working software development environment set up by moving it to the Web. He also wrote a great deal of software, such as Hpricot for parsing HTML and the minimalist Web framework Camping.
Apart from all these similarities Mark Pilgrim and Jonathan Gillette share something else in common: on October 4, 2011 and August 19, 2009 respectively, they both decided to completely delete their online presence from the Web. They committed info-suicide. Their online books, blogs, social media accounts, and GitHub projects were simply removed. No explanations were made; they just blinked out of existence. They are still alive here in the physical world, but they aren’t participating online as they were previously…or at least not under the same personas. I like to think Pilgrim and _why were doing performance art to illustrate the fundamental nature of the Web: its nowness, its fragility, its impermanence. As Dan Connolly once said:
The point of the Web arch[itecture] is that it builds the illusion of a shared information space.
If someone decides to turn off a server or delete a website it’s gone for the entire world, and the illusion dissolves. Maybe it lives on buried in a Web archive, but its previous life out on the Web is over. Or is it?
It’s interesting to see what happened after the info-suicides. Why’s (Poignant) Guide to Ruby was rescued by Mislav Marohnic, a software developer living in Croatia. He was able to piece the book back together from content in the Internet Archive, and put it back online at a new URL, as if nothing had happened. In addition he has continued to curate it: updating code samples to work with the latest version of Ruby, enabling syntax highlighting, converting it to Markdown, and more.
Similarly, Mark Pilgrim’s Dive Into HTML5 and Dive Into Python 3 were assembled from copies and redeployed to the Web. Prior to his departure Pilgrim used GitHub to manage the content for his books. Git, the revision control system underlying GitHub, is distributed: everyone working with the code has a full copy of it on their own machine. So rather than needing to get the content out of the Internet Archive, developers created the diveintomark organization account on GitHub and pushed their clones of the original repositories there.
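The mechanics here are worth spelling out, because they are what made the rescue almost trivial: every Git clone carries the entire history, so any surviving clone can seed a new upstream. A minimal sketch, staged entirely with local, made-up repositories rather than the actual ones:

```shell
# Sketch: resurrecting a deleted upstream from a surviving clone.
# All repository names and paths here are illustrative.
set -e
work=$(mktemp -d) && cd "$work"

# an "upstream" repository that will later be deleted
git init -q upstream && cd upstream
git config user.email author@example.com && git config user.name author
echo 'chapter 1' > book.txt && git add book.txt && git commit -qm 'first commit'
cd ..

git clone -q upstream surviving-clone   # someone, somewhere, keeps a full copy
rm -rf upstream                         # the info-suicide: upstream vanishes

# any clone contains the whole history, so it can become the new upstream
git init -q --bare resurrected.git
cd surviving-clone
git remote set-url origin "$work/resurrected.git"
git push -q origin --all
git log --oneline                       # the full history survives in its new home
```

In the real cases the final `push` went to new GitHub repositories under the diveintomark organization, but the principle is the same: deletion of the “master” copy is not fatal when the history is fully replicated.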
Much of _why’s and Pilgrim’s code was also developed on GitHub. So even though the master copies were deleted, many people had clones, and were able to work together to establish new masters. Philip Cromer created whymirror on GitHub, which collected _why’s code. Jeremy Ruten created _why’s Estate, a hypertext archive that collects pointers to the various software archives and writings that have been preserved in the Internet Archive and elsewhere.
So, it turns out that the supposedly brittle medium of the Web, where a link can easily break and a whole website can be capriciously turned off, is a bit more persistent than we think. These events remind me of Matthew Kirschenbaum’s book Mechanisms, which deconstructs notions of electronic media being fleeting or impermanent to show how electronic media (especially that which is stored on hard drives) is actually quite resilient and resistant to change. Mechanisms contains a fascinating study of how William Gibson’s poem Agrippa (which was engineered to encrypt itself and become unreadable after a single reading) saw new life on the Internet, as it was copied around on FTP, USENET, email listservs, and ultimately the Web:
Agrippa owes its transmission and continuing availability to a complex network of individuals, communities, ideologies, markets, technologies, and motives … from its example we can see that the preservation of digital media has a profound social dimension that is at least as important as purely technical considerations. Hacking Agrippa
In the forensic spirit of Mechanisms, let’s take a closer look at Web technology, specifically HTML. Remember the first URL and how CERN was able to revive it? When you think about it, it’s kind of amazing that you can still look at that HTML in your modern browser, right? Do you think you could view your 20-year-old word processing documents today? Jeff Rothenberg cynically observed:
digital information lasts forever—or five years, whichever comes first
Maybe if we focus on the archival sliver instead of the impossibility of everything we’re not doing so bad.
As we saw in the cases of Pilgrim and _why, the Internet Archive and other Web archiving projects play an important role in snapshotting Web pages. But we are also starting to see social media companies building tools that allow their users to easily extract or “archive” their content. These tools use HTML in an interesting new way that is worth a closer look.
How many Facebook users are there here? How many of you have requested your archive? If you navigate to the right place in your settings you can “Download a copy of your Facebook data.” When you click the button you set in motion a process that gathers together your profile, contact information, wall, photos, synced photos, videos, friends, messages, pokes, events, settings, security and (ahem) ads. This takes Facebook a bit of time (it took a day the last time I tried it), and you get an email when it’s finished containing a link to download a zip file. The zip file contains HTML, JPEG and MP4 files which you can open in your browser. You don’t need to be connected to the Internet; everything is available locally.
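Part of the appeal of these archives is that they are ordinary files you can inspect with everyday tools, no API keys required. A small sketch using Python’s standard zipfile module (the file layout below is invented for illustration, not Facebook’s actual structure):

```python
import io
import zipfile

# Build a stand-in archive with a made-up layout, so we can show what
# working with a downloaded archive looks like to a script.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("index.html", "<html><body>Your archive</body></html>")
    zf.writestr("html/wall.html", "<html><body>Wall posts</body></html>")
    zf.writestr("photos/album1/photo1.jpg", b"\xff\xd8\xff")  # JPEG magic bytes

# Reading it back is just as simple: every page is a local file,
# viewable offline with no server and no login.
with zipfile.ZipFile(buf) as zf:
    pages = [name for name in zf.namelist() if name.endswith(".html")]
    print(pages)
```

That flat, self-contained quality is exactly what makes such archives attractive for accessioning into a collection: they can be opened in a browser today, or parsed by a script decades from now.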
Some of you may remember that the Data Liberation Front (led by Brian Fitzpatrick at Google) and its eventual product offering, Google Takeout, were early innovators in this area. Google Takeout allows you to download data from 14 of their products as a zip file. The service isn’t without criticism, because it doesn’t include things like your Gmail archive or your search history. The contents of the archive are also somewhat more difficult to work with than the Facebook and Twitter equivalents. For example, each Google+ update is represented as a single HTML file, and there isn’t a minimal, static application you can use to browse them. The HTML also references content out on the Web, and isn’t as self-contained as Twitter’s and Facebook’s archives. But having snapshots of your YouTube videos and the contents of your Google Drive is extremely handy. As Brian Fitzpatrick wrote in 2010, Google Takeout is kind of a remarkable achievement, or realization, for a big publicly traded behemoth to make:
Locking your users in, of course, has the advantage of making it harder for them to leave you for a competitor. Likewise, if your competitors lock their users in, it is harder for those users to move to your product. Nonetheless, it is far preferable to spend your engineering effort on innovation than it is to build bigger walls and stronger doors that prevent users from leaving. Making it easier for users to experiment today greatly increases their trust in you, and they are more likely to return to your product line tomorrow.
I mention Facebook, Twitter and Google here because I think these archiving services are important for memory institutions like museums, libraries and archives. They allow individuals to download their data from the huge corpus that is available – a sliver of a sliver of a sliver. When a writer or politician donates their papers, what if we accessioned their Facebook or Twitter archive? Dave Winer, for example, has started collecting Twitter archives that have been donated to him and meet certain criteria, and making them public. If we have decided to add someone’s papers to a collection, why not acquire their social media archives and store them along with their other born-digital and traditional content? Yes, Twitter (as a whole) is being archived by the Library of Congress, as so-called big data. But why don’t we consider these personal archives as small data, where context and original order are preserved with other relevant material in a highly usable way? As Rufus Pollock of the Open Knowledge Foundation said:
This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data.
So how to wrap up this strange, fragmented, incomplete tour through Web preservation? I feel like I should say something profound, but I was hoping these stories of the Web would do that for me. I can only say for myself that I want to give back to the Web the way it has given to me. With 25 years behind us the Web needs us more than ever to help care for the archival slivers it contains. I think libraries, museums and archives that realize that they are custodians of the Web, and align their mission with the grain of the Web, will be the ones that survive, and prosper. Brian Fitzpatrick, Jason Scott, Brewster Kahle, Mislav Marohnic, Philip Cromer, Jeremy Ruten and Aaron Swartz demonstrated their willingness to work with the Web as a medium in need of preservation, as well as a medium for doing the preservation. We need more of them. We need to provide spaces for them to do their work. They are the new faces of our profession.