JavaScript and Archives

Tantek Çelik has some strong words about the use of JavaScript in Web publishing, specifically regarding its accessibility and longevity:

… in 10 years nothing you built today that depends on JS for the content will be available, visible, or archived anywhere on the web

It is a dire warning. It sounds and feels true. I am in the middle of writing a webapp that happens to use React, so Tantek’s words are particularly sobering.

And yet, consider for a moment how Twitter makes personal downloadable archives available. When you request your archive you eventually get a zip file. When you unzip it, you open an index.html file in your browser, which provides you with a view of all the tweets you’ve ever sent.

If you take a look under the covers you’ll see it is actually a JavaScript application called Grailbird. If you have JavaScript turned on it looks something like this:

[Screenshot: the archive with JavaScript turned on.]

If you have JavaScript turned off it looks something like this:

[Screenshot: the archive with JavaScript turned off.]

But remember, this is a static site. There is no server-side piece. Everything is happening in your browser. You can disconnect from the Internet, and as long as your browser has JavaScript turned on it is fully functional. (Well, the avatar URLs break, but that could be fixed.) You can search across your tweets. You can drill into particular time periods. You can view your account summary. It feels pretty durable. I could stash it away on a hard drive somewhere, come back in 10 years, and (assuming there are still web browsers with a working JavaScript runtime) still look at it, right?

So is Tantek right about JavaScript being at odds with preservation of Web content? I think he is, but I also think JavaScript can be used in the service of archiving, and that there are starting to be some options out there that make archiving JavaScript-heavy websites possible.

The real problem Tantek is talking about is when human-readable content isn’t available in the HTML itself, but is loaded dynamically from Web APIs using JavaScript. This pattern started to get popular back in 2005, when Jesse James Garrett coined the term AJAX for building app-like web pages using asynchronous requests for XML (which is now mostly JSON). The scene has since exploded with all sorts of client-side JavaScript frameworks for building web applications, as opposed to web pages.
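To make the problem concrete, here is a minimal sketch of the pattern Tantek is warning about (the /api/posts endpoint is hypothetical). The HTML that a crawler fetches contains no content at all, just an empty element and a script that fills it in:

<div id="content"></div>
<script>
// none of the human readable content is in the HTML;
// it all arrives via this API request after the page loads
var xhr = new XMLHttpRequest();
xhr.open("GET", "https://example.com/api/posts");
xhr.onload = function() {
  var posts = JSON.parse(xhr.responseText);
  document.getElementById("content").innerHTML = posts.map(function(p) {
    return "<p>" + p.text + "</p>";
  }).join("");
};
xhr.send();
</script>

If that API ever goes away, a preserved copy of this HTML plays back as an empty page.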

So if someone (e.g. the Internet Archive) comes along and tries to archive a URL, it will get the HTML and the associated images, stylesheets and JavaScript files that are referenced in that HTML. These will get saved just fine. But when the content is played back later (e.g. in the Wayback Machine) the JavaScript will run and try to talk to those external Web APIs to load content. If those APIs no longer exist, the content won’t load.

One solution to this problem is for the web archiving process to execute the JavaScript and archive whatever dynamic content was retrieved. This can be done using headless browsers like PhantomJS, and supposedly Google has started executing JavaScript when it crawls. Like Tantek, I’m dubious about how widely Google executes JavaScript; I’ve had trouble getting it to index a JavaScript-heavy site that I’ve inherited at work. But even if the crawler does execute the JavaScript, user interactions can cause different content to load. So does the bot start clicking around in the application to get content to load? This is yet more work for an archiving bot to do, and could potentially result in write operations, which might not be great.
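As a rough sketch of what that looks like, here is a PhantomJS script that saves the DOM as it exists after the page’s JavaScript has run (the URL is a stand-in, and the 3 second wait for asynchronous requests to settle is a guess):

// render.js: run with `phantomjs render.js`
var page = require('webpage').create();
page.open('http://example.com/js-heavy-page', function(status) {
  if (status === 'success') {
    // give the page's API calls some time to finish loading content
    window.setTimeout(function() {
      // page.content is the DOM as rendered, not the original empty HTML
      console.log(page.content);
      phantom.exit();
    }, 3000);
  } else {
    phantom.exit(1);
  }
});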

Another option is to change, or at least augment, the current web archiving paradigm by adding curator-driven web archiving to the mix. The best examples I’ve seen of this are Ilya Kreymer’s work on pywb and pywb-recorder. Ilya is a former Internet Archive engineer, and is well aware of the limitations of the most common forms of web archiving today. pywb is a new playback tool for web archives, and pywb-recorder is a new recording environment. The two work in concert to let archivists interactively select web content that needs to be archived, and then play that content back. The best example of this is his demo service webrecorder.io, which combines pywb and pywb-recorder so that anyone can create a web archive of a highly dynamic website, download the WARC archive file, and then upload it again for playback.

The nice thing about Ilya’s work is that it is geared at archiving this JavaScript-heavy content. Rhizome and the New Museum in New York City have started working with Ilya to use pywb to archive highly dynamic Web content. I think this represents a possible bright future for archives, where curators or archivists are more part of the equation, and where Web archives are more distributed, not concentrated at just the Internet Archive and some major national libraries. I think the work Genius is doing to annotate the Web, and archived versions of the Web, is in a similar space. It’s exciting times for Web archiving. You know, exciting if you happen to be an archivist and/or archiving things.

At any rate, getting back to Tantek’s point about JavaScript: if you are in the business of building archives on the Web, definitely think twice about using client-side JavaScript frameworks. If you do use them, make sure your site degrades gracefully, so that the majority of the content is still available without JavaScript (see the sketch below). You want to make it easy for the Internet Archive to archive your content (lots of copies keeps stuff safe) and you want to make it easy for Google et al to index it, so people looking for your content can actually find it. Stanford University’s Web Archiving team has a super set of pages describing the archivability of websites. We can’t control how other people publish on the Web, but I think as archivists we have a responsibility to think about these issues as we create archives on the Web.
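For what it’s worth, here is a minimal sketch of what that kind of graceful degradation can look like: the content ships in the HTML itself, and JavaScript only layers behavior on top of it.

<article id="post">
  <h1>My Post</h1>
  <p>The full text is right here in the HTML, so crawlers, archives,
  and browsers with JavaScript turned off all still get it.</p>
</article>
<script>
// JavaScript enhances what is already on the page rather than creating it
document.getElementById("post").className += " enhanced";
</script>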

NYPL’s Building Inspector

You probably already saw the news about NYPL’s Building Inspector that was released yesterday. If you haven’t, definitely check it out…it’s a beautiful app. I hope Building Inspector represents the shape of things to come for engagement with the Web by cultural heritage institutions.

I think you’ll find that it is strangely addictive. This is partly because you get to zoom in on random closeups of historic NYC maps, which is like candy if you are a map junkie, or have spent any time in the city. More importantly, you get the feeling that you are helping NYPL build and enrich a dataset for further use. I guess you could say it’s gamification, but it feels more substantial than that.

Building Inspector hits a sweet spot for a few reasons:

  1. It has a great name. Building Inspector describes the work you will be doing, and contextualizes the activity with a profession you may already be familiar with.
  2. It opens with some playful yet thoughtfully composed instructions that describe how to do the inspection. The instructions aren’t optional, but can easily be dismissed. They are fun while still communicating essential facts about what you are going to be doing.
  3. There is an easy way to review the work you’ve done so far by clicking on the View Progress link. You use your Twitter, Facebook or Google account to log in. It would be cool to be able to see progress from a global view: everyone’s edits, in realtime perhaps.
  4. The app is very responsive, displaying new parts of the map with sub-second response times.
  5. The webapp looks and works great as a mobile app. I’d love to hear more about how they did this, since they don’t appear to be using anything like Twitter Bootstrap to help. The mobile experience might be improved a little bit if you could zoom and pan with touch gestures.
  6. It uses LeafletJS. I’ve done some very simplistic work with Leaflet in the past, so it is good to see that it can be customized this much.
  7. NYPL is embracing the cloud. Building Inspector is deployed on Heroku, with map tiles on Amazon’s CloudFront. This isn’t a big deal for lots of .com properties, but for libraries (even big research libraries like NYPL) I reckon it is a bigger deal than you might suspect.
  8. The code for the truly hard part, recognizing the outlines of buildings with OpenCV and other tools, has been made available by NYPL on GitHub for other people to play around with.

Another really fun thing about the way this app was put together was its release: the (apparent) coordination with an article at Wired, and the subsequent follow up on the nypl_labs Twitter account.

[Tweets from @nypl_labs at 6:35 AM, 7:22 AM, 10:12 AM, and 6:43 PM, reporting the running count of inspections on launch day.]

Or in other words:

[Chart: Building Inspector inspections over the course of launch day.]

Quite a first day! It would be interesting to know what portion of the work this represents. Also, I’d be curious to see if NYPL is able to sustain this level of engagement to get the work done.

Day 2 Update

[Tweets from @nypl_labs at 2:22 PM and 4:07 PM, reporting updated inspection counts.]

If I’m doing the math right (double check me if you really care), between those two data points there were 6,499 inspections over 6,300 seconds (1 hour and 45 minutes), which works out to an average of 1.03 inspections/second. Whereas between the third and fourth data points yesterday (10:12 AM to 6:43 PM) it looks like they had an average of 1.91 inspections/second.

[Chart: inspections over days 1-2.]

Day 3 Update

[Chart: inspections over days 1-3.]

maps on the web with a bit of midlife crisis

TL;DR — I created a JavaScript library for getting GeoJSON out of Wikipedia’s API in your browser (and Node.js). I also created a little app that uses it to display Wikipedia articles for things near you that need a photograph/image or editorial help.


I probably don’t need to tell you how much the state of mapping on the Web has changed in the past few years. I was there. I can remember trying to get MapServer set up in the late 1990s, with limited success. I was there squinting at how Adrian Holovaty reverse engineered a mapping API out of Google Maps at chicagocrime.org. I was there when Google released their official API, which I used some, and then they changed their terms of service. I was there in the late 2000s using OpenLayers and TileCache, which were so much more approachable than MapServer was a decade earlier. I’m most definitely not a mapping expert, or even an amateur–but you can’t be a Web developer without occasionally needing to dabble, and pretend you are.

I didn’t realize until very recently how easy the cool kids have made it to put maps on the Web. Who knew that in 2013 there would be an open source JavaScript library that lets you add a map to your page in a few lines, and that it’s in use by Flickr, FourSquare, CraigsList, Wikimedia, the Wall Street Journal, and others? Even more astounding: who knew there would be an openly licensed source of map tiles and data, that was created collaboratively by a project with over a million registered users, and that it would be good enough to be used by Apple? I certainly didn’t even dream about it.
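That library is Leaflet (which comes up again below), and "a few lines" is barely an exaggeration. A minimal sketch, assuming a page with a div whose id is map:

// a working slippy map, centered on Brooklyn, using OpenStreetMap tiles
var map = L.map('map').setView([40.67, -73.94], 13);
L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
  attribution: 'Map data © OpenStreetMap contributors'
}).addTo(map);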

Ok, hold that thought…

So, Wikipedia recently announced that they were making it easy to use your mobile device to add a photograph to a Wikipedia article that lacked an image.

When I read about this I thought it would be interesting to see what Wikipedia articles there are about my current location, and which lacked images, so I could go and take pictures of them. Before I knew it I had a Web app called ici (French for here) that does just that:

Articles that need images are marked with little red cameras. It was also pretty easy to add orange markers for Wikipedia articles that have been flagged as needing edits or citations. Calling it an app is an overstatement: it is just static HTML, JavaScript and CSS that I serve up. HTML’s geolocation features and Wikipedia’s API (which has GeoData enabled) take care of the rest.
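The geolocation piece really is just a few lines. Here is a simplified sketch of the sort of thing ici does, using the geojson function from wikigeo.js that is described below (and which grew out of exactly this code):

// ask the browser where we are, then map nearby Wikipedia articles
navigator.geolocation.getCurrentPosition(function(position) {
  geojson([position.coords.longitude, position.coords.latitude],
    function(data) {
      L.geoJson(data).addTo(map);  // the features are nearby articles
    });
});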

After I created the app I got a tweet from a real geo-hacker, Sean Gillies, who asked:

https://twitter.com/sgillies/status/332185543234441216

Sean is right, it would be really useful to have GeoJSON output from Wikipedia’s API. But I was on a little bit of a tear, so rather than figuring out how to get GeoJSON into MediaWiki and deployed to all the Wikipedia servers, I wondered if I could extract ici’s use of the Wikipedia API into a slightly more generalized JavaScript library that would make it easy to get GeoJSON out of Wikipedia, at least from JavaScript. That quickly resulted in wikigeo.js, which is now being used in ici. Getting GeoJSON from Wikipedia with wikigeo.js takes just one line, and adding that GeoJSON to a Leaflet map takes only one more:

geojson([-73.94, 40.67], function(data) {
    // add the geojson to a Leaflet map
    L.geoJson(data).addTo(map)
});

This call results in the callback getting some GeoJSON data that looks something like this:

{
  "type": "FeatureCollection",
  "features": [
    {
      "id": "http://en.wikipedia.org/wiki/New_York_City",
      "type": "Feature",
      "properties": {
        "name": "New York City"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.94,
          40.67
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Kingston_Avenue_(IRT_Eastern_Parkway_Line)",
      "type": "Feature",
      "properties": {
        "name": "Kingston Avenue (IRT Eastern Parkway Line)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9422,
          40.6694
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Crown_Heights_–_Utica_Avenue_(IRT_Eastern_Parkway_Line)",
      "type": "Feature",
      "properties": {
        "name": "Crown Heights – Utica Avenue (IRT Eastern Parkway Line)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9312,
          40.6688
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Brooklyn_Children's_Museum",
      "type": "Feature",
      "properties": {
        "name": "Brooklyn Children's Museum"
      },
"geometry": {
        "type": "Point",
        "coordinates": [
          -73.9439,
          40.6745
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/770_Eastern_Parkway",
      "type": "Feature",
      "properties": {
        "name": "770 Eastern Parkway"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9429,
          40.669
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Eastern_Parkway_(Brooklyn)",
      "type": "Feature",
      "properties": {
        "name": "Eastern Parkway (Brooklyn)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9371,
          40.6691
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Paul_Robeson_High_School_for_Business_and_Technology",
      "type": "Feature",
      "properties": {
        "name": "Paul Robeson High School for Business and Technology"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.939,
          40.6755
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Pathways_in_Technology_Early_College_High_School",
      "type": "Feature",
      "properties": {
        "name": "Pathways in Technology Early College High School"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.939,
          40.6759
        ]
      }
    }
  ]
}

There are options for broadening the radius, increasing the number of results, and fetching additional properties of the Wikipedia articles, such as summaries, images, categories, and templates used. Here’s an example using all the knobs:

geojson(
  [-73.94, 40.67],
  {
    limit: 5,
    radius: 1000,
    images: true,
    categories: true,
    summaries: true,
    templates: true
  },
  function(data) {
    L.geoJson(data).addTo(map)
  }
);

Which results in GeoJSON like this (abbreviated):

{
  "type": "FeatureCollection",
  "features": [
    {
      "id": "http://en.wikipedia.org/wiki/Silver_Spring,_Maryland",
      "type": "Feature",
      "properties": {
        "name": "Silver Spring, Maryland",
        "image": "Downtown_silver_spring_wayne.jpg",
        "templates": [
          "-",
          "Abbr",
          "Ambox",
          "Ambox/category",
          "Ambox/small",
          "Basepage subpage",
          "Both",
          "Category handler",
          "Category handler/blacklist",
          "Category handler/numbered"
        ],
        "summary": "Silver Spring is an unincorporated area and census-designated place (CDP) in Montgomery County, Maryland, United States. It had a population of 71,452 at the 2010 census, making it the fourth most populous place in Maryland, after Baltimore, Columbia, and Germantown.\nThe urbanized, oldest, and southernmost part of Silver Spring is a major business hub that lies at the north apex of Washington, D.C. As of 2004, the Central Business District (CBD) held 7,254,729 square feet (673,986 m2) of office space, 5216 dwelling units and 17.6 acres (71,000 m2) of parkland. The population density of this CBD area of Silver Spring was 15,600 per square mile all within 360 acres (1.5 km2) and approximately 2.5 square miles (6 km2) in the CBD/downtown area. The community has recently undergone a significant renaissance, with the addition of major retail, residential, and office developments.\nSilver Spring takes its name from a mica-flecked spring discovered there in 1840 by Francis Preston Blair, who subsequently bought much of the surrounding land. Acorn Park, tucked away in an area of south Silver Spring away from the main downtown area, is believed to be the site of the original spring.\n\n",
        "categories": [
          "All articles to be expanded",
          "All articles with dead external links",
          "All articles with unsourced statements",
          "Articles to be expanded from June 2008",
          "Articles with dead external links from July 2009",
          "Articles with dead external links from October 2010",
          "Articles with dead external links from September 2010",
          "Articles with unsourced statements from February 2007",
          "Articles with unsourced statements from May 2009",
          "Commons category template with no category set"
        ]
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -77.019,
          39.0042
        ]
      }
    },
    ...
  ]
}

I guess this is a long way of saying: if you want to put Wikipedia articles on a map, or otherwise need GeoJSON for Wikipedia articles near a particular location, take a look at wikigeo.js. If you do, and have ideas for making it better, please let me know. Oh, and by the way, you can npm install wikigeo and use it from Node.js.
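From Node.js that looks something like this (a sketch, assuming the module exports the same geojson function used above):

// npm install wikigeo
var geojson = require('wikigeo').geojson;

geojson([-73.94, 40.67], function(data) {
  console.log(JSON.stringify(data, null, 2));
});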

I guess JavaScript, HTML5, NodeJS, CoffeeScript are like my midlife crisis…my red sports car. But maybe being the old guy, and losing my edge isn’t really so bad?

I’m losing my edge
to better-looking people
with better ideas
and more talent
and they’re actually
really, really nice.
James Murphy

It definitely helps when the kids coming up from behind have talent and are really, really nice. You know?