Category Archives: wikipedia

maps on the web with a bit of midlife crisis

TL;DR — I created a JavaScript library for getting GeoJSON out of Wikipedia’s API in your browser (and Node.js). I also created a little app that uses it to display Wikipedia articles for things near you that need a photograph/image or editorial help.


I probably don’t need to tell you how much the state of mapping on the Web has changed in the past few years. I was there. I can remember trying to get MapServer set up in the late 1990s, with limited success. I was there squinting at how Adrian Holovaty reverse engineered a mapping API out of Google Maps at chicagocrime.org. I was there when Google released their official API, which I used some, and then they changed their terms of service. I was there in the late 2000s using OpenLayers and TileCache, which were so much more approachable than MapServer was a decade earlier. I’m most definitely not a mapping expert, or even an amateur–but you can’t be a Web developer without occasionally needing to dabble, and pretend you are.

I didn’t realize until very recently how easy the cool kids have made it to put maps on the Web. Who knew that in 2013 there would be an open source JavaScript library that lets you add a map to your page in a few lines, and that it’s in use by Flickr, Foursquare, Craigslist, Wikimedia, the Wall Street Journal, and others? Even more astounding: who knew there would be an openly licensed source of map tiles and data, created collaboratively by a project with over a million registered users, and that it would be good enough to be used by Apple? I certainly didn’t even dream about it.

Ok, hold that thought…

So, Wikipedia recently announced that they were making it easy to use your mobile device to add a photograph to a Wikipedia article that lacked an image.

When I read about this I thought it would be interesting to see what Wikipedia articles there are about my current location, and which lacked images, so I could go and take pictures of them. Before I knew it I had a Web app called ici (French for here) that does just that:

Articles that need images are marked with little red cameras. It was pretty easy to add orange markers for Wikipedia articles that had been flagged as needing edits, or citations. Calling it an app is an overstatement: it is just static HTML, JavaScript and CSS that I serve up. HTML’s geolocation features and Wikipedia’s API (which has GeoData enabled) take care of the rest.
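
As a rough illustration of the moving parts, here is a minimal sketch of how a page could build a GeoData query URL for a position. The `list=geosearch` module and its `gscoord`, `gsradius` and `gslimit` parameters belong to the MediaWiki GeoData API; the helper name `geosearchUrl` and its default values are hypothetical and are not taken from ici.

```javascript
// Hypothetical helper (not from ici): build a GeoData "geosearch" query URL
// for the English Wikipedia API.
function geosearchUrl(lat, lon, options) {
  var opts = options || {};
  var params = {
    action: "query",
    list: "geosearch",
    gscoord: lat + "|" + lon,       // GeoData expects "lat|lon"
    gsradius: opts.radius || 1000,  // metres (assumed default)
    gslimit: opts.limit || 10,      // assumed default
    format: "json"
  };
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
  }).join("&");
  return "https://en.wikipedia.org/w/api.php?" + query;
}
```

In a browser the latitude and longitude would typically come from `navigator.geolocation.getCurrentPosition`, whose callback receives a position with `coords.latitude` and `coords.longitude`.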

After I created the app I got a tweet from a real geo-hacker, Sean Gillies, who asked:

Sean is right, it would be really useful to have a GeoJSON output from Wikipedia’s API. But I was on a little bit of a tear, so rather than figuring out how to get GeoJSON into MediaWiki and deployed to all the Wikipedia servers I wondered if I could extract ici’s use of the Wikipedia API into a slightly more generalized JavaScript library, that would make it easy to get GeoJSON out of Wikipedia–at least from JavaScript. That quickly resulted in wikigeo.js which is now getting used in ici. Getting GeoJSON from Wikipedia using wikigeo.js is done in just one line, and then adding the GeoJSON to a map in Leaflet can also be done in one line:

geojson([-73.94, 40.67], function(data) {
    // add the geojson to a Leaflet map
    L.geoJson(data).addTo(map)
});

This call results in the callback receiving GeoJSON data that looks something like:

{
  "type": "FeatureCollection",
  "features": [
    {
      "id": "http://en.wikipedia.org/wiki/New_York_City",
      "type": "Feature",
      "properties": {
        "name": "New York City"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.94,
          40.67
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Kingston_Avenue_(IRT_Eastern_Parkway_Line)",
      "type": "Feature",
      "properties": {
        "name": "Kingston Avenue (IRT Eastern Parkway Line)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9422,
          40.6694
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Crown_Heights_–_Utica_Avenue_(IRT_Eastern_Parkway_Line)",
      "type": "Feature",
      "properties": {
        "name": "Crown Heights – Utica Avenue (IRT Eastern Parkway Line)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9312,
          40.6688
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Brooklyn_Children's_Museum",
      "type": "Feature",
      "properties": {
        "name": "Brooklyn Children's Museum"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9439,
          40.6745
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/770_Eastern_Parkway",
      "type": "Feature",
      "properties": {
        "name": "770 Eastern Parkway"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9429,
          40.669
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Eastern_Parkway_(Brooklyn)",
      "type": "Feature",
      "properties": {
        "name": "Eastern Parkway (Brooklyn)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9371,
          40.6691
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Paul_Robeson_High_School_for_Business_and_Technology",
      "type": "Feature",
      "properties": {
        "name": "Paul Robeson High School for Business and Technology"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.939,
          40.6755
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Pathways_in_Technology_Early_College_High_School",
      "type": "Feature",
      "properties": {
        "name": "Pathways in Technology Early College High School"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.939,
          40.6759
        ]
      }
    }
  ]
}
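
To give a sense of what the conversion involves, here is a minimal sketch that turns a GeoData `geosearch` response into a FeatureCollection shaped like the one above. It is not the wikigeo.js source: the function name `toGeoJSON` is hypothetical, and it assumes the response lists pages under `query.geosearch` with `title`, `lat` and `lon` fields.

```javascript
// Sketch only: convert a GeoData geosearch response into a GeoJSON
// FeatureCollection. This is not the wikigeo.js implementation.
function toGeoJSON(response) {
  var pages = (response.query && response.query.geosearch) || [];
  return {
    type: "FeatureCollection",
    features: pages.map(function (page) {
      return {
        // Article URLs use underscores in place of spaces.
        id: "http://en.wikipedia.org/wiki/" + page.title.replace(/ /g, "_"),
        type: "Feature",
        properties: { name: page.title },
        geometry: {
          type: "Point",
          // GeoJSON coordinate order is [longitude, latitude].
          coordinates: [page.lon, page.lat]
        }
      };
    })
  };
}
```

The detail most worth noticing is the coordinate order: the API reports `lat` and `lon` separately, while GeoJSON positions put longitude first.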

There are options for broadening the radius, increasing the number of results, and fetching additional properties of the Wikipedia article such as article summaries, images, categories, templates used. Here’s an example using all the knobs:

geojson(
  [-73.94, 40.67],
  {
    limit: 5,
    radius: 1000,
    images: true,
    categories: true,
    summaries: true,
    templates: true
  },
  function(data) {
    L.geoJson(data).addTo(map)
  }
);

Which results in GeoJSON like this (abbreviated):

{
  "type": "FeatureCollection",
  "features": [
    {
      "id": "http://en.wikipedia.org/wiki/Silver_Spring,_Maryland",
      "type": "Feature",
      "properties": {
        "name": "Silver Spring, Maryland",
        "image": "Downtown_silver_spring_wayne.jpg",
        "templates": [
          "-",
          "Abbr",
          "Ambox",
          "Ambox/category",
          "Ambox/small",
          "Basepage subpage",
          "Both",
          "Category handler",
          "Category handler/blacklist",
          "Category handler/numbered"
        ],
        "summary": "Silver Spring is an unincorporated area and census-designated place (CDP) in Montgomery County, Maryland, United States. It had a population of 71,452 at the 2010 census, making it the fourth most populous place in Maryland, after Baltimore, Columbia, and Germantown.\nThe urbanized, oldest, and southernmost part of Silver Spring is a major business hub that lies at the north apex of Washington, D.C. As of 2004, the Central Business District (CBD) held 7,254,729 square feet (673,986 m2) of office space, 5216 dwelling units and 17.6 acres (71,000 m2) of parkland. The population density of this CBD area of Silver Spring was 15,600 per square mile all within 360 acres (1.5 km2) and approximately 2.5 square miles (6 km2) in the CBD/downtown area. The community has recently undergone a significant renaissance, with the addition of major retail, residential, and office developments.\nSilver Spring takes its name from a mica-flecked spring discovered there in 1840 by Francis Preston Blair, who subsequently bought much of the surrounding land. Acorn Park, tucked away in an area of south Silver Spring away from the main downtown area, is believed to be the site of the original spring.\n\n",
        "categories": [
          "All articles to be expanded",
          "All articles with dead external links",
          "All articles with unsourced statements",
          "Articles to be expanded from June 2008",
          "Articles with dead external links from July 2009",
          "Articles with dead external links from October 2010",
          "Articles with dead external links from September 2010",
          "Articles with unsourced statements from February 2007",
          "Articles with unsourced statements from May 2009",
          "Commons category template with no category set"
        ]
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -77.019,
          39.0042
        ]
      }
    },
    ...
  ]
}
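
The extra properties are what make markers like ici's red cameras and orange flags possible. The following is a hedged sketch of one way to pick a marker from a feature, assuming that a missing `image` means the article needs a photograph and that maintenance category names like the ones above signal a need for edits. The function name, the marker labels and the category patterns are all hypothetical, not ici's actual rules.

```javascript
// Hypothetical classifier, not taken from ici: choose a marker type from
// the optional properties on a GeoJSON feature.
var NEEDS_EDITS = /unsourced statements|to be expanded|dead external links/i;

function markerType(feature) {
  var props = feature.properties || {};
  if (!props.image) {
    return "needs-image";
  }
  var categories = props.categories || [];
  var flagged = categories.some(function (name) {
    return NEEDS_EDITS.test(name);
  });
  return flagged ? "needs-edits" : "ok";
}
```

A function like this could be called from the `pointToLayer` option of Leaflet's `L.geoJson` to choose an icon per feature.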

I guess this is a long way of saying, if you want to put Wikipedia articles on a map, or otherwise need GeoJSON for Wikipedia articles for a particular location, take a look at wikigeo.js. If you do, and have ideas for making it better, please let me know. Oh, by the way you can npm install wikigeo and use it from Node.js.

I guess JavaScript, HTML5, NodeJS, CoffeeScript are like my midlife crisis…my red sports car. But maybe being the old guy, and losing my edge isn’t really so bad?

I’m losing my edge
to better-looking people
with better ideas
and more talent
and they’re actually
really, really nice.
James Murphy

It definitely helps when the kids coming up from behind have talent and are really, really nice. You know?

genealogy of a braeburn

It has been observed that when systems break down we get to actually see how they operate. I wonder what this breakage below says about the use of Freebase and Wikipedia data in Google’s Knowledge Graph.

Yes, that’s an image of Braeburn from My Little Pony to the right, and text about the apple to the left. Interestingly it’s fine at Wikipedia:

And it’s not even there in Freebase (according to a search).

I don’t know if this reveals what’s going on in the flow of entities between Wikipedia, Freebase and Google. But I thought it was interesting. I wonder where to report such an anomaly. Is there a place?

Thanks to Jeff Godin in #code4lib for noticing the breakage in Knowledge Graph.

See also Hilary Mason’s post about how her identity got mixed up on Bing. (Thanks Chris).

Update: 2012-02-04

I thought to check a week later, and the Knowledge Graph results got even funnier. Now it’s a collage of apples and My Little Pony:

archiving wikitweets

Earlier this year I created a little toy webapp called wikitweets that uses the Twitter streaming API to identify tweets that reference Wikipedia, which it then displays realtime in your browser. It was basically a fun experiment to kick the tires on NodeJS and SocketIO using a free, single process Heroku instance.

At the time I announced the app on the wiki-research-l discussion list to see if anyone was interested in it. Among the responses I received were ones from Emilio Rodríguez-Posada and Taha Yasseri, who asked whether the tweets are archived as they stream by. This struck a chord with me, since I’m a software developer working in the field of “digital preservation”. You know that feeling when you suddenly see one of your huge gaping blindspots? Yeah.

Anyway, some 6 months or so later I finally got around to adding an archive function to wikitweets, and I thought it might be worth writing about very quickly. Wikitweets uses the S3 API at Internet Archive to store a file every 1000 tweets. So you can visit this page at Internet Archive and download the tweets. Now I don’t know how long Internet Archive is going to be around, but I bet it will be longer than inkdroid.org, so it seemed like a logical (and free) safe harbor for the data.
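
The batching itself is simple enough to sketch. This is not the wikitweets source: `createArchiver` and the `store(name, tweets)` callback are hypothetical names, and the UTC timestamp file name is an assumption modeled on names like `20120919030946.json`.

```javascript
// Sketch of batching, assuming a hypothetical store(name, tweets) function
// that uploads one batch. This is not the wikitweets implementation.
function createArchiver(batchSize, store) {
  var buffer = [];

  // Build a name like "20120919030946.json" from a Date, using UTC.
  function batchName(date) {
    return date.toISOString().replace(/\D/g, "").slice(0, 14) + ".json";
  }

  // Returns a function that collects tweets and hands off full batches.
  return function add(tweet, now) {
    buffer.push(tweet);
    if (buffer.length >= batchSize) {
      var batch = buffer;
      buffer = [];  // start a fresh batch before handing this one off
      store(batchName(now || new Date()), batch);
    }
  };
}
```

Swapping in a new buffer before calling `store` means tweets that arrive during a slow upload land in the next batch rather than being lost or duplicated.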

In addition to being able to share the files Internet Archive also makes a BitTorrent seed available, so the data can easily be distributed around the Internet. For example you could open wikitweets_archive.torrent in your BitTorrent client and download a copy of the entire dataset, while providing a redundant copy. I don’t really expect this to happen much with the wikitweets collection, but it seems to be a practical offering in the Lots of Copies Keeps Stuff Safe category.

I tried to coerce several of the seemingly excellent s3 libraries for NodeJS to talk to the Internet Archive, but ended up writing my own very small library that works specifically with Internet Archive. ia.js is bundled as part of wikitweets, but I guess I could put it on npm if anyone is really interested. It gets used by wikitweets like this:

  var c  = ia.createClient({
    accessKey: config.ia_access_key,
    secretKey: config.ia_secret_key,
    bucket: config.ia_bucket
  });
 
  // keep the object name in a variable so the callback can report it
  var name = "20120919030946.json";
  c.addObject({name: name, value: tweets}, function() {
    console.log("archived " + name);
  });

The nice thing is that you can use s3 libraries that have support for Internet Archive, like boto, to programmatically pull down the data. For example, here is a Python program that goes through each file and prints out the Wikipedia article title that is referenced by each tweet:

  import json
  import boto
 
  ia = boto.connect_ia()
  bucket = ia.get_bucket("wikitweets")
 
  for keyfile in bucket:
      content = keyfile.get_contents_as_string()
      for tweet in json.loads(content):
          print(tweet['article']['title'])

The archiving has only been running for the last 24 hours or so, so I imagine there will be tweaks that need to be made. I’m considering compression of the tweets as one of them. Also it might be nice to put the files in subdirectories, but it seemed that Internet Archive’s API wanted to URL encode object names that have slashes in them.
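
Compression would be a small change on the NodeJS side. Here is a minimal sketch using the `zlib` module that ships with Node; the helper names are hypothetical, and wikitweets does not currently do this.

```javascript
var zlib = require("zlib");

// Sketch only: gzip a batch of tweets before upload.
function compressTweets(tweets) {
  return zlib.gzipSync(JSON.stringify(tweets));
}

// Reverse the step above: gunzip a buffer and parse the JSON inside.
function decompressTweets(buffer) {
  return JSON.parse(zlib.gunzipSync(buffer).toString("utf8"));
}
```

Anything reading the archive would then need to gunzip each file before parsing it as JSON.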

If you have any suggestions I’d love to hear them.