I don’t claim to have invented it, because it already existed in nature. I just saw it, recognized the value of it, gave it a name, and a description, and showed its benefits. I don’t claim to be the only person to have discovered it.
Crockford is likeably humble about the origins of JSON. Rather than claiming he invented JSON, he says he discovered it, almost as if he were a naturalist on an expedition in some uncharted territory. Looking at the Web as an ecosystem with forms of life in it might seem like a stretched metaphor or a sci-fi plot, but I think it’s a useful and pragmatic stance. The Web is a complex space, and efforts to forcibly move it in particular directions often fail, even when big organizations and transnational corporations are behind them. Grafting, aligning, and cross-fertilizing technologies, while respecting the communities they grow from, will probably feel more chaotic, but it’s likely to yield better results in the long run.
Necessity is the Mother of Invention
Less is More
Crockford tried very hard to strip unnecessary stuff from JSON so it stood a better chance of being language independent. When confronted with pushback about JSON not being a “standard”, Crockford registered json.org, put up a specification that documented the data format, and declared it a standard. He didn’t expend a lot of energy in top-down mode, trying to convince industry and government agencies that JSON was the way to go. Software developers discovered it and started using it in their applications because it made their lives easier. Some people complained that the X in AJAX stood for XML, and therefore JSON should not be used. But this dogmatism faded into the background over time as the benefits of JSON were recognized.
JSON is not versioned, and has no mechanism for revision. JSON cannot change. This probably sounds like heresy to many folks involved in standardization. It is more radical than the WHATWG’s decision to remove the version number from HTML5, and it may only be possible because Crockford kept the JSON specification so small and tight. Crockford is realistic about JSON not solving all data representation problems, and speculates that we will see use cases JSON does not help solve. But when that happens, something new will be created instead of extending JSON. This relieves the burden of backwards compatibility that can often drag projects down into a quagmire of complexity: software designed to work with JSON will always work with whatever valid JSON is found in the wild.
NPR Morning Edition recently ran an interview with Teju Cole about his most recent project called Small Fates. Cole is the recipient of the 2012 Hemingway Foundation/PEN Award for his novel Open City. Small Fates is a series of poetic snippets from Nigerian newspapers, which Cole has been posting on Twitter. It turns out Small Fates draws on a tradition of compressed news stories known as fait divers. The interview is a really nice description of the poetic use of this material to explore time and place. In some ways it reminds me a little of the cut-up technique that William S. Burroughs popularized, albeit in a more lyrical, less dadaist form.
At about the 3 minute mark in the interview Cole mentions that he has recently been drawing on content from historic New York newspapers in Chronicling America. For example:
Chronicling America is a software project I work on, so of course we were all really excited to hear Cole mention us on NPR. One thing we wondered was whether he could include shortened URLs pointing to the newspaper pages in Chronicling America in his tweets. Obviously this would be a clever publicity vehicle for the NEH-funded National Digital Newspaper Program. It would also allow the Small Fates reader to follow the tweet to the source material if they wanted more context.
Through the magic of Facebook, Twitter, good old email and Teju’s generosity I got in touch with him to see if he would be willing to include some shortened Chronicling America URLs in his tweets. His response indicated that he had clearly already thought about linking, but had decided not to. His reasons for not linking struck me as really interesting, and he agreed to let me quote them here:
I can’t include links directly in my tweets for three reasons.
The first is aesthetic: I like the way the tweets look as clean sentences. One wouldn’t wish to hyperlink a poem.
The second is artistic: I want people to stay here, not go off somewhere else and crosscheck the story. Why go through all the trouble of compression if they’re just going to go off and read more about it? What’s omitted from a story is, to me, an important part of a writer’s storytelling strategy.
And the third is practical: though I seldom use up all 140 characters, rarely do I have enough room left for a url string, even a shortened one.
I really loved this artistic (and pragmatic) rationale for not linking, and thought you might too.
… the fact that the Smithsonian has gone to CC0 is actually a great step. It means that data owners inside the USG have the latitude to use tools that put USG works into a legal status outside the US that is interoperable with their public domain status inside the US, and that’s an unalloyed Good Thing in my view.
While I helped prototype and bring the first version of id.loc.gov online, the licensing of the data was a persistent question that I heard from people who wanted to use the data. The about page at id.loc.gov currently says:
The Library of Congress has prepared this vocabulary terminology system and is making it available as a public domain data set. While it has attempted to minimize inaccuracies and defects in the data and services furnished, THE LIBRARY OF CONGRESS IS PROVIDING THIS DATA AND THESE SERVICES “AS IS” AND DISCLAIMS ANY AND ALL WARRANTIES, WHETHER EXPRESS OR IMPLIED.
But as John detailed in his post, this isn’t really enough for folks outside the US. I think even for parties inside the US, a CC0 dedication would add more confidence to using the data set in different contexts, and would help align the Library of Congress more generally with the Web. Last Friday Eric Miller of Zepheira spoke about Linked Data at the Library of Congress (eventually the talk should be made available). Near the end of his talk he focused on things that need to be worked on, and I was glad to hear him stress that work needed to be done on licensing. The issue is really nothing new, and it transcends the Linked Data space. I’m not saying it’s easy, but I agree with everyone who says it’s important to focus on, and it’s great to see the advances that others in the federal government are making.
Git and GitHub
The other fun part of the Smithsonian announcement was the use of GitHub as a platform for publishing the data. To do this Cooper-Hewitt established an organizational account on GitHub, which you might think is easy, but is actually no small achievement by itself for an organization in the US federal government. With the account in hand, the collection project was created and the collection metadata was released as two CSV files (media.csv and objects.csv) by Micah Walter. The repository was then forked by Aaron Straup Cope, who added some Python scripts for converting the CSV files into record-based JSON files. In the comments on the Cooper-Hewitt Labs blog post Aaron explained why he chose to break the CSV up into JSON. The beautiful thing about using Git and GitHub this way for data is that you get a history view like this:
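Aaron’s actual scripts aren’t reproduced here, but the basic move of exploding a CSV into one JSON file per record can be sketched like this (the column names and the per-file naming scheme below are assumptions, not the real Cooper-Hewitt schema):

```python
import csv
import io
import json

def rows_to_records(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def write_record_files(records, key="id"):
    """Write each record to its own JSON file, named by its key column."""
    for record in records:
        with open("%s.json" % record[key], "w") as f:
            json.dump(record, f, indent=2, sort_keys=True)

if __name__ == "__main__":
    sample = "id,title,medium\n100,Vase,porcelain\n101,Teapot,silver\n"
    write_record_files(rows_to_records(sample))
```

One nice side effect of the record-per-file layout is that Git diffs then show which individual objects changed, rather than one opaque diff against a giant CSV.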
For digital preservation folks this view of what changed, when, and by whom is extremely important for establishing provenance. The fact that you get this for free by using the open source Git version control system and pushing your repository to GitHub is very compelling.
Over the past couple of years there has been quite a bit of informal discussion in the digital curation community about using Git for versioning data. Just a couple weeks before the Smithsonian announcement Stephanie Collett and Martin Haye from the California Digital Library reported on the use of Git and Mercurial to version data at Code4lib 2012.
This post was composed over at Storify and exported here.
Because of stuff I’ve been doing at work lately, and some recent conversations at code4lib in Seattle, I’ve been getting more and more interested in archival description and the Web. When I first ran across Storify it seemed like it might provide some useful user interface ideas for archival description. I’ve been thinking about how web content such as Wikipedia articles, authority records, etc. could be easily referenced while composing descriptive text about a collection. And once this content has been referenced, how can it be baked in so that it remains usable in the future?
Recently I stumbled upon (pun intended) a topic to try out Storify with: the emerging conversation on Twitter and in blogs about Web curation. I know, how meta, right? As a software developer working in the cultural heritage sector, my interest in curation was piqued some time ago. But until just now I was completely oblivious to the emerging debate about new mechanics for expressing attribution on the Web. I actually ran across it because this tweet from Matt Langer flitted across my TweetDeck:
The tweet led me over to his blog post on Gizmodo, which rankled my anti-authoritarian sensibilities a bit. This statement in particular got the blood pumping:
“Curation” is an act performed by people with PhDs in art history; the business in which we’re all engaged when we’re tossing links around on the Internet is simple “sharing”.
But getting into an argument about the semantics of “curation” doesn’t seem particularly appealing or useful. One of the reasons why I think “curation” works for Curate Camp, International Journal of Digital Curation and elsewhere is that it has somewhat loose semantics, which allows useful collaboration and conversation to spring up around it. Saying you need a PhD to do curation makes me mad, probably because I don’t have one. Maybe it was a joke. Anyhow, moving on.
Speaking of semantics Langer goes on to say:
But we should not delude ourselves for a moment into bestowing any special significance on this, because when we do this thing that so many of us like to call “curation” we’re not providing any sort of ontology or semantic continuity beyond that of our own whimsy or taste or desire.
So it was time to actually look at the Curator’s Code itself. The instructions are pretty brief: use “via” and “HT”, or their unicode equivalents ᔥ and ↬ respectively. I’ve already been subconsciously using “via” for some time now, so a little bit of discussion about it seems like a good idea.
Maria Popova has a more detailed description of the rationale behind the use of unicode characters. Strangely, the discussion didn’t mention what I thought was going to be the primary reason for them: brevity. There have been similar efforts to use special unicode characters on Twitter (where real estate is scarce) before, and there are already bookmarklets for easily creating links that use the correct unicode glyphs. But this led me to another post:
Ingram’s essential point is that if the past is any guide the Web will route around efforts to control the way citations are made, and more importantly that:
… we already have a tool for providing credit to the original source — it’s called the hyperlink.
Which I strongly agree with. That being said, we have seen some pretty wide deployment and use of mechanisms like the rel=license microformat for expressing the license for a piece of content. Of course, typed links between resources are nothing new; they have been the much-muddied central message of the Semantic Web movement.
hope the #curation community supports upcoming #w3c spec on provenance - signal attribution, quotation and source w3.org/blog/SW/2011/10/23/…
Groth mentions some serious work that has been going on at the W3C for expressing provenance on the Web. The challenge here, I think, is to have something with the simplicity of a microformat for expressing these semantics. Perhaps some additions to the Link Relations Registry that would let HTML authors use rel="via" or whatever… This seems a bit more sensible to me than expecting people to all use the same obscure unicode characters, at any rate.
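In markup, such a link relation might look something like this; note that "via" is not a registered link relation, and the URL is made up, so this is purely a hypothetical sketch:

```html
<p>First spotted <a rel="via" href="http://example.com/original-post">via this post</a>.</p>
```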
I guess the back story here is that a lot of this is the result of conversations going on at SXSW.
Last week Liam Wyatt emailed me asking if I could add The National Museum of Australia to Linkypedia, which tracks external links from Wikipedia articles to specific websites. Specifically Liam was interested in seeing the list of articles that reference the National Museum, sorted by how much they are viewed at Wikipedia. This presented me with two problems:
1. I turned Linkypedia off a few months ago, since the site hadn’t been seeing much traffic, and I haven’t yet figured out how to keep it going on the paltry Linode VPS I’m using for other things like this blog.
2. I hadn’t incorporated Wikipedia page view statistics into Linkypedia, because I didn’t know they were available; and even if I had, Liam’s idea of using them in this way hadn’t occurred to me.
2011 was the year of streaming music for me, specifically using Rdio. Being able to follow what friends and folks I admire are listening to, easily listen along, and then build my own online collection in the cloud was a revelation. Being able to do it easily from my desktop at home, at work, or on my mobile device for $5/month was just astounding. The world ain’t perfect, but this is damn near close.
Anyhow, here’s some of my favorite music from 2011, in no particular order … a lot of which I probably wouldn’t have listened to if it wasn’t streamable on the Web. You might have to wait a few seconds while the YouTube clips load.
I got a Kindle Touch today for Christmas (thanks Kesa!). Admittedly I’m pretty late to this party. As I made ready to purchase my first ebook I hopped over to my GoodReads to-read list to pick something out. I scanned the list quickly, and my eye came to rest on Stephen Ramsay’s recent book Reading Machines. But I got hung up on something irrelevant: the subtitle was Toward and Algorithmic Criticism instead of Toward an Algorithmic Criticism, the latter of which is clearly correct based on the cover image.
Having recently looked at API services for book data I got curious about how the title appeared on other popular web properties, such as Amazon:
I wasn’t terribly surprised not to find it on OpenLibrary. But it does seem interesting that the exact same typo is present on all these book websites as well, while the title appears correct on the publisher’s website:
It’s hard to tell for sure, but my guess is that Amazon, Barnes & Noble, and GoogleBooks got the error from Bowker Link (the Books in Print data service), and that LibraryThing then picked up the data from Amazon, and similarly GoodReads picked up the data from GoogleBooks. LibraryThing can pull data from a variety of sources, including Amazon; and I’m not entirely sure where GoodReads gets their data from, but it seems likely that it comes from the GoogleBooks API given other tie-ins with Google.
If you know more about the lineage of the data in these services I would be interested to hear it. Specifically, if you have a subscription to Bowker Link it would be great if you could check the title. It would be nice to live in a world where these sorts of data provenance issues were easier to trace.
So the latest in my experiments is nytimestream, which is a visualization (ok, it’s just a list) of New York Times headlines using the Times Newswire API. When I saw Derek Willis recently put some work into a Ruby library for the API, I got to thinking what it might be like to use Node.js and Socket.IO to provide a push stream of updates. It didn’t take too long. I actually highly doubt anyone is going to use nytimestream much, so you might be wondering why I bothered to create it at all. I guess it was more of an academic exercise than anything, reinforcing some things that Node.js has been teaching me.
Normally if you wanted a web page to dynamically update based on events elsewhere you’d have some code running in the browser routinely poll a webservice for updates. In this scenario our clients (c1, c2 and c3) poll the Times Newswire directly:
But what happens if lots of people start using your application? Yup, you get lots of requests going to the web service…which may not be a good thing, particularly if you are limited to a certain number of requests per day.
So a logical next step is to create a proxy for the webservice, which will reduce hits on the Times Newswire API.
But still, the client code needs to poll for updates. This can result in the proxy web service fielding lots of requests as the number of clients increases. You can poll less often, but that diminishes the real-time nature of your app. If real-time updates were the point in the first place, this probably won’t seem like a great solution.
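The polling loop described above boils down to something like this (the function and field names here are made up for illustration; nytimestream’s actual code is not shown):

```javascript
// The poller remembers the newest item it has seen and returns only
// items that have arrived since the last call.
var lastSeen = null;

function freshItems(items) {
  // items are assumed sorted oldest-to-newest, each with a numeric id
  var fresh = items.filter(function (item) {
    return lastSeen === null || item.id > lastSeen;
  });
  if (items.length > 0) {
    lastSeen = items[items.length - 1].id;
  }
  return fresh;
}

// In a browser you would run this on a timer, for example:
// setInterval(function () {
//   fetchLatest(function (items) { freshItems(items).forEach(render); });
// }, 60 * 1000);
```

Every client runs this timer whether or not anything has changed, which is exactly why the request volume grows with the number of clients.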
So what if you could have the proxy web service push updates to the clients when it discovers an update?
This is basically what an event-driven webservice application allows you to do (labelled NodeJS in the diagram above). The Socket.IO library for Node provides a really nice abstraction for streaming updates to the browser. If you view source on nytimestream you’ll see a bit of code like this:
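Something along these lines (the exact snippet on nytimestream may differ; the "story" event name and the story fields here are assumptions):

```javascript
// Format an incoming story object as a single line of text.
function storyLine(story) {
  return story.title + " [" + story.url + "]";
}

// Socket.IO pushes each new story to the browser as it is discovered,
// so there is no polling loop at all. The guard lets this file load
// outside a browser without the Socket.IO client present.
if (typeof io !== "undefined") {
  var socket = io.connect();
  socket.on("story", function (story) {
    var li = document.createElement("li");
    li.textContent = storyLine(story);
    var list = document.getElementById("stories");
    list.insertBefore(li, list.firstChild); // newest story on top
  });
}
```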
Just because I could, I added a little easter egg view in nytimestream, which allows you to see new stories come across the wire as JSON when nytimestream discovers them. It’s similar to Twitter’s streaming API in that you can call it with curl; it’s different in that, well, there’s hardly the same volume of updates. Try it out with curl if you like.
The occasional newlines are there to prevent the connection from timing out.
Thanks to everyone that noticed the Wikistream coverage in the NextWeb article and elsewhere. If you happen to have tweeted about Wikistream in the last 2 days you should see your avatar to the left. Click on it to make it bigger. I’m in there somewhere too :-)
Before Sunday the site hadn’t seen more than 180 unique visitors per day, and on Monday it saw almost 30,000. The site is kind of different since it streams all the Wikipedia updates to the browser as JSON, where they are then displayed. I had some nail-biting moments as I watched Node stream to as many as 300 concurrent connections. It was a wild ride for my little Linode VPS with 512MB of RAM, where WordPress and a Django website were also running…but it seemed to weather the storm OK. Mostly I think I could have used more RAM during peak usage, when Node and Redis wanted enough memory to make the system swap. Thanks to Gabe and Chris for helping me get the cache headers set right in Express.
I thought briefly about upgrading to a larger Linode instance, putting the app on EC2, or maybe asking Wikimedia if they wanted to host it. But Wikistream is really more a piece of performance art than a useful website. I’m expecting people who have looked at Wikistream once will have seen how actively Wikipedia is edited, and won’t feel compelled to look again. After a few days I expect the usage to plummet, and it can go back to running comfortably on my little Linode VPS to serve as a live prop in presentations about Wikipedia, crowd-sourcing, Web culture, etc.
One of my favorite mentions of Wikistream came from Nat Torkington’s Four Short Links on O’Reilly Radar. Nat described Wikistream as
fascinating and hypnotic and inspirational and appalling and irrelevant all at once
I took this as high praise, of course. I could only get the last two days out of Twitter’s search API, which misses the day the NextWeb article appeared and was then picked up on Hacker News. But it was 226 tweets, and it provided a fun little data set to look at. I wrote a little script to look for URLs in the tweets, unshorten them, and come up with a list of web pages that mentioned Wikistream in the past few days. One thing that was really interesting to me was the predominance of non-English websites. Here’s a list of some of them if you are interested.
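The script itself isn’t shown here, but the URL-hunting part boils down to something like this (the unshortening step needs network access, so it is only sketched in a comment):

```python
import re

# A deliberately loose pattern: grab anything that looks like a URL.
URL_RE = re.compile(r"https?://\S+")

def extract_urls(tweet_text):
    """Return all URLs found in a tweet's text."""
    return URL_RE.findall(tweet_text)

# Unshortening means following redirects to the final URL, for example:
# import urllib.request
# final_url = urllib.request.urlopen(short_url).geturl()
```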
The field of software development, the Web and libraries is changing so fast that there is no way to know everything I need to know to do my job well. Wikipedia continues to be an essential resource for learning about technologies, algorithms, people, and history related to my work. It’s hard to imagine what the world would be like without it. Thanks for another year of awesome Wikipedia! The check is in the mail; well OK it’s actually coming from PayPal … you know the drill.