GoodReads microdata

I’m not sure how long it has been there, but I just happened to notice that GoodReads (the popular social networking site where book lovers share what they are reading and have read) has implemented HTML5 Microdata to make metadata for books available in the HTML of their pages. GoodReads has chosen to use the Book type from the schema.org vocabulary, most likely because the big three search engines (Google, Bing and Yahoo) announced that they will use the metadata to enhance their search results. So web publishers are motivated to publish metadata in their pages, not because it’s the “right” thing to do, but because they want to drive traffic to their websites.

If you are new to HTML5 Microdata, schema.org and what it means for books, check out Eric Hellman’s recent post Spoonfeeding Library Data to Search Engines. And if you are generally curious about HTML5 Microdata, the chapter in Mark Pilgrim’s Dive into HTML5 is really quite good.

But Microdata and schema.org are not just good for the search engines; they are actually good for the Web ecosystem, and for hackers like you and me. For example, go to the GoodReads page for Alice in Wonderland.

If you view source on the page, and search for itemtype or itemprop you’ll see the extra Microdata markup. The latest HTML5 specification has a section on how to serialize microdata as JSON, and the processing model is straightforward enough for me to write a parser on top of html5lib in less than 200 lines of Python. So this means you can:

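# Python 3; assumes the microdata module (built on html5lib) is installed
from urllib.request import urlopen

import microdata

url = "http://www.goodreads.com/book/show/24220.Alice_s_Adventures_in_Wonderland_and_Through_the_Looking_Glass"
# Fetch the page and extract its top-level microdata items
items = microdata.get_items(urlopen(url))

# Serialize the first item (the Book) as JSON
print(items[0].json())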

and you’ll see:

{
  "numberOfPages": [
    "400"
  ],
  "isbn": [
    "9780141439761"
  ],
  "name": [
    "Alice's Adventures in Wonderland and Through the Looking-Glass"
  ],
  "author": [
    {
      "url": [
        "http://www.goodreads.com/author/show/8164.Lewis_Carroll",
        "http://www.goodreads.com/author/show/495248.Hugh_Haughton",
        "http://www.goodreads.com/author/show/180439.John_Tenniel"
      ],
      "type": "http://schema.org/Person"
    }
  ],
  "image": [
    "/book/photo/24220.Alice_s_Adventures_in_Wonderland_and_Through_the_Looking_Glass"
  ],
  "inLanguage": [
    "English"
  ],
  "ratingValue": [
    "4.03"
  ],
  "ratingCount": [
    "64,628 ratings"
  ],
  "bookFormatType": [
    "Paperback"
  ],
  "type": "http://schema.org/Book"
}
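Note that every value comes back as a string inside a list, and some of them (the rating count, the relative image path) need a little cleanup before you can use them. Here is a minimal sketch of that cleanup, assuming the JSON shape shown above. The sample is a trimmed copy of that output so it runs offline, and the helper names first and tidy_book are made up for illustration; they are not part of any library.

```python
import json
import re
from urllib.parse import urljoin

# Trimmed copy of the JSON shown above, so the example runs offline.
SAMPLE = """
{
  "numberOfPages": ["400"],
  "isbn": ["9780141439761"],
  "name": ["Alice's Adventures in Wonderland and Through the Looking-Glass"],
  "image": ["/book/photo/24220.Alice_s_Adventures_in_Wonderland_and_Through_the_Looking_Glass"],
  "ratingValue": ["4.03"],
  "ratingCount": ["64,628 ratings"],
  "type": "http://schema.org/Book"
}
"""


def first(item, prop):
    """Return the first value of a property, or None if it is absent."""
    values = item.get(prop) or [None]
    return values[0]


def tidy_book(item, base_url):
    """Hypothetical helper: convert string-valued properties to typed fields."""
    pages = first(item, "numberOfPages")
    rating = first(item, "ratingValue")
    # "64,628 ratings" -> "64628": keep only the digits
    digits = re.sub(r"[^0-9]", "", first(item, "ratingCount") or "")
    image = first(item, "image")
    return {
        "name": first(item, "name"),
        "isbn": first(item, "isbn"),
        "pages": int(pages) if pages else None,
        "rating": float(rating) if rating else None,
        "rating_count": int(digits) if digits else None,
        # The image path is relative, so resolve it against the page URL.
        "image": urljoin(base_url, image) if image else None,
    }


book = tidy_book(json.loads(SAMPLE), "http://www.goodreads.com/book/show/24220")
print(book["rating_count"], book["image"])
```

In real use you would pass in json.loads(items[0].json()) and the URL of the page you fetched instead of the sample.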

If you have spent a bit of time writing screen scrapers in the past, this should make your jaw drop a bit. What’s more, they’ve also added Microdata to the search results page, so you can see metadata for all the books in the results, for example using Google’s Rich Snippets Testing Tool.

Funnily enough, while I was writing this blog post, over in the #code4lib IRC chat room Chris Beer brought up the fact that some Blacklight developers were concerned that <link rel="unapi-server"> wasn’t valid HTML5. Chris was wondering if anyone was interested in trying to register “unapi-server” with the WHATWG…

&crickets;

Issues of “valid” HTML5 aside, this discussion highlighted for me just how far the world of metadata on the Web has advanced since a small group of library hackers worked on unAPI. The use of HTML5 Microdata and schema.org by Google, Bing and Yahoo, and the use of RDFa by Facebook, are great examples of mainstream solutions to what some of us were trying to achieve with unAPI. Seeing sites like GoodReads implement Microdata, and announcements like Opera’s support for Microdata, are good reminders that the library software development community is best served by paying attention to mainstream solutions as they become available, even if they eclipse homegrown stopgap solutions.

It is somewhat problematic that Facebook has aligned with RDFa and the Open Graph Protocol, while Google, Bing and Yahoo have aligned with HTML5 and schema.org. Incidentally, GoodReads has implemented both. I heard a rumor that Facebook was invited to the schema.org table and declined. I have no idea if that is actually true. I have also heard a rumor that Ian Hickson of Google wrote up the Microdata spec in a weekend because he hates RDFa. I don’t know if that’s actually true either. The company and personality rivalries aside, if you are having trouble deciding which one to more fully support, try writing a program to parse RDFa and Microdata. It will probably help clarify some things…
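To give a feel for the Microdata half of that exercise, here is a rough sketch of an extractor. It is not the html5lib-based parser mentioned earlier: it uses the standard library's html.parser instead, and the class name MicrodataSketch and the sample markup (including the example.org URL) are invented for illustration. It also ignores parts of the spec such as itemref, itemid, the time element and relative URL resolution, so treat it as a toy rather than a conforming implementation.

```python
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Elements whose property value comes from an attribute, not their text.
VALUE_ATTR = {"a": "href", "area": "href", "link": "href",
              "img": "src", "meta": "content"}


class MicrodataSketch(HTMLParser):
    """Toy Microdata extractor; assumes reasonably well-formed markup."""

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found so far
        self.stack = []   # one frame per open (non-void) element
        self.scopes = []  # items currently open, innermost last

    def _add(self, names, value):
        # itemprop may hold several space-separated property names.
        if self.scopes:
            for name in names.split():
                self.scopes[-1].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        frame = {"tag": tag, "scope": False, "prop": None, "text": []}
        if "itemscope" in attrs:
            # A new item: nested if it is a property of an open item.
            item = {"type": attrs.get("itemtype")}
            if prop and self.scopes:
                self._add(prop, item)
            else:
                self.items.append(item)
            self.scopes.append(item)
            frame["scope"] = True
        elif prop and tag in VALUE_ATTR:
            self._add(prop, attrs.get(VALUE_ATTR[tag]) or "")
        elif prop:
            frame["prop"] = prop  # value is the text, known at the end tag
        if tag not in VOID:
            self.stack.append(frame)
        elif frame["scope"]:
            self.scopes.pop()  # a void element cannot contain properties

    def handle_data(self, data):
        # Text belongs to every open element that is collecting a value.
        for frame in self.stack:
            if frame["prop"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or all(f["tag"] != tag for f in self.stack):
            return  # stray end tag, ignore it
        while True:
            frame = self.stack.pop()
            if frame["scope"]:
                self.scopes.pop()
            elif frame["prop"]:
                self._add(frame["prop"], "".join(frame["text"]).strip())
            if frame["tag"] == tag:
                break


SAMPLE = """
<div itemscope itemtype="http://schema.org/Book">
  <h1 itemprop="name">Alice's Adventures in Wonderland</h1>
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <a itemprop="url" href="http://example.org/author/carroll">Lewis Carroll</a>
  </span>
  <meta itemprop="isbn" content="9780141439761">
</div>
"""

parser = MicrodataSketch()
parser.feed(SAMPLE)
parser.close()
items = parser.items
print(items)
```

Most of the work is bookkeeping about which item is currently open. Writing the equivalent for RDFa, with its prefixes and its larger set of attributes, makes for an instructive comparison.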