Passport Photos

He would see faces in movies, on TV, in magazines, and in books. He thought that some of these faces might be right for him. And through the years, by keeping an ideal facial structure fixed in his mind, or somewhere in the back of his mind, that he might, by force of will, cause his face to approach those of his ideal.

The change would be very subtle. It might take ten years or so. Gradually his face would change its shape. A more hooked nose. Wider, thinner lips. Beady eyes. A larger forehead. He imagined that this was an ability he shared with most other people. They had also molded their faces according to some ideal. Maybe they imagined that their new face would better suit their personality. Or maybe they imagined that their personality would be forced to change to fit the new appearance. This is why first impressions are often correct.

Although some people might have made mistakes. They may have arrived at an appearance that bears no relationship to them. They may have picked an ideal appearance based on some childish whim, or momentary impulse. Some may have gotten halfway there, and then changed their minds. He wonders if he too might have made a similar mistake.

Seen and Not Seen by Talking Heads

metadata from Getty’s Open Content Program (part 2)

A few weeks ago I wrote a brief post about the embedded metadata found in images from the (awesome) Getty Open Content Program. This led to a useful exchange with Brenda Podemski on Twitter, which she gathered together with Storify. I promised her I would write another blog post that showed how the metadata could be expressed a little bit better.

It’s hard to read RDF as XML and Turtle isn’t for everyone, so here’s a picture of part of the XMP RDF metadata that is included in the highres download for a photo by Eugène Atget of a sculpture Bosquet de l’Arc de Triomphe by Jean-Baptiste Tuby. I haven’t portrayed everything in the file since it would clutter up the point I’m trying to make.

Original Description

Depicted here are two resources described in the RDF: the JPEG file itself and what the IPTC vocabulary calls an Artwork or Object. Now, it is good that the description distinguishes between the file and the photograph. The Dublin Core somewhat famously (in metadata circles) calls this the One-To-One Principle. But notice how there is a dc:description attached to the file resource with lots of useful information concatenated together as a string? My question to Brenda was whether that string was actually available as structured data, and could it be expressed differently? Her response seemed to indicate that it was.

My suggestion is to unpack and move that concatenated string to describe the photograph, like so:

Unpacked description

Notice how the dimensions, format and type were broken out into separate assertions about the photograph? I also quickly modified the description to use the Dublin Core vocabulary since it was more familiar to me. I wasn’t able to quickly find good properties for height and width, but I imagine they are out there somewhere, and if not there could be.
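To make the unpacking idea a bit more concrete, here is a minimal Python sketch that splits a concatenated description string on semicolons. The string is the dc:description from the Portrait of Madame Brunet example in my first post; the field names and their order are my own guesses from that one example, not a documented Getty convention, so treat this as an illustration rather than a parser you could rely on.

```python
# Hypothetical field order, inferred from a single example description.
# These names are assumptions, not part of any Getty or IPTC specification.
FIELDS = [
    "title",
    "creator",
    "place",
    "date",
    "medium",
    "dimensions",
    "inventory_number",
]


def unpack_description(description):
    """Split a semicolon-delimited description into named parts."""
    parts = [part.strip() for part in description.split(";")]
    return dict(zip(FIELDS, parts))


description = (
    "Portrait of Madame Brunet; Édouard Manet, French, 1832 - 1883; "
    "France, Europe; about 1860 -1863, reworked by 1867; Oil on canvas; "
    "Unframed: 132.4 x 100 cm (52 1/8 x 39 3/8 in.), "
    "Framed: 153.7 x 121.9 x 7.6 cm (60 1/2 x 48 x 3 in.); 2011.53"
)

record = unpack_description(description)
for field, value in record.items():
    print(f"{field}: {value}")
```

Splitting a string like this is fragile (a semicolon inside a title would break it), which is part of why it would be better for the structured data to be expressed directly.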

Of course, one could go further, and say there are really three resources: the file, the photograph, and the sculpture.

Added Sculpture

But this could be extra work for the Getty, if they don’t have this level of description yet. The half-step of enriching the description by indicating that it is a photograph of particular dimensions in a particular format seems like a useful thing to do for this example though, especially if they have that structured data already. My particular vocabulary choices (dc, foaf, etc) aren’t important compared to hanging the descriptions off of the right resources.

But, and this is a doozy of a but, other metadata in the RDF suggests that the metadata is being input with Photoshop. So while it is technically possible to embed this metadata in XMP as RDF, it is quite likely that Photoshop doesn’t give you the ability to enter it. In fact, it is fairly common for some image processing applications to strip parts or all of the embedded metadata. So to embed these richer descriptions into the files one might need to write a small program to do it.

There is another place where the metadata could be embedded though. What if the webpage for the item had embedded RDFa or Microdata in it that expressed this information? If they could commit to a stable URL for the item it would be a perfect place for both the human and machine readable description. All they would have to do would be to link the XMP metadata to it somehow, and adjust the templates they are using that drive the HTML display.

watermark woodcut indigo octavo

You know how sometimes you can get ideas for a subject you are interested in by studying a different but related subject? So, strangely enough, I’ve found myself reading about paper conservation. Specifically, at the moment, a book called Books and Documents: dating, permanence and preservation by Julius Grant. It was printed in 1937, so I guess a lot of the material is dated now (haha)…but somehow it’s only making it that much more interesting to read.

There are long sections detailing experiments on paper and ink to determine their composition, in order to roughly estimate when a document was likely to have been created. On pages 41-44 he provides a list of supplementary evidence that can be used.

There are of course many other minor sources of evidence which may prove helpful in establishing the date of a book or document, but to discuss them all in full detail would bring this volume outside its professed scope. It has, however, been thought desirable to summarize the more important of them in the form of a chronological table, which may be used in conjunction with the information on paper and ink already provided.

The list was so delightful, and oddly thought provoking, that I took the time to transcribe it below. I randomly linked some of the terms and names to Wikipedia to ensure you get completely lost.

Seventh century. The first bound books and introduction of quill pens.

863. The oldest printed book known (printed from blocks by Wang Chieh of Kansau, China).

1020. Beginning of the gradual transition from carbon to iron-gall writing inks.

1282. The earliest known watermark.

1307. Names of paper-makers first incorporated into watermarks.

1341. Invention of printing from movable type (by Pi Sheng) in China.

1400 (circ.). Introduction of alum-tanned white pig-skin bindings.

1440 (circ.). Invention of printing from movable type in Europe (Johann Gutenberg, Mainz).

1445-1500. Alternate light and dark striations in the look-through of paper due to construction of the mould.

1454. The first dated publication produced with movable type.

1457. The first book bearing the name of the printer.

1461. The first illustrated book (crude woodcuts).

1463. The first book with a title-page.

1465. The earliest blotting paper (vide infra, 1800) ; this is sometimes found in old books and manuscripts and its presence may help to date them, although of course, the blotting paper may have been inserted subsequently to the date of origin.

1470 (circ.). Great increase in the number of bound books produced, following the advent of printing ; vellum and leather used principally.

1470. The first book with pagination and headlines.

1472. The first book bearing printed signatures to serve as a guide to the binder.

1474. The first book published in English (by William Caxton, in Bruges).

1476. The first work printed in England (by William Caxton).

1483. The first double watermark.

1500. Introduction of the small octavo.

1500. Introduction of Italics.

1536. The first book printed in America.

1545 (circ.). Introduction of custom of using italics only for emphasis. Mineral oil and rosin first used in printing inks.

1560. Introduction of the sexto decimo.

1570. Introduction of the 12mo.

1570. Introduction of thin papers.

1575 (circ.). The first gold-tooling.

1580. Introduction of the modern forms of “i,” “j,” “u,” and “v.”

1580 (circ.). The first pasteboards.

1600 (circ.). Copper-plate illustration sufficiently perfected to replace crude woodcuts. Introduction of red morocco bindings.

1650. Wood covers (covered with silk, plush or tapestry) used for binding.

1670. Introduction of the hollander.

1720. Perfection of the vignette illustration.

1734. Caslon type introduced.

1750 (circ.). The first cloth-backed paper (used only for maps).

1750 (circ.). Gradual disappearance of vellum for binding and introduction of millboard covered with calf; or half-covered with leather and half with marbled paper, etc. The first wove paper (Baskerville).

1763. Logwood inks probably first introduced.

1770. Indigo first used in inks (Eisler).

1780. Steel pens invented.

1796. The first lithographic machine.

1796. The first embossed binding.

1800. Blotting-paper in general use (vide supra, 1465) in England, following an accidental rediscovery at Hagbourne, Berkshire.

1803 (circ.). Metal pens first placed on the market.

1816 (circ.). Coloured inks first manufactured in England using pigments.

1820 (circ.). Linen-canvas first used instead of parchment to hold the back of the book into the cover. Introduction of straight-grained red morocco bindings (see 1600).

1820. The invention of modern type of metal nib.

1825. The first permanent photographic image (Niepce).

1830 (circ.). The first linen cover. Beginning of the era of poor leather bindings which have since deteriorated.

1830. Title printed on paper labels which were stuck on the cloth for the first time.

1835 (circ.). Decoration by machinery introduced.

1836. Introduction of iron-gall inks containing indigo (Stephens).

1839. Invention of photography (Daguerre).

1840. Titles first stamped on cloth.

1845. Linen board cover in common use. At about this time it became usual to trim the edges of books, and the practice of binding in quarter-leather declined.

1852. Invention of photogravure, leading to the development of lithographic etchings, colour prints, line engravings, etc. (Fox-Talbot).

1855. Cotton first used as a cover for binding-boards.

1856. Discovery of the first coal-tar dyestuff (Perkin’s mauve), leading to the use of such dyestuffs in coloured inks.

1860. Beginning of the custom of paring calf binding leathers to the thickness of paper.

1861. Introduction of synthetic indigo for inks.

1878. Invention of the stylographic pen.

1885. Invention of the half-tone process (F. E. Ives).

1905. The first offset litho press.
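One way to think about how a table like this gets used: each feature you observe in a document implies the document can’t be older than the date that feature was introduced, so the latest of those dates is a lower bound. Here is a small Python sketch of that reasoning. The dates come from the table above, but the feature labels and the function are my own illustration, not anything Grant describes.

```python
# Earliest dates for a handful of features, taken from the table above.
# The dictionary keys are my own shorthand labels.
EARLIEST_DATE = {
    "watermark": 1282,
    "blotting paper": 1465,
    "indigo in ink": 1770,
    "coal-tar dyestuff in ink": 1856,
    "half-tone illustration": 1885,
}


def earliest_possible_date(features):
    """Return the latest introduction date among the observed features.

    A document showing all of these features cannot predate the most
    recently introduced one.
    """
    return max(EARLIEST_DATE[feature] for feature in features)


print(earliest_possible_date(["watermark", "indigo in ink"]))
```

This only gives a lower bound, and it assumes the feature is original to the document. As the 1465 entry warns, something like blotting paper may have been inserted later.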

More about the other subject later …


The Side of Pompidou Center
by Frank Kovalchek

Last week Stephen Wolfram opened the FEDLINK Innovation Talk series at the Library of Congress. Wolfram is truly an intellectual maverick (check out his Wikipedia page) and kind of an archetype for what is typically meant by the word genius. I haven’t read A New Kind of Science, or used Wolfram Alpha or Mathematica very much. Perhaps if I knew more of the details behind his thinking I wouldn’t have left the talk feeling a little bit disappointed.

I would have liked to hear him reflect on what motivated him to spend 25 years building a platform to make “knowledge computable”. He clearly had a vision of this work, and it would’ve been fun to hear about it – and where he sees this type of work in 25 years. Perhaps some discussion of whether there were boundaries to making knowledge computable, and if knowledge itself can be thought of without human intention as part of the equation. It would also have been interesting to hear a few more technical details about the platform itself and how it is orchestrated. Maybe I wasn’t listening closely enough, but what we got instead was a lot of pointing, clicking, typing and talking to Wolfram Alpha: showing off how it could answer questions like a good reference librarian–and some quite funny jokes along the way.

But, to be fair, he did mention a few interesting things, especially during the (too brief) question and answer period at the end. Here are the things I took mental note of, that I still remember a week later.

I hadn’t heard of the Computable Document Format before, but I guess it’s been around for a few years. CDF is Wolfram’s own custom file format that makes data visualizations interactive.

… the CDF standard is a computation-powered knowledge container—as everyday as a document, but as interactive as an app.

It seems to work as a browser plugin that bridges to Mathematica. One nice side effect of CDF is that it makes the underlying data available. Another side effect is that any CDF document that is composed with FreeCDF is automatically licensed CC-BY-SA. You can also pay for EnterpriseCDF, which then provides more licensing options, as well as the ability to add what looks like DRM to CDF documents. The documentation talks about it being a standard, and Wikipedia says it has an assigned MIME type (application/cdf), but I can’t seem to find a specification for it, or even a registration of the MIME type at the IETF. Considering the level of interactivity that documents have on the Web now, with open tools like D3 that sit on top of features from HTML 5 and JavaScript, it’s hard to get terribly excited about CDF.

Wolfram also mentioned work on something he called the Wolfram Data Format. I can’t seem to find much information about it on the Web. It sounded like something akin to Resource Description Framework, for describing entities and their attributes, and relations … and seemed to primarily be used for getting data into and out of the knowledgebase that Wolfram Alpha sits on top of. During the Q/A session someone asked about Wolfram’s views on Linked Data, and he knew enough about it to say that RDF wasn’t expressive enough for his needs. He wasn’t terribly clear on how it wasn’t expressive enough: I remember an example about needing to express the position of Mercury and Venus at various points in a concise way. In my experience I’ve found that RDF gave me plenty of rope.

There is a Pro version of Wolfram Alpha that lets you do a variety of things you can’t do in the free version. The most interesting one of these is that it lets you upload your own data in a bunch of different formats for analysis by Alpha. Presumably this data could be added to the Wolfram Alpha knowledgebase, and help form what Wolfram called the Wolfram Repository.

The R word is pretty charged at my place of work, and I imagine it might be at yours too. Collectively, many dollars have been spent creating systems, certification guidelines, and research about what the digital repository might be. As with many e-this and e-that words, the word digital doesn’t really add so much to the meaning of digital repository as repository does. Wolfram Alpha defines repository as:

A facility where things can be deposited for storage and safekeeping

Historically, repositories of knowledge have been found in the form of libraries, archives and museums that are sometimes part of larger institutions like schools, universities, societies, governments, businesses, or personal collections. So Wolfram wants the research community to use Wolfram Alpha as a repository. The carrot here is that data that is uploaded to the Wolfram Repository will be directly citable in CDF documents. During his response about Linked Data, Wolfram commented on how often URLs break, and how they weren’t suitable for linking papers to their underlying data. The solution that he seemed to propose is that data would be citable as long as writers used his document format, editing tools, and repository.

Indeed, when asked about the role of libraries and the library profession Wolfram responded saying that in his view the role of the librarian will be to help educate people who have data, to help make it computable, by massaging it into the correct format. What he didn’t say (but I heard) was that the correct format was WDF, and that it would be made computable by pushing it into the Wolfram Alpha data repository.

Don’t get me wrong, I think his vision of a future for libraries that help researchers work with data is a compelling one. It’s an extension of a trend over the past 10-15 years where libraries have built statistical, textual or geographic data collections, that are made available with educational services around them. Certainly getting data into and out of Wolfram Alpha, and making it citable by CDF documents could be a component of this work.

But what was missing from Wolfram’s presentation was a vision for how we build data repositories collaboratively, across cultural, corporate and socio-political borders. There were glimpses of an amazing system that he has built, with algorithms and meta-algorithms for choosing them … but it wasn’t clear how to add your own algorithms, to introspect on the decisions that were being made, and see the sources of data that were used in its computations.

Above all, I didn’t hear Wolfram describe how his platform includes the Web as an essential part of its architecture. I know I’m biased towards the Web, but Tim Berners-Lee’s enduring insight is that the design of the Web needed (and still needs) to be open. Sometimes open systems can seem ugly (hence the picture of the Pompidou Center above) since they show you the guts of things. Occasionally things can get nasty when parties have opposing interests. But it’s extremely important to try. How do we build a future where libraries, archives and museums collect locally and build repositories of data for systems like Wolfram Alpha, Wikipedia, Google’s Knowledge Graph and Facebook’s Open Graph in a sustainable way? Libraries aren’t well-equipped to build these types of systems themselves, since the state of the art is always changing. But these institutions ought to be in a good position to serve as trusted partners, tied to the interests of particular knowledge communities, that can help make data available to systems like Wolfram Alpha.

Gendered Archivist

Over the past few years I’ve been trying to deepen my understanding of the literature of and about archives. My own MLIS education was heavy on libraries and light on archives; so I was really quite unaware of how rich the thinking about archives is…and how much more relevant it is for the work of digital preservation.

After not being a member of any professional organization for over ten years I joined the Society of American Archivists two years ago. I really enjoyed when the SAA’s quarterly American Archivist started showing up in my mailbox. Incidentally they have put all their content online for the public, but keep the last 3 years embargoed for SAA members only.

Since I have so much catching up to do I thought it would be interesting to try to harvest some of the article metadata that could be gleaned from the website, to see if I could get my computer to teach me something about the 76 years of content. If you are interested you can find some code I wrote to do this, and the resulting metadata about the 42,432 articles on Github.

As a quick test I thought it would be interesting to throw the first names of authors through genderator to see if the gender of authors has changed over time. My first pass just displays the number of authors per year by their gender.

Since the number of authors per article isn’t constant, and the number of articles per year is also variable the graph is a bit noisy. But if you calculate the percentage of authors per year that were male, female or unknown you get a much smoother graph.

As you can see genderator isn’t perfect: in some years it can’t guess the gender of as many as 20% of the authors. But even with that noise it’s easy to see a gradual increase in the number of women authors, which begins in the 1970s and continues today, where women seem to be represented more than men … although it’s a bit too choppy to tell really.
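For the curious, the per-year percentage calculation is simple enough to sketch in a few lines of Python. This is an illustration rather than the code in my repository: the (year, gender) pairs stand in for whatever genderator returns for each author, and the sample records are made up purely to show the shape of the input.

```python
from collections import Counter, defaultdict

GENDERS = ("male", "female", "unknown")


def gender_percentages(authors):
    """Turn (year, gender) pairs into per-year percentages by gender."""
    counts = defaultdict(Counter)
    for year, gender in authors:
        counts[year][gender] += 1

    percentages = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        percentages[year] = {
            gender: 100.0 * counts[year][gender] / total for gender in GENDERS
        }
    return percentages


# Made-up sample records, only to show the input format.
sample = [
    (1970, "male"),
    (1970, "male"),
    (1970, "female"),
    (1970, "unknown"),
    (2010, "female"),
    (2010, "female"),
]

result = gender_percentages(sample)
for year, shares in result.items():
    print(year, shares)
```

Normalizing by the number of authors in each year is what smooths the graph, since it removes the effect of some years simply having more articles or more co-authors.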

If you are interested in using this data let me know. I have the publicly available PDF content in an s3 bucket if you have research you’d like to do on it.

Fresh Data

In his talk Secrecy, Archives and the Public Interest in 1970 Howard Zinn famously challenged professional archivists to realize the role of politics in their work. His talk included 7 points of criticism, which are still so relevant today, but the last two really moved me to transcribe and briefly comment on them here:

  6. That the emphasis is on the past over the present, on the antiquarian over the contemporary; on the non-controversial over the controversial; the cold over the hot. What about the transcripts of trials? Shouldn’t these be made easily available to the public? Not just important trials like the Chicago Conspiracy Trial I referred to, but the ordinary trials of ordinary persons, an important part of the record of our society. Even the extraordinary trials of extraordinary persons are not available, but perhaps they do not show our society at its best. The trial of the Catonsville 9 would be lost to us if Father Daniel Berrigan had not gone through the transcript and written a play based on it.

  7. That far more resources are devoted to the collection and preservation of what already exists as records, than to recording fresh data: I would guess that more energy and money is going for the collection and publication of the Papers of John Adams than for recording the experiences of soldiers on the battlefront in Vietnam. Where are the interviews of Seymour Hersh with those involved in the My Lai Massacre, or Fred Gardner’s interviews with those involved in the Presidio Mutiny Trial in California, or Wallace Terry’s interviews with black GIs in Vietnam? Where are the recorded experiences of the young Americans in Southeast Asia who quit the International Volunteer Service in protest of American policy there, or of the Foreign Service officers who have quietly left?

What if Zinn were to ask archivists today about contemporary events? While the situation is far from perfect, the Web has allowed phenomena like Wikipedia, Wikileaks, the Freedom of the Press Foundation and many, many others, to emerge, and substantially level the playing field in ways that we are still grappling with. The Web has widened, deepened and amplified traditional journalism. Indeed, electronic communication media like the Web have copying and distribution cooked into their very essence, and make it almost effortless to share information. Fresh data, as Zinn presciently calls it, is what the Web is about; and the Internet that the Web is built on allows us to largely route around power interests…except, of course, when it doesn’t.

Strangely, I think if Zinn were talking to archivists today he would be asking them to think seriously about where this content will be in 20 years–or maybe even one year. How do we work together as professionals to collect the stuff that needs saving? The Internet Archive is awesome…it’s simply amazing what such a small group of smart people have been able to do. But this is a heavy weight for them to bear alone, and lots of copies keeps stuff safe right? Where are the copies? Yes there is the IIPC, but can we just assume this job is just being taken care of? What web content is being collected? How do we decide what is collected? How do we share our decisions with others so that interested parties can fill in gaps they are interested in? Maybe I’m just not in the know, but it seems like there’s a lot of (potentially fun) work to do.

metadata from Getty's Open Content Program

Lori Phillips recently mentioned on the open-glam discussion list that Getty are starting to make high-resolution images of some of their public domain material available as part of their Open Content Program. The announcement mentions that metadata is included in each file, so I thought I’d take a look. It’s Adobe Extensible Metadata Platform (XMP) aka RDF. Here’s an example I pulled from The Portrait of Madame Brunet (I apologize if reading RDF as Turtle burns your eyes, but I find it much easier to read than the equivalent RDF/XML).

@prefix rdf:  .
@prefix Iptc4xmpCore:  .
@prefix Iptc4xmpExt:  .
@prefix dc:  .
@prefix photoshop:  .
@prefix xmpRights:  .

    Iptc4xmpCore:CreatorContactInfo [
        Iptc4xmpCore:CiAdrCity "Los Angeles" ;
        Iptc4xmpCore:CiAdrCtry "United States" ;
        Iptc4xmpCore:CiAdrExtadr "1200 Getty Center Drive" ;
        Iptc4xmpCore:CiAdrPcode "90049" ;
        Iptc4xmpCore:CiAdrRegion "California" ;
        Iptc4xmpCore:CiEmailWork "" ;
        Iptc4xmpCore:CiUrlWork ""
    ] ;
    Iptc4xmpExt:ArtworkOrObject [
        a rdf:Bag ;
        rdf:_1 [
            Iptc4xmpExt:AOCreator [
                a rdf:Seq ;
                rdf:_1 "Édouard Manet"
            ] ;
            Iptc4xmpExt:AOSource "The J. Paul Getty Museum, Los Angeles" ;
            Iptc4xmpExt:AOSourceInvNo "2011.53" ;
            Iptc4xmpExt:AOTitle [
                a rdf:Alt ;
                rdf:_1 "Portrait of Madame Brunet"@x-default
            ]
        ]
    ] ;
    photoshop:Source "The J. Paul Getty Museum, Los Angeles" ;
    xmpRights:UsageTerms [
        a rdf:Alt ;
        rdf:_1 ""@x-default
    ] ;
    dc:creator [
        a rdf:Seq ;
        rdf:_1 "The J. Paul Getty Museum" 
    ] ;
    dc:date [
        a rdf:Seq ;
        rdf:_1 "2013-06-30T15:14:52"
    ] ;
    dc:description [
        a rdf:Alt ;
        rdf:_1 "Portrait of Madame Brunet; Édouard Manet, French, 1832 - 1883; France, Europe; about 1860 -1863, reworked by 1867; Oil on canvas; Unframed: 132.4 x 100 cm (52 1/8 x 39 3/8 in.), Framed: 153.7 x 121.9 x 7.6 cm (60 1/2 x 48 x 3 in.); 2011.53"@x-default
    ] ;
    dc:title [
        a rdf:Alt ;
        rdf:_1 "Portrait of Madame Brunet"@x-default
    ] .

The description is kind of heavy on information about the Getty, but light on information about the painting. For example it’s not clear from the metadata that this is even a painting. You can see from the HTML detail page that there is a fair bit more about it available:

Édouard Manet (French, 1832 - 1883)
Portrait of Madame Brunet, about 1860 -1863, reworked by 1867, Oil on canvas
Unframed: 132.4 x 100 cm (52 1/8 x 39 3/8 in.)
Framed: 153.7 x 121.9 x 7.6 cm (60 1/2 x 48 x 3 in.)
The J. Paul Getty Museum, Los Angeles

Still, it’s an incredible step forward to see these high resolution images being made available on the Web for download.

Yu Too?

Sorry Please Thank You: Stories by Charles Yu
My rating: 5 of 5 stars

These short stories were just awesome – funny, smart, incisive, light, beautifully distracting. I must read more Yu now. There were so many good stories in this collection: all of them hauntingly familiar and strangely different. I couldn’t shake a feeling of synesthesia, like I was reading along to my favorite Radiohead songs – elated, disoriented, integrated, fragmented, jettisoned, ordinary, cute and more than a bit scary.

A quote from the story Adult Contemporary

Murray tries to see what Rick is talking about, but all he sees is a kind of factory. A manufacturing process for a way of life. Taking anything, experience, a piece of experimental stuff, a particle of particularity, a sound, a day, a song, a bunch of stuff that happens to people, a thing, that makes you laugh, a visual, a feeling, whatever. A mess. A blob. A chunk. A messy, blobby, chunky glob of stuff. Unformed, raw non-content that gets engineered, honed, and refined until some magical point where it has been processed to sufficient smoothness and can be extruded from the machine: content. A chunk of content, homogenous and perfect for slicing up into Content Units. All of this for the customer-citizens, who demand it, or not even demand it but come to expect it, or not even expect it, as that would require awareness of any alternative to the substitute, an understanding that this was not always so, that, once upon a time, there was the real thing. They don’t demand it or expect it. They assume it. The product is not a product, it’s built into the very notion of who they are. Content Units everywhere, all of it coming from the same source: jingles, news, ads. Ads, ads, ads. Ads running on every possible screen. Screens at the grocery store, in the coffee line, on the food truck, in your car, on top of taxis, on the sides of buses, in the air, on the street signs, in your office, in the lobby, in the elevator, in your pocket, in your home. Content pipelines productive as ever, churning and chugging, pumping out the content day and night, conceptual smokestacks billowing out content-manufacturing waste product emissions, marginal unit cost of content dropping every day, content just piling up, containers full, warehouses full, cargo ships full, the channels stuffed to bursting with content. So much content that they needed to make new markets just to find a place to put all of it, had to create the Town, and after that, another Town, and beyond that, who knew? 
What were the limits for American Entertainments Inc., and its managed-narrative experiential lifestyle products? How big could the Content Factory get?

I really just wanted to transcribe that–to type it in, and pretend that I wrote it. If typing is writing I guess I did write it. Yu’s voice is infectious, and familiar – like he’s telling you things you already know, but in an interesting way you’ve never really considered before. I’ll give you my copy if you want it. Just send me an email with your mailing address in it. Seriously.


Notes on Social Mobilization and the Networked Public Sphere

If the ecosystem of the Web and graph visualizations are your thing, Yochai Benkler & co’s recent Social Mobilization and the Networked Public Sphere: Mapping the SOPA-PIPA Debate has lots of eye candy for you. I was going to include some here to entice you over there, but it doesn’t seem to be available under terms that would allow for that (sigh).

You probably remember how Wikipedia went dark to protest the Stop Online Piracy Act, and how phone bridges in legislative offices melted down. You may also have watched Aaron Swartz’s moving talk about the role that Demand Progress played in raising awareness about the legislation, and mobilizing people to act. If you haven’t, go watch it now.

Benkler’s article really digs in to take a close look at how the “attention backbone” provided by non-traditional media outlets allowed much smaller players to coordinate and cooperate across the usual political boundaries to disrupt and ultimately defeat some of the most powerful lobby groups and politicians in the United States. I’ll risk a quote from the conclusion:

Perhaps the high engagement of young, net-savvy individuals is only available for the politics of technology; perhaps copyright alone is sufficiently orthogonal to traditional party lines to traverse the left-right divide; perhaps Go Daddy is too easy a target for low-cost boycotts; perhaps all this will be easy to copy in the next cyber-astroturf campaign. Perhaps.

But perhaps SOPA-PIPA follows William Gibson’s “the future is already here—it’s just not very evenly distributed.” Perhaps, just as was the case with free software that preceded widespread adoption of peer production, the geeks are five years ahead of a curve that everyone else will follow. If so, then SOPA-PIPA provides us with a richly detailed window into a more decentralized democratic future, where citizens can come together to overcome some of the best-funded, best-connected lobbies in Washington, DC.

Obviously, I don’t know what the future holds. But I hope Benkler’s hunch is right, and that we have just started to see how the Web can shift political debate into a much more interactive and productive mode.

I’ve just come off of a stint of reading Bruno Latour, so the data driven network analysis and identification of human and non-human actors participating in a controversy that needed mapping really struck a chord with me. I felt like I wandered across an extremely practical, relevant and timely example of how Latour’s ideas can be put into practice, in a domain I work with every day–the Web.

But I must admit the article was most interesting to me because of some of the technical aspects of how the work was performed. The appendix at the end of the paper provides a very lucid description of how they used a tool called MediaCloud to collect, house and analyze the data that formed the foundation of their study. MediaCloud is an open source project developed at the Berkman Center and is available on GitHub. MediaCloud allows researchers to collect material from the Web, and then perform various types of analysis on it. Conceptually it’s in a similar space as the (recently IMLS funded) Social Feed Manager that Dan Chudnov is working on at George Washington University, now with help from Yale, the University of North Texas and the Center for Jewish History. The appendix is an excellent picture of the sorts of questions we can expect social science researchers to ask of data collections–thankfully absent the distracting term big data. What’s important are the questions they asked, and how they went about answering them – not whether they were keyword compliant.

A few things about MediaCloud and its use that stood out for me:


Even after all these years, syndication technologies like RSS and Atom continue to be useful. Somewhat paradoxically the death of Google Reader and the increasing awareness of the importance of the indieweb seem to have ushered in a sort of golden age of RSS. Or at least there is increased awareness of its role. I think it’s important for content providers to know that simple technologies like RSS still perform an important function in this age of fit-for-purpose Web APIs.


Time is a very important dimension in this sort of research. Benkler’s analysis hinges on the view of the graph over time as the controversy of the SOPA debate evolved. It’s not just a picture of what the graph of content looks like afterwards. It’s only when looking at it over time that the various actors are revealed. Getting the time out of RSS feeds wasn’t so bad, but the lather-rinse-repeat nature of finding more stories and media outlets on the web meant that they had to come up with some heuristics for determining where the content should be situated in time.
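To make the two cases concrete for myself, here is a rough Python sketch. It is not how MediaCloud does it. Reading `pubDate` out of an RSS 2.0 item is the easy half; the fallback that looks for a `/YYYY/MM/DD/` pattern in a story URL is purely my own illustrative guess at the kind of heuristic one might reach for, and the names `dates_from_rss` and `guess_date_from_url` are made up for this post.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Hypothetical heuristic: many blogs and news sites embed /YYYY/MM/DD/ in URLs.
URL_DATE_RE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def dates_from_rss(rss_xml: str) -> dict:
    """Map each item's link to its pubDate, for items that carry both."""
    root = ET.fromstring(rss_xml)
    dates = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if link and pub:
            # RSS 2.0 dates use the RFC 822 format that email.utils understands.
            dates[link.strip()] = parsedate_to_datetime(pub.strip())
    return dates


def guess_date_from_url(url: str) -> Optional[datetime]:
    """Fallback for stories found by crawling: look for a date in the URL path."""
    match = URL_DATE_RE.search(url)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:
        # Digits that looked like a date but are not one (e.g. month 13).
        return None
```

A URL-based guess only gives day-level precision, so in practice I would expect it to be one signal among several rather than the whole answer.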


Similarly, not all content and links on pages are relevant for this sort of analysis. Advertisements and boilerplate headers/footers around content can often add unwanted noise to the analysis. The authors drew on some research from 2007 on HTML density to extract just the salient bits of content in the HTML representation. Services like Readability have algorithms and heuristics for doing similar things, but I hadn’t heard it described so succinctly before as it was in Benkler’s article. The more dense the HTML tags, the less likely it is that the text is the primary content of the page. Truth be told, I’m writing this post for my future self when I go looking for this technique. I think I found the MediaCloud function for generating the HTML density. It might be nice to spin this out into a separate reusable piece…
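For that future self, here is a minimal Python sketch of the idea as I understand it. It is not a port of the MediaCloud function. I am assuming that density is measured per line as the share of characters sitting inside tags, and the 0.5 cutoff is an arbitrary number I picked for illustration; the names `html_density` and `extract_content` are my own.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <p> or <a href="...">.
TAG_RE = re.compile(r"<[^>]*>")


def html_density(line: str) -> float:
    """Fraction of a line's characters that sit inside HTML tags."""
    stripped = line.strip()
    if not stripped:
        return 1.0  # treat blank lines as pure markup so they are never kept
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(stripped))
    return tag_chars / len(stripped)


def extract_content(html: str, threshold: float = 0.5) -> list[str]:
    """Keep the text of lines whose tag density is below the threshold."""
    kept = []
    for line in html.splitlines():
        if html_density(line) < threshold:
            text = TAG_RE.sub("", line).strip()
            if text:
                kept.append(text)
    return kept
```

A navigation line like `<div><a href='/x'>Home</a></div>` is mostly markup and gets dropped, while a paragraph of prose wrapped in a single `<p>` survives. Real pages do not break neatly on line boundaries, so a serious version would presumably need more than this.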


Speaking of code, according to GitHub, MediaCloud is 85.9% Perl and 10.8% JavaScript, and was coded up mainly by David Larouchelle, Linas Valiukas and Hal Roberts. There is a fair bit of text munging going on in MediaCloud, so I shouldn’t be surprised to see Perl so heavily used…but I was. I don’t reach for Perl much these days, but I imagine there’s no shortage of demand in certain circles (nudge, nudge, wink, wink, say no more) for Perl hackers. Larry Wall was a linguist, and Perl’s roots are in this sort of linguistic analysis cooked together with the medium of the Web – it’s nice to see some constants in the software development world as it so often froths over with the latest tech trend or meme.

So, it was a good read. Highly recommended. 5 stars. Wealth of Networks is now in my queue.