Wolfram


The Side of Pompidou Center, by Frank Kovalchek

Last week Stephen Wolfram opened the FEDLINK Innovation Talk series at the Library of Congress. Wolfram is truly an intellectual maverick (check out his Wikipedia page) and kind of an archetype for what is typically meant by the word genius. I haven’t read A New Kind of Science, or used Wolfram Alpha or Mathematica very much. Perhaps if I knew more of the details behind his thinking I wouldn’t have left the talk feeling a little bit disappointed.

I would have liked to hear him reflect on what motivated him to spend 25 years building a platform to make “knowledge computable”. He clearly had a vision of this work, and it would’ve been fun to hear about it – and where he sees this type of work in 25 years. Perhaps some discussion of whether there were boundaries to making knowledge computable, and if knowledge itself can be thought of without human intention as part of the equation. It would also have been interesting to hear a few more technical details about the platform itself and how it is orchestrated. Maybe I wasn’t listening closely enough, but what we got instead was a lot of pointing, clicking, typing and talking to Wolfram Alpha: showing off how it could answer questions like a good reference librarian–and some quite funny jokes along the way.

But, to be fair, he did mention a few interesting things, especially during the (too brief) question and answer period at the end. Here are the things I took mental note of, that I still remember a week later.

I hadn’t heard of the Computable Document Format before, but I guess it’s been around for a few years. CDF is Wolfram’s own custom file format that makes data visualizations interactive.

… the CDF standard is a computation-powered knowledge container—as everyday as a document, but as interactive as an app.

It seems to work as a browser plugin that bridges to Mathematica. One nice side effect of CDF is that it makes the underlying data available. Another side effect is that any CDF document composed with FreeCDF is automatically licensed CC-BY-SA. You can also pay for EnterpriseCDF, which provides more licensing options as well as the ability to add what looks like DRM to CDF documents. The documentation talks about it being a standard, and Wikipedia says it has an assigned MIME type (application/cdf), but I can’t seem to find a specification for it, or even a registration of the MIME type at the IETF. Considering the level of interactivity that documents already have on the Web, with open tools like D3 that sit on top of HTML5 and JavaScript, it’s hard to get terribly excited about CDF.

Wolfram also mentioned work on something he called the Wolfram Data Format. I can’t seem to find much information about it on the Web. It sounded like something akin to the Resource Description Framework (RDF), for describing entities, their attributes, and relations … and it seemed to be used primarily for getting data into and out of the knowledgebase that Wolfram Alpha sits on top of. During the Q&A session someone asked about Wolfram’s views on Linked Data, and he knew enough about it to say that RDF wasn’t expressive enough for his needs. He wasn’t terribly clear on how it wasn’t expressive enough: I remember an example about needing to express the positions of Mercury and Venus at various points in time in a concise way. In my experience I’ve found that RDF gave me plenty of rope.

There is a Pro version of Wolfram Alpha that lets you do a variety of things you can’t do in the free version. The most interesting one of these is that it lets you upload your own data in a bunch of different formats for analysis by Alpha. Presumably this data could be added to the Wolfram Alpha knowledgebase, and help form what Wolfram called the Wolfram Repository.

The R word is pretty charged at my place of work, and I imagine it might be at yours too. Collectively, many dollars have been spent creating systems, certification guidelines, and research about what the digital repository might be. As with many e-this and e-that words, the word digital doesn’t really add so much to the meaning of digital repository as repository does. Wolfram Alpha defines repository as:

A facility where things can be deposited for storage and safekeeping

Historically, repositories of knowledge have been found in the form of libraries, archives and museums that are sometimes part of larger institutions like schools, universities, societies, governments, businesses, or personal collections. So Wolfram wants the research community to use Wolfram Alpha as a repository. The carrot here is that data that is uploaded to the Wolfram Repository will be directly citable in CDF documents. During his response about Linked Data, Wolfram commented on how often URLs break, and how they weren’t suitable for linking papers to their underlying data. The solution that he seemed to propose is that data would be citable as long as writers used his document format, editing tools, and repository.

Indeed, when asked about the role of libraries and the library profession Wolfram responded saying that in his view the role of the librarian will be to help educate people who have data, to help make it computable, by massaging it into the correct format. What he didn’t say (but I heard) was that the correct format was WDF, and that it would be made computable by pushing it into the Wolfram Alpha data repository.

Don’t get me wrong, I think his vision of a future for libraries that help researchers work with data is a compelling one. It’s an extension of a trend over the past 10-15 years in which libraries have built statistical, textual and geographic data collections and made them available with educational services around them. Certainly getting data into and out of Wolfram Alpha, and making it citable in CDF documents, could be a component of this work.

But what was missing from Wolfram’s presentation was a vision for how we build data repositories collaboratively, across cultural, corporate and socio-political borders. There were glimpses of an amazing system that he has built, with algorithms and meta-algorithms for choosing them … but it wasn’t clear how to add your own algorithms, to introspect on the decisions that were being made, and see the sources of data that were used in its computations.

Above all, I didn’t hear Wolfram describe how his platform includes the Web as an essential part of its architecture. I know I’m biased towards the Web, but Tim Berners-Lee’s enduring insight is that the design of the Web needed (and still needs) to be open. Sometimes open systems can seem ugly (hence the picture of the Pompidou Center above) since they show you the guts of things. Occasionally things can get nasty when parties have opposing interests. But it’s extremely important to try. How do we build a future where libraries, archives and museums collect locally and build repositories of data for systems like Wolfram Alpha, Wikipedia, Google’s Knowledge Graph and Facebook’s Open Graph in a sustainable way? Libraries aren’t well equipped to build these types of systems themselves; the state of the art is always changing. But these institutions ought to be in a good position to serve as trusted partners, tied to the interests of particular knowledge communities, who can help make data available to systems like Wolfram Alpha.


Gendered Archivist

Over the past few years I’ve been trying to deepen my understanding of the literature of and about archives. My own MLIS education was heavy on libraries and light on archives, so I was really quite unaware of how rich the thinking about archives is … and how much more relevant it is for the work of digital preservation.

After not being a member of any professional organization for over ten years, I joined the Society of American Archivists two years ago. I really enjoyed it when the SAA’s quarterly American Archivist started showing up in my mailbox. Incidentally, they have put all their content online for the public, but keep the last three years embargoed for SAA members only.

Since I have so much catching up to do, I thought it would be interesting to try to harvest some of the article metadata that could be gleaned from the website, to see if I could get my computer to teach me something about the 76 years of content. If you are interested you can find some code I wrote to do this, and the resulting metadata about the 42,432 articles, on GitHub.
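The scraping itself was pretty mundane: walk the issue pages and pull out the article titles and authors. The real code is on GitHub; the sketch below is just the general shape of it, and the base URL and CSS selectors are made up for illustration.

# A rough sketch of the harvesting approach, not the actual script on GitHub.
# The base URL and the CSS selectors below are hypothetical.
import csv
import requests
from bs4 import BeautifulSoup

BASE = "http://example.com/american-archivist"  # hypothetical

def articles_for_issue(issue_url):
    soup = BeautifulSoup(requests.get(issue_url).text, "html.parser")
    for item in soup.select("div.article"):  # hypothetical selector
        yield {
            "title": item.select_one("a.title").get_text(strip=True),
            "authors": "; ".join(a.get_text(strip=True)
                                 for a in item.select("span.author")),
        }

with open("articles.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["title", "authors"])
    writer.writeheader()
    for year in range(1938, 2014):  # American Archivist began in 1938
        for row in articles_for_issue("%s/%s" % (BASE, year)):
            writer.writerow(row)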

As a quick test I thought it would be interesting to throw the first names of authors through genderator to see if the gender of authors has changed over time. My first pass just displays the number of authors per year by their gender.

Since the number of authors per article isn’t constant, and the number of articles per year is also variable, the graph is a bit noisy. But if you calculate the percentage of authors per year who were male, female or unknown you get a much smoother graph.
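For the record, that second pass is nothing more than turning per-year counts into percentages, something like the sketch below. Here guess_gender() is just a stand-in for whatever genderator returns; its real API may differ.

# Sketch of normalizing per-year author gender counts into percentages.
# guess_gender() is a stand-in for genderator, whose actual API may differ.
from collections import Counter, defaultdict

def guess_gender(first_name):
    # hypothetical stand-in: return "male", "female" or "unknown"
    return "unknown"

def percentages_by_year(authors):
    """authors is a list of (year, first_name) tuples."""
    counts = defaultdict(Counter)
    for year, first_name in authors:
        counts[year][guess_gender(first_name)] += 1
    result = {}
    for year, c in sorted(counts.items()):
        total = sum(c.values())
        result[year] = {g: 100.0 * c[g] / total
                        for g in ("male", "female", "unknown")}
    return result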

As you can see, genderator isn’t perfect: in some years it can’t even guess the author’s gender for 20% of the names. But even with that noise it’s easy to see a gradual increase in the number of women authors, which begins in the 1970s and continues to today, where women seem to be represented more than men … although it’s a bit too choppy to tell, really.

If you are interested in using this data let me know. I have the publicly available PDF content in an S3 bucket if there is research you’d like to do on it.


Fresh Data

In his 1970 talk Secrecy, Archives and the Public Interest, Howard Zinn famously challenged professional archivists to recognize the role of politics in their work. His talk included seven points of criticism, which are still so relevant today, but the last two really moved me to transcribe and briefly comment on them here:

  6. That the emphasis is on the past over the present, on the antiquarian over the contemporary; on the non-controversial over the controversial; the cold over the hot. What about the transcripts of trials? Shouldn’t these be made easily available to the public? Not just important trials like the Chicago Conspiracy Trial I referred to, but the ordinary trials of ordinary persons, an important part of the record of our society. Even the extraordinary trials of extraordinary persons are not available, but perhaps they do not show our society at its best. The trial of the Catonsville 9 would be lost to us if Father Daniel Berrigan had not gone through the transcript and written a play based on it.

  7. That far more resources are devoted to the collection and preservation of what already exists as records, than to recording fresh data: I would guess that more energy and money is going for the collection and publication of the Papers of John Adams than for recording the experiences of soldiers on the battlefront in Vietnam. Where are the interviews of Seymour Hersh with those involved in the My Lai Massacre, or Fred Gardner’s interviews with those involved in the Presidio Mutiny Trial in California, or Wallace Terry’s interviews with black GIs in Vietnam? Where are the recorded experiences of the young Americans in Southeast Asia who quit the International Volunteer Service in protest of American policy there, or of the Foreign Service officers who have quietly left?

What if Zinn were to ask archivists today about contemporary events? While the situation is far from perfect, the Web has allowed phenomena like Wikipedia, Wikileaks, the Freedom of the Press Foundation and many, many others to emerge, and substantially level the playing field in ways that we are still grappling with. The Web has widened, deepened and amplified traditional journalism. Indeed, electronic communication media like the Web have copying and distribution cooked into their very essence, and make it almost effortless to share information. Fresh data, as Zinn presciently calls it, is what the Web is about; and the Internet that the Web is built on allows us to largely route around power interests…except, of course, when it doesn’t.

Strangely, I think if Zinn were talking to archivists today he would be asking them to think seriously about where this content will be in 20 years–or maybe even one year. How do we work together as professionals to collect the stuff that needs saving? The Internet Archive is awesome…it’s simply amazing what such a small group of smart people has been able to do. But this is a heavy weight for them to bear alone, and lots of copies keep stuff safe, right? Where are the copies? Yes, there is the IIPC, but can we just assume this job is being taken care of? What web content is being collected? How do we decide what is collected? How do we share our decisions with others so that interested parties can fill in gaps they are interested in? Maybe I’m just not in the know, but it seems like there’s a lot of (potentially fun) work to do.


metadata from Getty's Open Content Program


Lori Phillips recently mentioned on the open-glam discussion list that the Getty is starting to make high-resolution images of some of its public domain material available as part of its Open Content Program. The announcement mentions that metadata is included in each file, so I thought I’d take a look. It’s Adobe’s Extensible Metadata Platform (XMP), aka RDF. Here’s an example I pulled from the Portrait of Madame Brunet (I apologize if reading RDF as Turtle burns your eyes, but I find it much easier to read than the equivalent RDF/XML).

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix Iptc4xmpCore: <http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/> .
@prefix Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix photoshop: <http://ns.adobe.com/photoshop/1.0/> .
@prefix xmpRights: <http://ns.adobe.com/xap/1.0/rights/> .


<>
    Iptc4xmpCore:CreatorContactInfo [
        Iptc4xmpCore:CiAdrCity "Los Angeles" ;
        Iptc4xmpCore:CiAdrCtry "United States" ;
        Iptc4xmpCore:CiAdrExtadr "1200 Getty Center Drive" ;
        Iptc4xmpCore:CiAdrPcode "90049" ;
        Iptc4xmpCore:CiAdrRegion "California" ;
        Iptc4xmpCore:CiEmailWork "rights@getty.edu" ;
        Iptc4xmpCore:CiUrlWork "www.getty.edu"
    ] ;
    Iptc4xmpExt:ArtworkOrObject [
        a rdf:Bag ;
        rdf:_1 [
            Iptc4xmpExt:AOCreator [
                a rdf:Seq ;
                rdf:_1 "Édouard Manet"
            ] ;
            Iptc4xmpExt:AOSource "The J. Paul Getty Museum, Los Angeles" ;
            Iptc4xmpExt:AOSourceInvNo "2011.53" ;
            Iptc4xmpExt:AOTitle [
                a rdf:Alt ;
                rdf:_1 "Portrait of Madame Brunet"@x-default
            ]
        ] 
    ] ;
    photoshop:Source "The J. Paul Getty Museum, Los Angeles" ;
    xmpRights:UsageTerms [
        a rdf:Alt ;
        rdf:_1 "http://www.getty.edu/legal/image_request/"@x-default
    ] ;
    dc:creator [
        a rdf:Seq ;
        rdf:_1 "The J. Paul Getty Museum" 
    ] ;
    dc:date [
        a rdf:Seq ;
        rdf:_1 "2013-06-30T15:14:52"
    ] ;
    dc:description [
        a rdf:Alt ;
        rdf:_1 "Portrait of Madame Brunet; Édouard Manet, French, 1832 - 1883; France, Europe; about 1860 -1863, reworked by 1867; Oil on canvas; Unframed: 132.4 x 100 cm (52 1/8 x 39 3/8 in.), Framed: 153.7 x 121.9 x 7.6 cm (60 1/2 x 48 x 3 in.); 2011.53"@x-default
    ] ;
    dc:title [
        a rdf:Alt ;
        rdf:_1 "Portrait of Madame Brunet"@x-default
    ] .
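If you want to poke at the embedded metadata yourself, here is a minimal sketch of one way to do it: scan the downloaded JPEG for the embedded rdf:RDF element and hand it to rdflib. The filename below is hypothetical, and a real XMP tool (exiftool, python-xmp-toolkit) would be more robust than this regex.

# A minimal sketch of pulling the embedded XMP out of one of the Getty JPEGs
# and reading it with rdflib. The filename is hypothetical.
import re
import rdflib

def xmp_graph(path):
    with open(path, "rb") as f:
        data = f.read()
    # XMP is embedded in the JPEG as an XML packet; grab its rdf:RDF element
    match = re.search(rb"<rdf:RDF.*?</rdf:RDF>", data, re.DOTALL)
    if match is None:
        return None
    graph = rdflib.Graph()
    graph.parse(data=match.group(0).decode("utf-8"), format="xml")
    return graph

g = xmp_graph("manet-portrait-of-madame-brunet.jpg")  # hypothetical filename
if g is not None:
    print(g.serialize(format="turtle"))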

The description is kind of heavy on information about the Getty, but light on information about the painting. For example it’s not clear from the metadata that this is even a painting. You can see from the HTML detail page that there is a fair bit more about it available:

Édouard Manet (French, 1832 - 1883)
Portrait of Madame Brunet, about 1860 -1863, reworked by 1867, Oil on canvas
Unframed: 132.4 x 100 cm (52 1/8 x 39 3/8 in.)
Framed: 153.7 x 121.9 x 7.6 cm (60 1/2 x 48 x 3 in.)
The J. Paul Getty Museum, Los Angeles

Still, it’s an incredible step forward to see these high resolution images being made available on the Web for download.


Yu Too?

Sorry Please Thank You: Stories by Charles Yu
My rating: 5 of 5 stars

These short stories were just awesome – funny, smart, incisive, light, beautifully distracting. I must read more Yu now. There were so many good stories in this collection: all of them hauntingly familiar and strangely different. I couldn’t shake a feeling of synesthesia, like I was reading along to my favorite Radiohead songs – elated, disoriented, integrated, fragmented, jettisoned, ordinary, cute and more than a bit scary.

A quote from the story Adult Contemporary:

Murray tries to see what Rick is talking about, but all he sees is a kind of factory. A manufacturing process for a way of life. Taking anything, experience, a piece of experimental stuff, a particle of particularity, a sound, a day, a song, a bunch of stuff that happens to people, a thing, that makes you laugh, a visual, a feeling, whatever. A mess. A blob. A chunk. A messy, blobby, chunky glob of stuff. Unformed, raw non-content that gets engineered, honed, and refined until some magical point where it has been processed to sufficient smoothness and can be extruded from the machine: content. A chunk of content, homogenous and perfect for slicing up into Content Units. All of this for the customer-citizens, who demand it, or not even demand it but come to expect it, or not even expect it, as that would require awareness of any alternative to the substitute, an understanding that this was not always so, that, once upon a time, there was the real thing. They don’t demand it or expect it. They assume it. The product is not a product, it’s built into the very notion of who they are. Content Units everywhere, all of it coming from the same source: jingles, news, ads. Ads, ads, ads. Ads running on every possible screen. Screens at the grocery store, in the coffee line, on the food truck, in your car, on top of taxis, on the sides of buses, in the air, on the street signs, in your office, in the lobby, in the elevator, in your pocket, in your home. Content pipelines productive as ever, churning and chugging, pumping out the content day and night, conceptual smokestacks billowing out content-manufacturing waste product emissions, marginal unit cost of content dropping every day, content just piling up, containers full, warehouses full, cargo ships full, the channels stuffed to bursting with content. So much content that they needed to make new markets just to find a place to put all of it, had to create the Town, and after that, another Town, and beyond that, who knew? What were the limits for American Entertainments Inc., and its managed-narrative experiential lifestyle products? How big could the Content Factory get?

I really just wanted to transcribe that–to type it in, and pretend that I wrote it. If typing is writing I guess I did write it. Yu’s voice is infectious, and familiar – like he’s telling you things you already know, but in an interesting way you’ve never really considered before. I’ll give you my copy if you want it. Just send me an email with your mailing address in it. Seriously.

Also,

https://twitter.com/charles_yu/status/365653971139956737


Notes on Social Mobilization and the Networked Public Sphere

If the ecosystem of the Web and graph visualizations are your thing, Yochai Benkler & co’s recent Social Mobilization and the Networked Public Sphere: Mapping the SOPA-PIPA Debate has lots of eye candy for you. I was going to include some here to entice you over there, but it doesn’t seem to be available under terms that would allow for that (sigh).

You probably remember how Wikipedia went dark to protest the Stop Online Piracy Act, and how phone bridges in legislative offices melted down. You may also have watched Aaron Swartz’s moving talk about the role that Demand Progress played in raising awareness about the legislation, and mobilizing people to act. If you haven’t, go watch it now.

Benkler’s article really digs in to take a close look at how the “attention backbone” provided by non-traditional media outlets allowed much smaller players to coordinate and cooperate across the usual political boundaries to disrupt and ultimately defeat some of the most powerful lobby groups and politicians in the United States. I’ll risk a quote from the conclusion:

Perhaps the high engagement of young, net-savvy individuals is only available for the politics of technology; perhaps copyright alone is sufficiently orthogonal to traditional party lines to traverse the left-right divide; perhaps Go Daddy is too easy a target for low-cost boycotts; perhaps all this will be easy to copy in the next cyber-astroturf campaign. Perhaps.

But perhaps SOPA-PIPA follows William Gibson’s “the future is already here—it’s just not very evenly distributed.” Perhaps, just as was the case with free software that preceded widespread adoption of peer production, the geeks are five years ahead of a curve that everyone else will follow. If so, then SOPA-PIPA provides us with a richly detailed window into a more decentralized democratic future, where citizens can come together to overcome some of the best-funded, best-connected lobbies in Washington, DC.

Obviously, I don’t know what the future holds. But I hope Benkler’s hunch is right, and that we have just started to see how the Web can shift political debate into a much more interactive and productive mode.

I’ve just come off of a stint of reading Bruno Latour, so the data driven network analysis and identification of human and non-human actors participating in a controversy that needed mapping really struck a chord with me. I felt like I wandered across an extremely practical, relevant and timely example of how Latour’s ideas can be put into practice, in a domain I work with every day–the Web.

But I must admit the article was most interesting to me because of some of the technical aspects of how the work was performed. The appendix at the end of the paper provides a very lucid description of how they used a tool called MediaCloud to collect, house and analyze the data that formed the foundation of their study. MediaCloud is an open source project developed at the Berkman Center and is available on GitHub. MediaCloud allows researchers to collect material from the Web, and then perform various types of analysis on it. Conceptually it’s in a similar space to the (recently IMLS funded) Social Feed Manager that Dan Chudnov is working on at George Washington University, now with help from Yale, the University of North Texas and the Center for Jewish History. The appendix is an excellent picture of the sorts of questions we can expect social science researchers to ask of data collections–thankfully absent the distracting term big data. What’s important is the questions they asked, and how they went about answering them – not whether they were buzzword compliant.

A few things about MediaCloud and its use that stood out for me:

RSS

Even after all these years, syndication technologies like RSS and Atom continue to be useful. Somewhat paradoxically, the death of Google Reader and the increasing awareness of the importance of the indieweb seem to have ushered in a sort of golden age of RSS–or at least an increased awareness of its role. I think it’s important for content providers to know that simple technologies like RSS still perform an important function in this age of fit-to-purpose Web APIs.

Time

Time is a very important dimension to this sort of research. Benkler’s analysis hinges on the view of the graph over time as the controversy of the SOPA debate evolved; it’s not just a picture of what the graph of content looks like afterwards. It’s only when looking at it over time that the various actors are revealed. Getting the time out of RSS feeds wasn’t so bad, but the lather-rinse-repeat nature of finding more stories and media outlets on the Web meant that they had to come up with some heuristics for determining where the content should be situated in time.
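Just to make the RSS part of that concrete, here is a tiny sketch (not MediaCloud’s code, which is Perl) of pulling publication dates out of a feed with feedparser, with a naive fallback when an entry has no date. The feed URL is just an example.

# A small illustration of getting story timestamps out of an RSS/Atom feed.
# This is not MediaCloud's implementation; the feed URL is hypothetical.
from calendar import timegm
from datetime import datetime, timezone
import feedparser

feed = feedparser.parse("https://example.com/feed.xml")  # hypothetical feed
for entry in feed.entries:
    parsed = entry.get("published_parsed") or entry.get("updated_parsed")
    if parsed:
        when = datetime.fromtimestamp(timegm(parsed), tz=timezone.utc)
    else:
        # crude fallback heuristic: assume the story was published "now"
        when = datetime.now(timezone.utc)
    print(when.isoformat(), entry.get("link"))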

Density

Similarly, not all of the content and links on a page are relevant for this sort of analysis. Advertisements and boilerplate headers/footers around content can often add unwanted noise to the analysis. The authors drew on some research from 2007 on HTML density to extract just the salient bits of content from the HTML representation. Services like Readability have algorithms and heuristics for doing similar things, but I hadn’t heard it described so succinctly before as it was in Benkler’s article: the more dense the HTML tags, the less likely it is that the text is the primary content of the page. Truth be told, I’m writing this post for my future self when I go looking for this technique. I think I found the MediaCloud function for generating the HTML density. It might be nice to spin this out into a separate reusable piece…
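For my future self, then, here is a toy version of the idea (not the MediaCloud function, and the 0.5 threshold is arbitrary): score each line of HTML by how much of it is markup, and keep the lines that are mostly text.

# A toy version of the HTML-density idea, not MediaCloud's implementation.
# Lines that are mostly markup are probably navigation, ads or boilerplate;
# lines that are mostly plain text are probably the article body.
import re

TAG = re.compile(r"<[^>]*>")

def tag_density(line):
    # fraction of the line's characters that sit inside HTML tags
    tag_chars = sum(len(t) for t in TAG.findall(line))
    return tag_chars / len(line) if line else 1.0

def extract_main_text(html, threshold=0.5):  # threshold is arbitrary here
    keep = []
    for line in html.splitlines():
        if line.strip() and tag_density(line) < threshold:
            keep.append(TAG.sub("", line).strip())
    return "\n".join(keep)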

Perl

Speaking of code, according to GitHub, MediaCloud is 85.9% Perl and 10.8% JavaScript, and was coded up mainly by David Larochelle, Linas Valiukas and Hal Roberts. There is a fair bit of text munging going on in MediaCloud, so I shouldn’t be surprised to see Perl so heavily used…but I was. I don’t reach for Perl much these days, but I imagine there’s no shortage of demand in certain circles (nudge, nudge, wink, wink, say no more) for Perl hackers. Larry Wall was a linguist, and Perl’s roots are in this sort of linguistic analysis cooked together with the medium of the Web – it’s nice to see some constants in the software development world as it so often froths over with the latest tech trend or meme.

So, it was a good read. Highly recommended. 5 stars. Wealth of Networks is now in my queue.


On Snowden and Archival Ethics

Much like you I’ve been watching the evolving NSA Surveillance story following the whistle-blowing by former government contractor Edward Snowden. Watching isn’t really the right word…I’ve been glued to it. I don’t have a particularly unique opinion or observation to make about the leak, or the ensuing dialogue – but I suppose calling it “whistle blowing” best summarizes where I stand. I just wanted to share a thought I had on the train to work, after reading Ethan Zuckerman’s excellent Me and My Metadata - Thoughts on Online Surveillance. I tried to fit it in into 140 characters, but it didn’t quite work.

Zuckerman’s post is basically about the value of metadata in research. He opened up his Gmail archive to his students, and they created Immersion, which lets you visualize his network of correspondence using only email metadata (From, To, Cc and Date). Zuckerman goes on to demonstrate what this visualization says about him. The first comment in the post by Jonathan O’Donnell has a nice list of related research on the importance of metadata to discovery. Zuckerman’s work immediately reminded me of Sudheendra Hangal’s work on MUSE at Stanford, which he and his team have written about extensively. MUSE is a tool that enables scholarly research using email archives. It was then that I realized why I’ve been so fascinated with the Snowden/NSA story.
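The basic move behind a tool like Immersion is easy to illustrate: build a picture of who mails whom using nothing but the headers. Here is a throwaway sketch along those lines (not Immersion’s code, and the mbox path is hypothetical).

# A throwaway sketch of the Immersion idea: summarize who mails whom using
# only the From/To/Cc headers, never the message bodies. The mbox path is
# hypothetical.
import mailbox
from collections import Counter

edges = Counter()
for msg in mailbox.mbox("archive.mbox"):
    sender = (msg.get("From") or "").strip()
    for field in ("To", "Cc"):
        for recipient in (msg.get(field) or "").split(","):
            recipient = recipient.strip()
            if sender and recipient:
                edges[(sender, recipient)] += 1

for (sender, recipient), count in edges.most_common(10):
    print(count, sender, "->", recipient)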

Over the past few years there has been increasing awareness in the archival community about the role of forensics tools in digital preservation, curation and research. Matt Kirschenbaum’s Mechanisms had a big role in documenting, and spreading the word about how forensics tools can be (and are) used in the digital humanities. The CLIR report Digital Forensics and Born-Digital Content in Cultural Heritage Collections (co-authored by Kirschenbaum) brought the topic directly to cultural heritage organizations, as did the AIMS report. If you’re not convinced, a search in Google Scholar shows just how prevalent and timely the topic is. The introduction to the CLIR report has a nice summary of why forensics tools are of interest to archives that are dealing with born digital content:

The same forensics software that indexes a criminal suspect’s hard drive allows the archivist to prepare a comprehensive manifest of the electronic files a donor has turned over for accession; the same hardware that allows the forensics investigator to create an algorithmically authenticated “image” of a file system allows the archivist to ensure the integrity of digital content once captured from its source media; the same data-recovery procedures that allow the specialist to discover, recover, and present as trial evidence an “erased” file may allow a scholar to reconstruct a lost or inadvertently deleted version of an electronic manuscript—and do so with enough confidence to stake reputation and career.

Digital forensics therefore offers archivists, as well as an archive’s patrons, new tools, new methodologies, and new capabilities. Yet as even this brief description must suggest, digital forensics does not affect archivists’ practices solely at the level of procedures and tools. Its methods and outcomes raise important legal, ethical, and hermeneutical questions about the nature of the cultural record, the boundaries between public and private knowledge, and the roles and responsibilities of donor, archivist, and the public in a new technological era.

When collections are donated to an archive, there is usually a gift agreement between the donor and the archival organization which documents how the collection of material can be used. For example, it is fairly common for there to be a period where portions (or all) of the archive are kept dark. Less often, gift agreements stipulate that the collection must be made open on the Web, and sometimes money changes hands. Born digital content in archives is new enough that cultural heritage organizations are still grappling with the best way to talk to their donors about donating born digital content.

There has been a bit of attention to sharing best practices about born digital content between organizations, and rising awareness about the sorts of issues that need to be considered. As a software developer tasked with building applications that can be used across these archival collections, the special-snowflake nature of these gift agreements has been a bit of an annoyance. If every collection of born digital content has slightly different stipulations about what, when and how content can be used, it makes building access applications difficult. The situation is compounded somewhat because the gift agreements themselves aren’t shared publicly (at least at my place of work), so you don’t even know what you can and can’t do. I’ve observed that this has a tendency to derail conversations about access to born digital content–and access is an essential ingredient to ensuring the long term preservation of digital content. It’s not like you can take a digital file, put it on a server, and come back in 25 or even 5 years and expect to open it and use it.

So, what does this have to do with Zuckerman’s post, and the intrinsic value of metadata to the NSA? When Zuckerman provided his students with access to his email archive he did it in the context of a particular trust scenario. A gift agreement in an archive serves the same purpose, by documenting a trust scenario between the donor and the institution that is receiving the gift. The NSA allegedly has been collecting information from Verizon, Facebook, Google, et al outside of the trust scenario provided by the Fourth Amendment to the Constitution. After looking at things this way, the special-snowflakism of gift agreements doesn’t seem so annoying any more. It is through these agreements that cultural heritage organizations establish their authenticity and trust. And it is by them that they become a desirable place to deposit born digital content. If they have to be unique per-donor, and this hampers unified access to born digital collections, this seems like a price worth paying. Ideally there would be a standard set of considerations to use when putting the gift agreement together. But if we can’t fit everyone into the same framework, maybe that’s not such a bad thing.

The other commonplace thing that strikes me is that the same technology that can be used for good, say digital humanities research or forensic discovery, can also be used for ill. Having a strong sense of ethics, as a professional, as a citizen, and as a human being, is extremely important to establishing the context in which technology is used – and negotiating between the three can sometimes require finesse, and in the case of Snowden, courage.


It's your data. It's your life.

I wrote briefly about the Open Science Champions of Change event last week, but almost a week later the impassioned message that Kathy Giusti delivered is still with me. Giusti is the Founder and Chief Executive Officer of the Multiple Myeloma Research Foundation (MMRF), and is herself battling the fatal disease. In her introduction, and later during the panel discussion, she made a strong case for patients to be able to opt-in to open access data sharing. I thought I’d point to these two moments in the 2 hour video stream, and transcribe what she said:

http://www.youtube.com/watch?v=a26cEwbyMGQ#t=1h15m50s

Patients don’t know that data is not shared. They don’t know … If patients knew how long it took to publish, if they knew, it’s your tissue, it’s your data, it’s your life. Believe me, patients would be the first force to start really changing the culture and backing everybody around open access.

http://www.youtube.com/watch?v=a26cEwbyMGQ#t=1h34m54s

Q: A lot of people when they hear about the sharing of clinical data talk about concerns of privacy. How do we start to handle those concerns, and how do we actually encourage patients to contribute their data in meaningful ways to research, so that we can actually continue to drive the successes that we are seeing here?

Giusti: When you’re a patient, and you’re living with a fatal disease, you don’t lie awake and wonder what happens with my data. If patients understand the role they can play in personalizing their own risk taking abilities … We all do this when we work with our banks. There’s certain information that we’re always giving out when we go online, and there’s certain information that we always keep private. And in a future world that’s what patients are going to do. So when you start talking with the patients, and you ask them: “Would you be willing to share your information?” It just depends on the patient, and it depends on how much they would be willing to give. For someone like me, I’m an identical twin, the disease of Myeloma skews in my family, my grandfather had it, my identical twin does not, I would be crazy not to be published … and I’ve done it, and so has my twin … biopsies, whatever we need. Put it in the public domain. I know everybody isn’t going to be like me, but even if you get us half your information we’re making progress, and we can start to match you with the right types of researchers and clinicians that care.


Open Science Champions of Change

I had the opportunity to go to the White House yesterday to attend the Open Science Champions of Change award ceremony. I’m not sure why I was invited, perhaps because I nominated Aaron Swartz for it, and happen to be local. Unfortunately, Aaron didn’t win the award. I guess it would’ve been sad to award it to him posthumously. But it’s a sad story. Whatever the reason, I was sure honored to be there.

It was just amazing to see some of my heroes like Paul Ginsparg (arXiv), David Lipman (PubMed, Genbank) and Jeremiah Ostriker (Sloan Digital Sky Survey) in the same room, and on a panel where they could share ideas about the work they’ve done–and what remains to be done. The event was live streamed and is now available on the White House Youtube channel. The full list of the other amazing recipients and their bios is available here.

http://www.youtube.com/watch?v=a26cEwbyMGQ

So many things were said over the two hours, it’s hard for me to summarize here. But I thought I would jot down the main theme that struck me, absent a lot of the details about the projects that were discussed. Hopefully I can look back later and say, oh wow, I went to that.


During his intro, Jeremiah Ostriker talked about how the Sloan Digital Sky Survey was set up from the beginning to require public data sharing on the Internet. He said that it wasn’t easy, but that they made it work. David Lipman talked humbly about how PubMed and GenBank make all publicly funded research and data available at an astonishing rate: millions of users, and many terabytes of data a day. There was much discussion about how to incentivize scientists to share their research. Lipman pointed out that while there was a history of sharing pre-prints in the physics community (which helped Ginsparg realize arXiv) the biomedical field lacks this culture to some degree. Ginsparg acknowledged this, while pointing out that compelling, new applications that change what it means to do research can mitigate this to some degree.

I don’t remember how it came up, but at one point Ostriker was asked what needed to be done to incentivize more public sharing of research and he responded quickly, simply and with a smile:

People like to follow rules.

I think Ostriker was not only referring to the way he helped set up the Sloan Digital Sky Survey, but also to the proposed legislation, the Fair Access to Science and Technology Research Act (FASTR) or Aaron’s Other Law, which is still pending and in need of support. People kind of laughed a bit when Jack Andraka (whose story is freakin’ amazing) said he was planning to start a petition to bring down the paywalls in front of publicly funded research. He described how his own research was obstructed by these paywalls. He’s wicked smart and just a kid, and has a humorous way of presenting the issues–so a bit of laughter was ok I guess. But Ostriker, who is 76, and Andraka, who is 16, were right on key, given where they were sitting:

The rules need to change. It’s time…there’s still time right?


tiny alien phenomenology review

Alien Phenomenology, or What It’s Like to Be a Thing by Ian Bogost
My rating: 3 of 5 stars

I found this book to be quite accessible and totally incomprehensible at the same time. It was kind of a surreal joy to read. I liked how it flipped the artificial intelligence research agenda of getting machines to think (like people), to getting humans to imagine what it was like to be a thing. I also came to appreciate Bogost’s variation on Latour’s litanies, his so-called tiny ontology. And I really appreciated his emphasis on making things to guide thinking, or philosophical carpentry … and the importance of cultivating a sense of wonder. His use of real examples and case studies to demonstrate his thinking was also very helpful–and sometimes quite humorous. I’m wandering back to Latour to read We Have Never Been Modern based on some discussion of it in this book.

So, in the spirit of tiny ontology here are some random quotes I highlighted on my Kindle:

To be sure, computers often do entail human experience and perception. The human operator views words and images rendered on a display, applies physical forces to a mouse, seats memory chips into motherboard sockets. But not always. Indeed, for the computer to operate at all for us first requires a wealth of interactions to take place for itself. As operators or engineers, we may be able to describe how such objects and assemblages work. But what do they experience? What’s their proper phenomenology? In short, what is it like to be a thing?

Theories of being tend to be grandiose, but they need not be, because being is simple. Simple enough that it could be rendered via screen print on a trucker’s cap. I call it tiny ontology, precisely because it ought not demand a treatise or a tome. I don’t mean that the domain of being is small— quite the opposite, as I’ll soon explain. Rather, the basic ontological apparatus needed to describe existence ought to be as compact and unornamented as possible.

For the ontographer, Aristotle was wrong: nature does not operate in the shortest way possible but in a multitude of locally streamlined yet globally inefficient ways.[41] Indeed, an obsession with simple explanations ought to bother the metaphysician. Instead of worshipping simplicity, OOO embraces messiness. We must not confuse the values of the design of objects for human use, such as doors, toasters, and computers, with the nature of the world itself. An ontograph is a crowd, not a cellular automaton that might describe its emergent operation. An ontograph is a landfill, not a Japanese garden. It shows how much rather than how little exists simultaneously, suspended in the dense meanwhile of being.

Yet once we are done nodding earnestly at Whitehead and Latour, what do we do? We return to our libraries and our word processors. We refine our diction and insert more endnotes. We apply “rigor,” the scholarly version of Tinker Bell’s fairy dust, in adequate quantities to stave off interest while cheating death. For too long, being “radical” in philosophy has meant writing and talking incessantly, theorizing ideas so big that they can never be concretized but only marked with threatening definite articles (“ the political,” “the other,” “the neighbor,” “the animal”). For too long, philosophers have spun waste like a goldfish’s sphincter, rather than spinning yarn like a charka. Whether or not the real radical philosophers march or protest or run for office in addition to writing inscrutable tomes— this is a question we can, perhaps, leave aside. Real radicals, we might conclude, make things. Examples aren’t hard to find, and some even come from scholars who might be willing to call themselves philosophers.
