On Snowden and Archival Ethics

Much like you I’ve been watching the evolving NSA surveillance story following the whistle-blowing by former government contractor Edward Snowden. Watching isn’t really the right word…I’ve been glued to it. I don’t have a particularly original opinion or observation to make about the leak, or the ensuing dialogue – but I suppose calling it “whistle-blowing” best summarizes where I stand. I just wanted to share a thought I had on the train to work, after reading Ethan Zuckerman’s excellent Me and My Metadata - Thoughts on Online Surveillance. I tried to fit it into 140 characters, but it didn’t quite work.

Zuckerman’s post is basically about the value of metadata in research. He opened up his Gmail archive to his students, and they created Immersion, which lets you visualize his network of correspondence using only email metadata (From, To, Cc and Date). Zuckerman goes on to demonstrate what this visualization says about him. The first comment in the post by Jonathan O’Donnell has a nice list of related research on the importance of metadata to discovery. Zuckerman’s work immediately reminded me of Sudheendra Hangal’s work on MUSE at Stanford, which he and his team have written about extensively. MUSE is a tool that enables scholarly research using email archives. It was then that I realized why I’ve been so fascinated with the Snowden/NSA story.

Over the past few years there has been increasing awareness in the archival community about the role of forensics tools in digital preservation, curation and research. Matt Kirschenbaum’s Mechanisms had a big role in documenting and spreading the word about how forensics tools can be (and are) used in the digital humanities. The CLIR report Digital Forensics and Born-Digital Content in Cultural Heritage Collections (co-authored by Kirschenbaum) brought the topic directly to cultural heritage organizations, as did the AIMS report. If you’re not convinced, a search in Google Scholar shows just how prevalent and timely the topic is. The introduction to the CLIR report has a nice summary of why forensics tools are of interest to archives that are dealing with born digital content:

The same forensics software that indexes a criminal suspect’s hard drive allows the archivist to prepare a comprehensive manifest of the electronic files a donor has turned over for accession; the same hardware that allows the forensics investigator to create an algorithmically authenticated “image” of a file system allows the archivist to ensure the integrity of digital content once captured from its source media; the same data-recovery procedures that allow the specialist to discover, recover, and present as trial evidence an “erased” file may allow a scholar to reconstruct a lost or inadvertently deleted version of an electronic manuscript—and do so with enough confidence to stake reputation and career.

Digital forensics therefore offers archivists, as well as an archive’s patrons, new tools, new methodologies, and new capabilities. Yet as even this brief description must suggest, digital forensics does not affect archivists’ practices solely at the level of procedures and tools. Its methods and outcomes raise important legal, ethical, and hermeneutical questions about the nature of the cultural record, the boundaries between public and private knowledge, and the roles and responsibilities of donor, archivist, and the public in a new technological era.

When collections are donated to an archive, there is usually a gift agreement between the donor and the archival organization, which documents how the collection of material can be used. For example, it is fairly common for there to be a period where portions (or all) of the archive are kept dark. Less often, gift agreements stipulate that the collection must be made open on the Web, and sometimes money changes hands. Born digital content in archives is new enough that cultural heritage organizations are still grappling with the best way to talk to their donors about donating born digital content.

There has been a bit of attention to sharing best practices about born digital content between organizations, and rising awareness about the sorts of issues that need to be considered. As a software developer tasked with building applications that can be used across these archival collections, the special-snowflake nature of these gift agreements has been a bit of an annoyance. If every collection of born digital content has slightly different stipulations about what, when and how content can be used, it makes building access applications difficult. The situation is compounded somewhat because the gift agreements themselves aren’t shared publicly (at least at my place of work), so you don’t even know what you can and can’t do. I’ve observed that this has a tendency to derail conversations about access to born digital content–and access is an essential ingredient to ensuring the long-term preservation of digital content. It’s not like you can take a digital file, put it on a server, and come back in 25 or even 5 years and expect to open it and use it.

So, what does this have to do with Zuckerman’s post, and the intrinsic value of metadata to the NSA? When Zuckerman provided his students with access to his email archive he did it in the context of a particular trust scenario. A gift agreement in an archive serves the same purpose, by documenting a trust scenario between the donor and the institution that is receiving the gift. The NSA allegedly has been collecting information from Verizon, Facebook, Google, et al. outside of the trust scenario provided by the Fourth Amendment to the Constitution. When I look at things this way, the special-snowflakism of gift agreements doesn’t seem so annoying anymore. It is through these agreements that cultural heritage organizations establish their authenticity and trust, and it is by them that they become a desirable place to deposit born digital content. If they have to be unique per donor, and this hampers unified access to born digital collections, that seems like a price worth paying. Ideally there would be a standard set of considerations to use when putting the gift agreement together. But if we can’t fit everyone into the same framework, maybe that’s not such a bad thing.

The other commonplace thing that strikes me is that the same technology that can be used for good, say digital humanities research or forensic discovery, can also be used for ill. Having a strong sense of ethics, as a professional, as a citizen, and as a human being is extremely important to establishing the context in which technology is used – and negotiating between the three can sometimes require finesse, and in the case of Snowden, courage.


It's your data. It's your life.

I wrote briefly about the Open Science Champions of Change event last week, but almost a week later the impassioned message that Kathy Giusti delivered is still with me. Giusti is the Founder and Chief Executive Officer of the Multiple Myeloma Research Foundation (MMRF), and is herself battling the fatal disease. In her introduction, and later during the panel discussion, she made a strong case for patients to be able to opt in to open access data sharing. I thought I’d point to these two moments in the two-hour video stream, and transcribe what she said:

http://www.youtube.com/watch?v=a26cEwbyMGQ#t=1h15m50s

Patients don’t know that data is not shared. They don’t know … If patients knew how long it took to publish, if they knew, it’s your tissue, it’s your data, it’s your life. Believe me, patients would be the first force to start really changing the culture and backing everybody around open access.

http://www.youtube.com/watch?v=a26cEwbyMGQ#t=1h34m54s

Q: A lot of people when they hear about the sharing of clinical data talk about concerns of privacy. How do we start to handle those concerns, and how do we actually encourage patients to contribute their data in meaningful ways to research, so that we can actually continue to drive the successes that we are seeing here?

Giusti: When you’re a patient, and you’re living with a fatal disease, you don’t lie awake and wonder what happens with my data. If patients understand the role they can play in personalizing their own risk taking abilities … We all do this when we work with our banks. There’s certain information that we’re always giving out when we go online, and there’s certain information that we always keep private. And in a future world that’s what patients are going to do. So when you start talking with the patients, and you ask them: “Would you be willing to share your information?” It just depends on the patient, and it depends on how much they would be willing to give. For someone like me, I’m an identical twin, the disease of Myeloma skews in my family, my grandfather had it, my identical twin does not, I would be crazy not to be published … and I’ve done it, and so has my twin … biopsies, whatever we need. Put it in the public domain. I know everybody isn’t going to be like me, but even if you get us half your information we’re making progress, and we can start to match you with the right types of researchers and clinicians that care.


Open Science Champions of Change

I had the opportunity to go to the White House yesterday to attend the Open Science Champions of Change award ceremony. I’m not sure why I was invited, perhaps because I nominated Aaron Swartz for it, and happen to be local. Unfortunately, Aaron didn’t win the award. I guess it would’ve been sad to award it to him posthumously. But it’s a sad story. Whatever the reason, I was sure honored to be there.

It was just amazing to see some of my heroes like Paul Ginsparg (arXiv), David Lipman (PubMed, Genbank) and Jeremiah Ostriker (Sloan Digital Sky Survey) in the same room, and on a panel where they could share ideas about the work they’ve done–and what remains to be done. The event was live streamed and is now available on the White House Youtube channel. The full list of the other amazing recipients and their bios is available here.

http://www.youtube.com/watch?v=a26cEwbyMGQ

So many things were said over the two hours, it’s hard for me to summarize here. But I thought I would jot down the main theme that struck me, absent a lot of the details about the projects that were discussed. Hopefully I can look back later and say, oh wow, I went to that.


During his intro, Jeremiah Ostriker talked about how the Sloan Digital Sky Survey was set up from the beginning to require public data sharing on the Internet. He said that it wasn’t easy, but that they made it work. David Lipman talked humbly about how PubMed and GenBank make all publicly funded research and data available at an astonishing rate: millions of users, and many terabytes of data a day. There was much discussion about how to incentivize scientists to share their research. Lipman pointed out that while there was a history of sharing pre-prints in the physics community (which helped Ginsparg realize arXiv), the biomedical field lacks this culture to some degree. Ginsparg acknowledged this, while pointing out that compelling new applications that change what it means to do research can mitigate this.

I don’t remember how it came up, but at one point Ostriker was asked what needed to be done to incentivize more public sharing of research and he responded quickly, simply and with a smile:

People like to follow rules.

I think Ostriker was not only referring to the way he helped set up the Sloan Digital Sky Survey, but also to the proposed legislation, the Fair Access to Science and Technology Research Act (FASTR) or Aaron’s Other Law, which is still pending, and in need of support. People kind of laughed a bit when Jack Andraka (whose story is freakin’ amazing) said he was planning to start a petition to bring down the paywalls in front of publicly funded research. He described how his own research was obstructed by these paywalls. He’s wicked smart and just a kid, and has a humorous way of presenting the issues–so a bit of laughter was OK I guess. But Ostriker, who is 76, and Andraka, who is 16, were right on key, given where they were sitting:

The rules need to change. It’s time…there’s still time, right?


tiny alien phenomenology review

Alien Phenomenology, or What It’s Like to Be a Thing by Ian Bogost
My rating: 3 of 5 stars

I found this book to be quite accessible and totally incomprehensible at the same time. It was kind of a surreal joy to read. I liked how it flipped the artificial intelligence research agenda of getting machines to think (like people), to getting humans to imagine what it was like to be a thing. I also came to appreciate Bogost’s variation on Latour’s litanies, so-called tiny ontology. And I really appreciated his emphasis on making things to guide thinking or philosophical carpentry … and the importance of cultivating a sense of wonder. His use of real examples and case studies to demonstrate his thinking was also very helpful–and sometimes quite humorous. I’m wandering back to Latour to read We Have Never Been Modern based on some discussion of it in this book.

So, in the spirit of tiny ontology here are some random quotes I highlighted on my Kindle:

To be sure, computers often do entail human experience and perception. The human operator views words and images rendered on a display, applies physical forces to a mouse, seats memory chips into motherboard sockets. But not always. Indeed, for the computer to operate at all for us first requires a wealth of interactions to take place for itself. As operators or engineers, we may be able to describe how such objects and assemblages work. But what do they experience? What’s their proper phenomenology? In short, what is it like to be a thing?

Theories of being tend to be grandiose, but they need not be, because being is simple. Simple enough that it could be rendered via screen print on a trucker’s cap. I call it tiny ontology, precisely because it ought not demand a treatise or a tome. I don’t mean that the domain of being is small— quite the opposite, as I’ll soon explain. Rather, the basic ontological apparatus needed to describe existence ought to be as compact and unornamented as possible.

For the ontographer, Aristotle was wrong: nature does not operate in the shortest way possible but in a multitude of locally streamlined yet globally inefficient ways. Indeed, an obsession with simple explanations ought to bother the metaphysician. Instead of worshipping simplicity, OOO embraces messiness. We must not confuse the values of the design of objects for human use, such as doors, toasters, and computers, with the nature of the world itself. An ontograph is a crowd, not a cellular automaton that might describe its emergent operation. An ontograph is a landfill, not a Japanese garden. It shows how much rather than how little exists simultaneously, suspended in the dense meanwhile of being.

Yet once we are done nodding earnestly at Whitehead and Latour, what do we do? We return to our libraries and our word processors. We refine our diction and insert more endnotes. We apply “rigor,” the scholarly version of Tinker Bell’s fairy dust, in adequate quantities to stave off interest while cheating death. For too long, being “radical” in philosophy has meant writing and talking incessantly, theorizing ideas so big that they can never be concretized but only marked with threatening definite articles (“ the political,” “the other,” “the neighbor,” “the animal”). For too long, philosophers have spun waste like a goldfish’s sphincter, rather than spinning yarn like a charka. Whether or not the real radical philosophers march or protest or run for office in addition to writing inscrutable tomes— this is a question we can, perhaps, leave aside. Real radicals, we might conclude, make things. Examples aren’t hard to find, and some even come from scholars who might be willing to call themselves philosophers.



thoughts on SHARE

My response to Library Journal’s ARL Launches Library-Led Solution to Federal Open Access Requirements, which I’m posting here as well because I spent a bit of time on it. Thanks for the heads up, Dorothea:

https://twitter.com/LibSkrat/status/345148738488115201


In principle I like the approach that SHARE is taking, that of leveraging the existing network of institutional repositories, and the amazingly decentralized thing that is the Internet and the World Wide Web. Simply getting article content out on the Web, where it can be crawled, as Harnad suggests, has bootstrapped incredibly useful services like Google Scholar. Scholar works with the Web we have, not some future Web where we all share metadata perfectly using formats that will be preserved for the ages. They don’t use OpenURL, OAI-ORE, SWORD, etc. They do have lots o’ crawlers, and some magical PDF parsing code that can locate citations. I would like to see a plan that’s a bit scruffier and less neat.

Like Dorothea I have big doubts about building what looks to be a centralized system that will then push out to IRs using SWORD, and support some kind of federated search with OpenURL. Most IRs seem more like research experiments than real applications oriented around access that could sustain the kind of usage you might see if mainstream media or a MOOC happened to reference their content. Instead of a 4-phase plan with digital library acronym soup, I’d rather see some very simple things that could be done to make sure that federally funded research is deposited in an IR, and that it can be traced back to the grant that funded it. Of course, I can’t resist throwing out a straw man.

Requiring funding agencies to have a URL for each grant, which could be referenced in IRs, seems like the first logical step. Pinging that URL (kind of like a trackback) when there is a resource (article, dataset, etc.) associated with the grant would allow the granting institution to know when something was published that referenced that URL. The granting organization could then look at its grants and see which ones lacked a deposit, and follow up with the grantees. They could also examine pingbacks to see which ones are legit or not. Perhaps further on down the line these resources could be integrated into web archiving efforts, but I digress.
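To make the straw man a little more concrete, here’s roughly what I’m imagining, as a Node.js sketch. Everything in it is hypothetical: the grant URL, the payload shape, and the idea that agencies would accept a POST at all.

var https = require('https');
var url = require('url');
var querystring = require('querystring');

function pingGrant(grantUrl, resourceUrl) {
  // tell the granting agency that resourceUrl references grantUrl
  var body = querystring.stringify({source: resourceUrl, target: grantUrl});
  var options = url.parse(grantUrl);
  options.method = 'POST';
  options.headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Content-Length': Buffer.byteLength(body)
  };
  var req = https.request(options, function(res) {
    console.log('pinged ' + grantUrl + ': HTTP ' + res.statusCode);
  });
  req.write(body);
  req.end();
}

// an IR would call something like this when a deposit tied to a grant happens
pingGrant(
  'https://grants.example.gov/awards/1234567',  // hypothetical grant URL
  'https://ir.example.edu/handle/2027/42'       // the deposited article
);

That’s it: a source, a target, and an HTTP request, which is why I think the hard part is the grant URLs, not the plumbing.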

There would probably be a bit of curation of these pingbacks, but nothing a big federal agency can’t handle, right? I think putting data curation first, instead of last, as the icing on the 4-phase cake, is important. I don’t underestimate the challenge in requiring a URL for every grant; perhaps some agencies already have them. I think this would put the onus on the federal agencies to make this work, rather than the publishers (who, like it or not, have a commercial incentive to not make it too easy to provide open access) and universities (who must have a way of referencing grants if any of their plan is to work). This would be putting Linked Data first, rather than last, as rainbow sprinkles on the cake.

Sorry if this comes off as a bit ranty or incomprehensible. I wish Aaron were here to help guide us… It is truly remarkable that the OSTP memo was issued, and that we have seen responses from the ARL and the AAP. I hope we’ll see responses from the federal agencies that the memo was actually directed at.


recent Wikipedia citations as JSON

Here is a little webcast about some work in progress to stream recent citations out of Wikipedia. It uses previous work I did on the wikichanges Node library. Beware, I say “um” and “uh” a lot while showing you my terminal window. This idea could very well be brain-damaged, since it pings the Wikipedia API for the diff of each change in selected Wikipedias to see if it contains one or more citations. On the plus side, it emits the citations as JSON, which should be suitable for downstream apps of some sort, which I haven’t thought much about yet. Get in touch if you have some ideas.

https://vimeo.com/67893886
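In case the terminal window in the video is hard to follow, here’s a loose sketch of the idea (not the actual code): listen to the stream of changes with wikichanges, fetch the diff for each change from the Wikipedia API, and emit a line of JSON when the diff added a citation template. The change object’s fields, and the way I pull revision ids out of its diff URL, are assumptions that may not match the library exactly.

var url = require('url');
var https = require('https');
var wikichanges = require('wikichanges');

new wikichanges.WikiChanges().listen(function(change) {
  // the diff URL from the IRC feed looks like ...index.php?diff=NEW&oldid=OLD
  var q = url.parse(change.url, true).query;
  if (!q.diff || !q.oldid) return;

  // ask the Wikipedia API for the diff of this one change
  var api = 'https://' + url.parse(change.url).hostname +
            '/w/api.php?action=compare&format=json' +
            '&fromrev=' + q.oldid + '&torev=' + q.diff;

  https.get(api, function(resp) {
    var body = '';
    resp.on('data', function(chunk) { body += chunk; });
    resp.on('end', function() {
      var compare = JSON.parse(body).compare;
      var diff = compare && compare['*'];
      // crude check: did this edit add a {{cite ...}} template?
      if (diff && diff.indexOf('{{cite') !== -1) {
        console.log(JSON.stringify({page: change.page, diff: change.url}));
      }
    });
  });
});

One API call per edit is the brain-damaged part; something smarter would batch or filter before hitting the API.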


maps on the web with a bit of midlife crisis

TL;DR – I created a JavaScript library for getting GeoJSON out of Wikipedia’s API in your browser (and Node.js). I also created a little app that uses it to display Wikipedia articles for things near you that need a photograph/image or editorial help.


I probably don’t need to tell you how much the state of mapping on the Web has changed in the past few years. I was there. I can remember trying to get MapServer set up in the late 1990s, with limited success. I was there squinting at how Adrian Holovaty reverse engineered a mapping API out of Google Maps at chicagocrime.org. I was there when Google released their official API, which I used some, and then they changed their terms of service. I was there in the late 2000s using OpenLayers and TileCache, which were so much more approachable than MapServer was a decade earlier. I’m most definitely not a mapping expert, or even an amateur–but you can’t be a Web developer without occasionally needing to dabble, and pretend you are.

I didn’t realize until very recently how easy the cool kids have made it to put maps on the Web. Who knew that in 2013 there would be an open source JavaScript library that lets you add a map to your page in a few lines, and that it’s in use by Flickr, FourSquare, CraigsList, Wikimedia, the Wall Street Journal, and others? Even more astounding: who knew there would be an openly licensed source of map tiles and data, that was created collaboratively by a project with over a million registered users, and that it would be good enough to be used by Apple? I certainly didn’t even dream about it.

Ok, hold that thought…

So, Wikipedia recently announced that they were making it easy to use your mobile device to add a photograph to a Wikipedia article that lacked an image.

When I read about this I thought it would be interesting to see which Wikipedia articles were about things near my current location, and which of them lacked images, so I could go and take pictures of them. Before I knew it I had a Web app called ici (French for here) that does just that:

Articles that need images are marked with little red cameras. It was pretty easy to add orange markers for Wikipedia articles that had been flagged as needing edits or citations. Calling it an app is an overstatement: it is just static HTML, JavaScript and CSS that I serve up. HTML5’s geolocation features and Wikipedia’s API (which has GeoData enabled) take care of the rest.
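For the curious, the gist of it looks something like this (a simplified sketch, not ici’s actual source): ask the browser where you are, then ask Wikipedia’s GeoData geosearch API what’s nearby. Figuring out which of those articles need images or edits takes an extra lookup of each article’s templates and categories, which I’ve left out here.

// ask the browser where we are
navigator.geolocation.getCurrentPosition(function(position) {
  var lat = position.coords.latitude;
  var lon = position.coords.longitude;

  // ask Wikipedia's GeoData API for articles near that point
  var api = 'https://en.wikipedia.org/w/api.php' +
            '?action=query&list=geosearch&format=json&origin=*' +
            '&gsradius=10000&gslimit=50' +
            '&gscoord=' + lat + '%7C' + lon;

  fetch(api)
    .then(function(resp) { return resp.json(); })
    .then(function(data) {
      // each result has a title, coordinates and a distance in meters;
      // ici decides the marker color after checking the article for
      // image/cleanup templates (not shown here)
      data.query.geosearch.forEach(function(article) {
        console.log(article.title, article.lat, article.lon, article.dist);
      });
    });
});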

After I created the app I got a tweet from a real geo-hacker, Sean Gillies, who asked:

https://twitter.com/sgillies/status/332185543234441216

Sean is right, it would be really useful to have GeoJSON output from Wikipedia’s API. But I was on a little bit of a tear, so rather than figuring out how to get GeoJSON into MediaWiki and deployed to all the Wikipedia servers, I wondered if I could extract ici’s use of the Wikipedia API into a slightly more generalized JavaScript library that would make it easy to get GeoJSON out of Wikipedia–at least from JavaScript. That quickly resulted in wikigeo.js, which is now getting used in ici. Getting GeoJSON from Wikipedia using wikigeo.js is done in just one line, and adding the GeoJSON to a map in Leaflet can also be done in one line:

geojson([-73.94, 40.67], function(data) {
    // add the geojson to a Leaflet map
    L.geoJson(data).addTo(map)
});

This call results in the callback getting some GeoJSON data that looks something like this:

{
  "type": "FeatureCollection",
  "features": [
    {
      "id": "http://en.wikipedia.org/wiki/New_York_City",
      "type": "Feature",
      "properties": {
        "name": "New York City"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.94,
          40.67
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Kingston_Avenue_(IRT_Eastern_Parkway_Line)",
      "type": "Feature",
      "properties": {
        "name": "Kingston Avenue (IRT Eastern Parkway Line)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9422,
          40.6694
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Crown_Heights_–_Utica_Avenue_(IRT_Eastern_Parkway_Line)",
      "type": "Feature",
      "properties": {
        "name": "Crown Heights – Utica Avenue (IRT Eastern Parkway Line)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9312,
          40.6688
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Brooklyn_Children's_Museum",
      "type": "Feature",
      "properties": {
        "name": "Brooklyn Children's Museum"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9439,
          40.6745
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/770_Eastern_Parkway",
      "type": "Feature",
      "properties": {
        "name": "770 Eastern Parkway"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9429,
          40.669
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Eastern_Parkway_(Brooklyn)",
      "type": "Feature",
      "properties": {
        "name": "Eastern Parkway (Brooklyn)"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.9371,
          40.6691
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Paul_Robeson_High_School_for_Business_and_Technology",
      "type": "Feature",
      "properties": {
        "name": "Paul Robeson High School for Business and Technology"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.939,
          40.6755
        ]
      }
    },
    {
      "id": "http://en.wikipedia.org/wiki/Pathways_in_Technology_Early_College_High_School",
      "type": "Feature",
      "properties": {
        "name": "Pathways in Technology Early College High School"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.939,
          40.6759
        ]
      }
    }
  ]
}

There are options for broadening the radius, increasing the number of results, and fetching additional properties of the Wikipedia article, such as article summaries, images, categories, and templates used. Here’s an example using all the knobs:

geojson(
  [-73.94, 40.67],
  {
    limit: 5,
    radius: 1000,
    images: true,
    categories: true,
    summaries: true,
    templates: true
  },
  function(data) {
    L.geoJson(data).addTo(map)
  }
);

This results in GeoJSON like the following (abbreviated):

{
  "type": "FeatureCollection",
  "features": [
    {
      "id": "http://en.wikipedia.org/wiki/Silver_Spring,_Maryland",
      "type": "Feature",
      "properties": {
        "name": "Silver Spring, Maryland",
        "image": "Downtown_silver_spring_wayne.jpg",
        "templates": [
          "-",
          "Abbr",
          "Ambox",
          "Ambox/category",
          "Ambox/small",
          "Basepage subpage",
          "Both",
          "Category handler",
          "Category handler/blacklist",
          "Category handler/numbered"
        ],
        "summary": "Silver Spring is an unincorporated area and census-designated place (CDP) in Montgomery County, Maryland, United States. It had a population of 71,452 at the 2010 census, making it the fourth most populous place in Maryland, after Baltimore, Columbia, and Germantown.\nThe urbanized, oldest, and southernmost part of Silver Spring is a major business hub that lies at the north apex of Washington, D.C. As of 2004, the Central Business District (CBD) held 7,254,729 square feet (673,986 m2) of office space, 5216 dwelling units and 17.6 acres (71,000 m2) of parkland. The population density of this CBD area of Silver Spring was 15,600 per square mile all within 360 acres (1.5 km2) and approximately 2.5 square miles (6 km2) in the CBD/downtown area. The community has recently undergone a significant renaissance, with the addition of major retail, residential, and office developments.\nSilver Spring takes its name from a mica-flecked spring discovered there in 1840 by Francis Preston Blair, who subsequently bought much of the surrounding land. Acorn Park, tucked away in an area of south Silver Spring away from the main downtown area, is believed to be the site of the original spring.\n\n",
        "categories": [
          "All articles to be expanded",
          "All articles with dead external links",
          "All articles with unsourced statements",
          "Articles to be expanded from June 2008",
          "Articles with dead external links from July 2009",
          "Articles with dead external links from October 2010",
          "Articles with dead external links from September 2010",
          "Articles with unsourced statements from February 2007",
          "Articles with unsourced statements from May 2009",
          "Commons category template with no category set"
        ]
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -77.019,
          39.0042
        ]
      }
    },
    ...
  ]
}

I guess this is a long way of saying, if you want to put Wikipedia articles on a map, or otherwise need GeoJSON for Wikipedia articles for a particular location, take a look at wikigeo.js. If you do, and have ideas for making it better, please let me know. Oh, by the way you can npm install wikigeo and use it from Node.js.

I guess JavaScript, HTML5, Node.js, and CoffeeScript are like my midlife crisis…my red sports car. But maybe being the old guy, and losing my edge, isn’t really so bad?

I’m losing my edge
to better-looking people
with better ideas
and more talent
and they’re actually
really, really nice.
James Murphy

It definitely helps when the kids coming up from behind have talent and are really, really nice. You know?


Everything is Data

Reassembling the Social: An Introduction to Actor-Network-Theory by Bruno Latour
My rating: 4 of 5 stars

I picked this up because the folks over at Philosophy in a Time of Software kicked things off by discussing this book by Latour. I’m really not terribly knowledgeable about sociology, but I did a fair bit of reading in the social sciences while getting my library union card studying library/information science. So I wasn’t completely underwater, but I definitely felt like I was swimming in the deep end. I didn’t get the connection to computer programming until quite late in the book, but it was definitely a bit of a lightbulb moment when I did. Latour’s style (at least that of the unmentioned translator) is refreshingly direct, personal, and unabashedly opinionated. He spends much of the book describing just how complicated social science is, and how far it has gone off the tracks…which is quite entertaining at times.

A few things I will take with me from this book and its portrayal of Actor Network Theory:

I will never be able to say or write the word “social” without feeling like I’m glossing over a whole lot of stuff, and that this stuff is what I should actually be researching, talking and writing about. Latour stresses that it’s important not to dumb things down by appealing to established social forces (class, gender, imperialism, etc.) but instead to trace the actors, their controversies, and their relations. This work requires discipline, because it’s tempting to reduce the complexity by using these familiar abstractions instead of expending the energy and effort to document the scenarios as faithfully as possible. It means letting the actors have a voice and say what they think they are doing, rather than the researcher telling them what they are actually doing. I work in libraries/archives, so I particularly liked Latour’s insistence on the importance of notebooks, writing, and documentation:

The best way to proceed at this point … is simply to keep track of all our moves, even those that deal with the very production of the account. This is neither for the sake of epistemic reflexivity nor for some narcissist indulgence into one’s own work, but because from now on everything is data: everything from the first telephone call to a prospective interviewee, the first appointment with the advisor, the first corrections made by a client on a grant proposal, the first launching of a search engine, the first list of boxes to tick in a questionnaire. In keeping with the logic of our interest in textual reports and accounting, it might be useful to list the different notebooks one should keep—manual or digital, it no longer matters much. p. 286.

… and that this is the work of “slowciology” – it requires you to slow down, and really describe/dig into things.

The other really interesting thing about this book for me was the insistence that social actors do not need to be human. Social science research typically takes face-to-face interaction between people as its primary focus. Latour doesn’t dispute the importance of studying human actors, but emphasizes that it’s useful to increase the number of actors under study by treating objects (mediators) as actors too. We usually think of actors as having agency, free will, etc. … but objects are also complex things, with particular affordances and extensive relations with other things in the field. You get only a very limited view of what is going on if you don’t trace these relations.

Things, quasi-objects, and attachments are the real center of the social world, not the agent, person, member, or participant—nor is it society or its avatars. (p. 237)

As a software developer, I really identified with Latour’s insistence on the role that objects play in our understanding of activities around us; how this view necessarily complicates things a great deal, and requires us to slow down to really understand/describe what is going on. It is hard work. And it’s only when we understand the various actors and their relations, the actual ones, not the abstract ones in the architecture diagram, or in the theory about the software, that we will be in a position to effectively change things or build anew.


#75

When taxes are too high,
people go hungry.
When the government is too intrusive,
people lose their spirit.

Act for the people’s benefit.
Trust them; leave them alone.

Tao Te Ching #75


python heal thyself

https://twitter.com/ginatrapani/status/314552254592069632

After seeing Gina’s tweet, I was curious to see if there was any difference by gender in the tweets directed at @adriarichards over the recent controversy at PyCon. I wasn’t confident I would find anything. It was more a feeble attempt to make Python make sense of something senseless that happened at PyCon; or, to paraphrase “Physician, heal thyself”…for Python to heal itself.

I used twarc to collect 13,472 tweets that mentioned @adriarichards from the search API. I then added a utility filter that uses genderator to filter the line-oriented JSON based on a guess at the gender (Twitter doesn’t track it). genderator identified 2,433 (18%) tweets from women, 5,268 (39%) from men, and 5,771 (42%) that were of unknown gender. I then added another utility that reads a stream of tweets and generates a tag cloud as a standalone HTML file using d3-cloud.

I put them all together on the command line like this:

% twarc.py @adriarichards
% cat @adriarichards-20130321200320.json | utils/gender.py --gender male | utils/wordcloud.py > male.html
% cat @adriarichards-20130321200320.json | utils/gender.py --gender female | utils/wordcloud.py > female.html

I realize word clouds probably aren’t the greatest way to visualize the differences in these messages. If you have better ideas, let me know. I made the tweet JSON available if you want to try your own visualization.


Looking at these didn’t yield much insight. So instead of visualizing all the words that each gender used, I wondered what the clouds would look like if I limited them to words that were uniquely used by each gender. In other words, what words did males use in their tweets that were not used by females, and vice-versa? There were 1,333 (11%) uniquely female words and 4,767 (39%) uniquely male words, with a shared vocabulary of 5,988 (50%) words.
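The set arithmetic for that is nothing fancy. Here’s a little Node.js sketch of it (the real pipeline above is Python and shell, and the per-gender file names here are made up):

var fs = require('fs');

// collect the distinct lowercased words from a file of line-oriented tweet JSON
function words(file) {
  var seen = {};
  fs.readFileSync(file, 'utf8').split('\n').forEach(function(line) {
    if (!line.trim()) return;
    JSON.parse(line).text.toLowerCase().split(/\s+/).forEach(function(w) {
      if (w) seen[w] = true;
    });
  });
  return seen;
}

var male = words('male.json');      // hypothetical per-gender tweet files
var female = words('female.json');

var onlyMale = Object.keys(male).filter(function(w) { return !(w in female); });
var onlyFemale = Object.keys(female).filter(function(w) { return !(w in male); });
var shared = Object.keys(male).filter(function(w) { return w in female; });

console.log(onlyMale.length + ' uniquely male words');
console.log(onlyFemale.length + ' uniquely female words');
console.log(shared.length + ' shared words');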


I’m not sure there is much more insight here either. I guess there is some weak comfort in the knowledge that 1/2 of the words used in these tweets were shared by both sexes.