Links in Obergefell v. Hodges

Last week’s landmark Supreme Court ruling on same-sex marriage was, as is routine, published on the Web as a PDF. Given the history of URL use in Supreme Court opinions I thought I would take a quick look to see what URLs were present. There are two, both in Justice Alito’s dissenting opinion, and one is already broken … just four days after the PDF was published. You can see it yourself at the bottom of page 100 in the PDF.

If you point your browser at

http://www.cdc.gov/nchs/data/databrief/db18.pdf

you will get a page-not-found error.

Sadly even the Internet Archive doesn’t have a snapshot of the page available.

But notice that the Internet Archive thinks it can still get a copy. That’s because the Centers for Disease Control and Prevention’s website is responding with a 200 OK instead of a 404 Not Found:

zen:~ ed$ curl -I http://www.cdc.gov/nchs/data/databrief/db18.pdf
HTTP/1.1 200 OK
Content-Type: text/html
X-Powered-By: ASP.NET
X-UA-Compatible: IE=edge,chrome=1
Date: Tue, 30 Jun 2015 16:22:18 GMT
Connection: keep-alive

At any rate, it’s not the Internet Archive’s fault that they haven’t archived the page, originally published in 2009, because the URL is actually a typo. Instead it should be

http://www.cdc.gov/nchs/data/databriefs/db18.pdf

which resolves to the intended document.

So between the broken URL and the 200 OK for something that wasn’t found, we’ve got issues of link rot and reference rot all rolled up in a one-character typo. Sigh.

I think a couple of lessons for web publishers can be distilled from this little story:

  • when publishing on the Web, include link checking as part of your editorial process (a minimal sketch of such a check follows below)
  • if you are going to publish links on the Web, use a format that’s easy to check … like HTML.
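
As a rough illustration of the first point, here is a minimal link-checking sketch in Python that flags hard failures as well as suspicious soft 404s (a 200 OK whose HTML body looks like an error page). The URL list and the error phrases are just examples for this post, not part of anyone’s actual editorial workflow.

import requests

# example URLs to check -- in practice these would be extracted from a draft
urls = [
    "http://www.cdc.gov/nchs/data/databrief/db18.pdf",
    "http://www.cdc.gov/nchs/data/databriefs/db18.pdf",
]

# phrases that often appear in "soft 404" pages served with a 200 OK
ERROR_HINTS = ["page not found", "file not found", "error 404"]

for url in urls:
    try:
        resp = requests.get(url, timeout=30)
    except requests.RequestException as e:
        print("BROKEN ", url, e)
        continue
    if resp.status_code != 200:
        print("BROKEN ", url, resp.status_code)
    elif "html" in resp.headers.get("Content-Type", "").lower() and \
            any(hint in resp.text.lower() for hint in ERROR_HINTS):
        # a 200 OK whose body looks like an error page is probably a soft 404
        print("SUSPECT", url, "200 OK but the body looks like an error page")
    else:
        print("OK     ", url, resp.headers.get("Content-Type", ""))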


SKOS and Wikidata

For #DayOfDH yesterday I created a quick video about some data normalization work I have been doing using Wikidata entities. I may write more about this work later, but the short version is that I have a bunch of spreadsheets with author names in a variety of formats and transliterations, which I need to collapse into unique identifiers so that I can provide a unified display of the data per author. So for example, my spreadsheets have information for Fyodor Dostoyevsky under the following variants:

  • Dostoeieffsky, Feodor
  • Dostoevski
  • Dostoevski, F. M.
  • Dostoevski, Fedor
  • Dostoevski, Feodor Mikailovitch
  • Dostoevskii
  • Dostoevsky
  • Dostoevsky, Fiodor Mihailovich
  • Dostoevsky, Fyodor
  • Dostoevsky, Fyodor Michailovitch
  • Dostoieffsky
  • Dostoieffsky, Feodor
  • Dostoievski
  • Dostoievski, Feodor Mikhailovitch
  • Dostoievski, Feodore M.
  • Dostoievski, Thedor Mikhailovitch
  • Dostoievsky
  • Dostoievsky, Feodor Mikhailovitch
  • Dostoievsky, Fyodor
  • Dostojevski, Feodor
  • Dostoyeffsky
  • Dostoyefsky
  • Dostoyefsky, Theodor Mikhailovitch
  • Dostoyevski, Feodor
  • Dostoyevsky
  • Dostoyevsky, Fyodor
  • Dostoyevsky, F. M.
  • Dostoyevsky, Feodor Michailovitch
  • Dostoyevsky, Feodor Mikhailovich

So, obviously, I want to normalize these. But I also want to link each name up to an identifier that could be useful for obtaining other information: an image of the author, a description of their work, possibly links to works by the author, and so on. I’m going to try to map the authors to Wikidata, largely because Wikidata links out to other places like the Virtual International Authority File and Freebase, but also because there are images on Wikimedia Commons and nice descriptive text for the people. As an example, here is the Wikidata page for Dostoyevsky.

To aid in this process I created a very simple command line tool and library called wikidata_suggest, which uses Wikidata’s suggest API to interactively match a string of text to a Wikidata entity. If Wikidata doesn’t have any suggestions, the utility falls back to looking through a page of Google’s search results for a Wikipedia page and then optionally lets you use that instead.
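
If you’re curious what that kind of lookup involves, here is a minimal sketch against Wikidata’s public wbsearchentities API. It illustrates the general idea rather than the actual wikidata_suggest code, and the query string is just an example.

import requests

def suggest(text, limit=5):
    # ask Wikidata's entity search API for candidate entities matching the text
    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbsearchentities",
        "search": text,
        "language": "en",
        "format": "json",
        "limit": limit,
    })
    resp.raise_for_status()
    return resp.json().get("search", [])

for hit in suggest("Dostoyevsky, Fyodor"):
    # each hit carries a Wikidata id, a label and usually a short description
    print(hit["id"], hit.get("label"), "-", hit.get("description", ""))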

SKOS

Soon after tweeting about the utility and the video, I heard from Alberto, who works on the NASA Astrophysics Data System and was interested in using wikidata_suggest to try to link the Unified Astronomy Thesaurus up to Wikidata.

Fortunately the UAT is made available as a SKOS RDF file. So I wrote a little proof of concept script named skos_wikidata.py that loads a SKOS file, walks through each skos:Concept and asks you to match the skos:prefLabel to Wikidata using wikidata_suggest. Here’s a quick video I made of what this process looks like:

I guess this is similar to what you might do in OpenRefine, but I wanted a bit more control over how the data was read in, modified and matched up. I’d be interested in your ideas on how to improve it if you have any.
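
The concept-walking part of a script like skos_wikidata.py is straightforward with rdflib. Here is a minimal sketch, assuming a local copy of the thesaurus saved as uat.rdf; the file name and the print call are placeholders, since the real script hands each label to wikidata_suggest interactively.

from rdflib import Graph
from rdflib.namespace import RDF, SKOS

g = Graph()
g.parse("uat.rdf")  # hypothetical local copy of the UAT SKOS file

for concept in g.subjects(RDF.type, SKOS.Concept):
    label = g.value(concept, SKOS.prefLabel)
    if label is None:
        continue
    # the real script would ask wikidata_suggest about this label here;
    # the sketch just prints the concept URI and its preferred label
    print(concept, str(label))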

It’s kind of funny how Day of Digital Humanities quickly morphed into Day of Astrophysics…


A Personal Panopticon

Here is how you can use your Google Search History and jq to create a top 10 list of the things you’ve googled for the most.

First request an archive of your data from your Google Search History. Yeah, creepy. Then install jq. Wait for the email from Google telling you your archive is ready, then download and unzip it. Open a terminal window in the Searches directory, and run this:

jq --raw-output '.event[].query.query_text' *.json \
  | sort | uniq -c | sort -rn | head -10
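
If you’d rather do the counting in Python than with jq, here is a roughly equivalent sketch. It assumes you run it from inside the same Searches directory and relies on the same .event[].query.query_text structure the jq filter uses.

import glob
import json
from collections import Counter

counts = Counter()
for path in glob.glob("*.json"):      # the JSON files in the Searches directory
    with open(path) as f:
        data = json.load(f)
    for event in data.get("event", []):
        counts[event["query"]["query_text"]] += 1

# print the ten most frequent queries, most frequent first
for query, n in counts.most_common(10):
    print(n, query)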

Here’s what I see for the 75,687 queries I’ve typed into Google since July 2005:

 309 google analytics
 130 hacker news
 116 this is my jam
  83 site:chroniclingamerica.loc.gov
  68 jquery
  54 bagit
  48 twitter api
  44 google translate
  37 wikistream
  37 opds

These are (mostly) things that I hadn’t bothered to bookmark, but visited regularly. I suspect there is something more compelling and interesting that could be done with the data. A personal panopticon perhaps.

Oh, and I’d delete the archive from your Google Drive after you’ve downloaded it. If you ever grant other apps the ability to read from your drive they could read your search history. Actually maybe this whole exercise is fraught with peril. You should just ignore it.


VirtualEnv Builds in Sublime Text 3

Back in 1999 I was a relatively happy Emacs user, beginning work at a startup where I was one of the first employees after the founders. As at many startups, the founders were hackers as well as owners, and were routinely working on the servers. When I asked if Emacs could be installed on one of the machines I was told to learn Vi … which I proceeded to do. I needed the job.

Here I am 15 years later, finally starting to use Sublime Text 3 a bit more in my work. I’m not a cool kid anymore, but I can still pretend to be one, eh? The Vintageous plugin lets my fingers feel like they are in Vim, while letting me take advantage of other packages for editing Markdown, interacting with Git, and the lovely eye-pleasing themes that are available. I still feel a bit dirty because, unlike Vim, Sublime is not open source; but at the same time it does feel good to support a small software publisher who is doing good work. Maybe I’ll end up switching back to Vim and supporting it.

Anyway, as a Python developer one thing I immediately wanted to be able to do was use my project’s VirtualEnv during development, and run the test suite from inside Sublime. The Virtualenv package makes creating, activating, deactivating and deleting a virtualenv a snap. But I couldn’t seem to get the build to work properly with the virtualenv, even after setting the Build System to Python - Virtualenv.

Sublime Text 3 - VirtualEnv

After what felt like a lot of googling around (it was probably just 20 minutes) I couldn’t find an answer, until I discovered in the Project documentation that I could save my Project, then go to Project -> Edit Project and add a build_systems stanza like this:

{
  "folders": [
    {
      "path": "."
    }
  ],
  "virtualenv": "/Users/ed/.virtualenvs/curio",
  "build_systems": [
    {
      "name": "Test",
      "shell_cmd": "/Users/ed/.virtualenvs/curio/bin/python setup.py test"
    }
  ]
}

Notice how the shell_cmd is using the Python executable in my VirtualEnv? After saving that I was able to go into Tools -> Build System and set the build system to Test, which matches the name of the build system I added in the JSON. Now a Command-B will run my test suite with the VirtualEnv.

Sublime Text 3 - VirtualEnv w/ Build

I guess it would be nice if the VirtualEnv plugin for Sublime did something to make this easier. But rather than go down that rabbit hole I decided to write it down here for the benefit of my future self (and perhaps you).

If you know of a better way to do this please let me know.


Human Nature and Conduct

Human Nature and Conduct by John Dewey
My rating: 5 of 5 stars

This book came recommended by Steven Jackson when he visited UMD last year. I’m a fan of Jackson’s work on repair, and was curious about how his ideas connected back to Dewey’s Human Nature and Conduct.

I’ve been slowly reading it since then, savoring each chapter on my bus rides to work. It’s a lovely & wise book. Some of the language puts you back into the 1920s, but the ideas are fresh and still so relevant. I’m not going to try to summarize it here; you may have noticed I’ve already posted some quotes from it. Let’s just say it is a very hopeful book, and provides a very clear and yet generous view of the human enterprise.

I don’t know if I was imagining it, but I seemed to see a lot of parallels between it and some reading I’m doing about Buddhism. I noticed over at Wikipedia that Dewey spent some time in China and Japan just prior to delivering these lectures. So maybe it’s not so far-fetched a connection.

I checked it out of the library, but I need to buy a copy of my own so I can re-read it. You can find a copy at the Internet Archive for your ebook reader too.


Method and Materials

Now it is a wholesome thing for any one to be made aware that thoughtless, self-centered action on his part exposes him to the indignation and dislike of others. There is no one who can be safely trusted to be exempt from immediate reactions of criticism, and there are few who do not need to be braced by occasional expressions of approval. But these influences are immensely overdone in comparison with the assistance that might be given by the influence of social judgments which operate without accompaniments of praise and blame; which enable an individual to see for himself what he is doing, and which put him in command of a method of analyzing the obscure and usually unavowed forces which move him to act. We need a permeation of judgments on conduct by the method and materials of a science of human nature. Without such enlightenment even the best-intentioned attempts at the moral guidance and improvement of others often eventuate in tragedies of misunderstanding and division, as is so often seen in the relations of parents and children.

Dewey (1957), p. 321.

Dewey, J. (1957). Human nature and conduct. New York: Modern Library. Retrieved from https://archive.org/details/humannatureandco011182mbp


Something Horrible

There is something horrible, something that makes one fear for civilization, in denunciations of class-differences and class struggles which proceed from a class in power, one that is seizing every means, even to a monopoly of moral ideals, to carry on its struggle for class-power.

Dewey (1957), p. 301.

Dewey, J. (1957). Human nature and conduct. New York: Modern Library. Retrieved from https://archive.org/details/humannatureandco011182mbp


Energies

Human nature exists and operates in an environment. And it is not “in” that environment as coins are in a box, but as a plant is in the sunlight and soil. It is of them, continuous with their energies, dependent upon their support, capable of increase only as it utilizes them, and as it gradually rebuilds from their crude indifference an environment genially civilized.

Dewey (1957), p. 296.

Dewey, J. (1957). Human nature and conduct. New York: Modern Library. Retrieved from https://archive.org/details/humannatureandco011182mbp


Tweets and Deletes

Archives are full of silences. Archivists try to surface these silences by making appraisal decisions about what to collect and what not to collect. Even after they are accessioned, records can be silenced by culling, weeding and purging. We do our best to document these activities, to leave a trail of these decisions, but they are inevitably deeply contingent. The context for the records and our decisions about them unravels endlessly.

At some point we must accept that the archival record is not perfect, and that it’s a bit of a miracle that it exists at all. But in all these cases it is the archivist who has agency: the deliberate or subliminal decisions that determine what comprises the archival record are enacted by an archivist. In addition the record creator has agency, in their decision to give their records to an archive.

Perhaps I’m over-simplifying a bit, but I think there is a curious new dynamic at play in social media archives, specifically archives of Twitter data. I wrote in a previous post about how Twitter’s Terms of Service prevent distribution of Twitter data retrieved from their API, but do allow for the distribution of Tweet IDs and relatively small amounts of derivative data (spreadsheets, etc).

Tweet IDs can then be hydrated, or turned back into the raw original data, by going back to the Twitter API. If a tweet has been deleted you cannot get it back from the API. The net effect is a cleaning, or purging, of the archival record as it is made available on the Web. But the decision of what to purge is made by the record creator (the creator of the tweet), or by Twitter themselves in cases where tweets or users are deleted.
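
For the curious, here is a minimal sketch of what hydration looks like with twarc’s Python API (as of version 1); the credentials and the id file name are placeholders. Tweets that have been deleted or protected simply never come back from the API.

from twarc import Twarc

# placeholder Twitter API credentials -- hydration requires real keys
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

with open("ids.txt") as ids:          # one tweet id per line
    for tweet in t.hydrate(ids):
        # only tweets that still exist are returned; deletes are silent gaps
        print(tweet["id_str"], tweet["user"]["screen_name"])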

For example, let’s look at the collection of Twitter data that Nick Ruest has assembled in the wake of the attack on the offices of Charlie Hebdo earlier this year. Nick collected 13 million tweets mentioning four hashtags related to the attacks, for the period of January 9th to January 28th, 2015. He has made the tweet IDs available as a dataset for researchers to use (a separate file for each hashtag). I was interested in replicating the dataset for potential researchers at the University of Maryland, but also in seeing how many of the tweets had been deleted.

So on February 20th (42 days after Nick started his collecting) I began hydrating the IDs. It took 4 days for twarc to finish. When it did I counted up the number of tweets that I was able to retrieve. The results are somewhat interesting:

hashtag          archived tweets     hydrated     deletes   percent deleted
#JeSuisJuif               96,518       89,584       6,934             7.18%
#JeSuisAhmed             264,097      237,674      26,423            10.01%
#JeSuisCharlie         6,503,425    5,955,278     548,147             8.43%
#CharlieHebdo          7,104,253    6,554,231     550,022             7.74%
Total                 13,968,293   12,836,767   1,131,526             8.10%

It looks like 1.1 million tweets out of the 13.9 million tweet dataset have been deleted. That’s about 8.1%. I suspect even more have been deleted by now. While those datasets are significantly smaller, the rate of deletes for #JeSuisAhmed and #JeSuisJuif seems quite a bit higher than for #JeSuisCharlie and #CharlieHebdo. Could it be that users were concerned about how their tweets would be interpreted by parties analyzing the data?

Of course, it’s very hard for me to say since I don’t have the deleted tweets. I don’t even know who sent them. A researcher interested in these questions would presumably need to travel to York University to work with the dataset. In a way this seems to be how archives usually work. But if you add the Web as a global, public access layer into the mix it complicates things a bit.


The Adventure of Experiment

Love of certainty is a demand for guarantees in advance of action. Ignoring the fact that truth can be bought only by the adventure of experiment, dogmatism turns truth into an insurance company. Fixed ends upon one side and fixed “principles” – that is authoritative rules – on the other, are props for a feeling of safety, the refuge of the timid, and the means by which the bold prey upon the timid.

Dewey (1957), p. 237.

Dewey, J. (1957). Human nature and conduct. New York: Modern Library. Retrieved from https://archive.org/details/humannatureandco011182mbp