Last week’s landmark ruling from the Supreme Court on same-sex marriage was routinely published on the Web as a PDF. Given the history of URL use in Supreme Court opinions, I thought I would take a quick look to see what URLs were present. There are two, both in Justice Alito’s dissenting opinion, and one is broken … just four days after the PDF was published. You can see it yourself at the bottom of page 100 in the PDF.
If you point your browser at http://www.cdc.gov/nchs/data/databrief/db18.pdf you will get a page not found error.
Sadly even the Internet Archive doesn’t have a snapshot of the page available.
But notice that it still thinks it can get a copy. That’s because the Centers for Disease Control’s website is responding with a 200 OK instead of a 404 Not Found:
zen:~ ed$ curl -I http://www.cdc.gov/nchs/data/databrief/db18.pdf
HTTP/1.1 200 OK
Date: Tue, 30 Jun 2015 16:22:18 GMT
At any rate, it’s not the Internet Archive’s fault that they haven’t archived the web page originally published in 2009, because the URL contains a typo. Instead it should be
which leads to:
So between the broken URL and the 200 OK for something not found we’ve got issues of link rot and reference rot all rolled up into a one-character typo. Sigh.
I think a couple of lessons for web publishers can be distilled from this little story:
- when publishing on the Web include link checking as part of your editorial process
- if you are going to publish links on the Web use a format that’s easy to check … like HTML.
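Both lessons can be sketched in a few lines of Python. This is only a rough illustration, not a tool used for this post: `extract_links`, `looks_like_soft_404` and `check` are hypothetical names, and the soft-404 heuristic (a `.pdf` URL that answers 200 with a non-PDF content type) is an assumption, since the curl output above doesn't show the Content-Type header.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import urllib.error
import urllib.request


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def extract_links(html):
    """Return the absolute links found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def looks_like_soft_404(url, status, content_type):
    """Assumed heuristic: a .pdf URL that returns 200 but not a PDF is suspicious."""
    wants_pdf = urlparse(url).path.lower().endswith(".pdf")
    return status == 200 and wants_pdf and "pdf" not in (content_type or "").lower()


def check(url, timeout=10):
    """Send a HEAD request and return (status, content_type).

    HTTP error statuses such as 404 are returned rather than raised;
    network failures (DNS, timeouts) still raise URLError.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as error:
        return error.code, error.headers.get("Content-Type")
```

Running `check` over every URL returned by `extract_links` and flagging anything that isn't a 200, or that trips `looks_like_soft_404`, would be one way to fold link checking into an editorial process.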
For #DayOfDH yesterday I created a quick video about some data normalization work I have been doing using Wikidata entities. I may write more about this work later, but the short version is that I have a bunch of spreadsheets with names in them (authors) in a variety of formats and transliterations, which I need to collapse into a unique identifier so that I can provide a unified display of the data per unique author. So for example, my spreadsheets have information for Fyodor Dostoyevsky using the following variants:
- Dostoeieffsky, Feodor
- Dostoevski, F. M.
- Dostoevski, Fedor
- Dostoevski, Feodor Mikailovitch
- Dostoevsky, Fiodor Mihailovich
- Dostoevsky, Fyodor
- Dostoevsky, Fyodor Michailovitch
- Dostoieffsky, Feodor
- Dostoievski, Feodor Mikhailovitch
- Dostoievski, Feodore M.
- Dostoievski, Thedor Mikhailovitch
- Dostoievsky, Feodor Mikhailovitch
- Dostoievsky, Fyodor
- Dostojevski, Feodor
- Dostoyefsky, Theodor Mikhailovitch
- Dostoyevski, Feodor
- Dostoyevsky, Fyodor
- Dostoyevsky, F. M.
- Dostoyevsky, Feodor Michailovitch
- Dostoyevsky, Feodor Mikhailovich
So, obviously, I want to normalize these. But I also want to link the name up to an identifier that could be useful for obtaining other information, such as an image of the author, a description of their work, possibly links to works by the author, etc. I’m going to try to map the authors to Wikidata, largely because there are links from Wikidata to other places like the Virtual International Authority File and Freebase, but there are also images on Wikimedia Commons, and nice descriptive text for the people. As an example here is the Wikidata page for Dostoyevsky.
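Once each variant is matched to an identifier, collapsing the spreadsheet rows is mostly a dictionary lookup. Here is a minimal sketch under some assumptions: the rows and titles are made up for illustration, `group_by_entity` is a hypothetical helper, and `Q991` is, as far as I know, the Wikidata identifier for Dostoyevsky.

```python
from collections import defaultdict

# Illustrative subset of a variant -> Wikidata ID lookup (built by hand here;
# Q991 is believed to be Dostoyevsky's Wikidata ID).
VARIANT_TO_ENTITY = {
    "Dostoevski, F. M.": "Q991",
    "Dostoevsky, Fyodor": "Q991",
    "Dostoyevsky, Fyodor": "Q991",
}


def group_by_entity(rows, lookup):
    """Group (author_string, title) rows by entity ID.

    Rows whose author string has no match are collected under the key None,
    so they can be reviewed by hand later.
    """
    grouped = defaultdict(list)
    for author, title in rows:
        grouped[lookup.get(author.strip())].append(title)
    return dict(grouped)
```

Keeping the unmatched rows under `None` instead of dropping them makes it easy to see which name variants still need a decision.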
To aid in this process I created a very simple command line tool and library called wikidata_suggest, which uses Wikidata’s suggest API to interactively match up a string of text to a Wikidata entity. If Wikidata doesn’t have any suggestions, the utility falls back to looking in a page of Google’s search results for a Wikipedia page, and will then optionally let you use that text.
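To give a feel for the lookup, here is a rough sketch of querying Wikidata for suggestions. It assumes the "suggest API" is the `wbsearchentities` action of the MediaWiki API at wikidata.org; it is not the actual wikidata_suggest code, and the function names and User-Agent string are made up for the example.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"


def build_search_url(text, language="en", limit=5):
    """Build a wbsearchentities request URL for the given search text."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def parse_suggestions(payload):
    """Turn a decoded API response into (id, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def suggest(text):
    """Fetch suggestions over the network (requires internet access)."""
    request = urllib.request.Request(
        build_search_url(text),
        headers={"User-Agent": "wikidata-suggest-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request) as response:
        return parse_suggestions(json.load(response))
```

An interactive tool would print the tuples from `suggest("Dostoevsky, Fyodor")` as a numbered menu and let you pick one, or none.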
Soon after tweeting about the utility and the video I made about it, I heard from Alberto, who works on the NASA Astrophysics Data System and was interested in using wikidata_suggest to try to link up the Unified Astronomy Thesaurus to Wikidata.
Fortunately the UAT is made available as a SKOS RDF file. So I wrote a little proof of concept script named skos_wikidata.py that loads a SKOS file, walks through each skos:Concept and asks you to match the skos:prefLabel to Wikidata using wikidata_suggest. Here’s a quick video I made of what this process looks like:
I guess this is similar to what you might do in OpenRefine, but I wanted a bit more control over how the data was read in, modified and matched up. I’d be interested in your ideas on how to improve it if you have any.
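The walk over concepts can be approximated with nothing but the standard library. This is a simplified sketch, not skos_wikidata.py itself: it assumes the SKOS file is RDF/XML with each concept written as a `skos:Concept` element carrying an `rdf:about` attribute, which not every serialization does (a real RDF library would be more robust). `pref_labels` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def pref_labels(rdf_xml):
    """Yield (concept_uri, preferred_label) pairs from an RDF/XML string.

    Only the first skos:prefLabel of each concept is used; concepts
    without a label are skipped.
    """
    root = ET.fromstring(rdf_xml)
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        label = concept.find(f"{{{SKOS}}}prefLabel")
        if label is not None and label.text:
            yield concept.get(f"{{{RDF}}}about"), label.text.strip()
```

Each label yielded here is what would be handed to the matching step, with the concept URI kept alongside so the chosen Wikidata entity can be written back to the right concept.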
It’s kind of funny how Day of Digital Humanities quickly morphed into Day of Astrophysics…
Here is how you can use your Google Search History and jq to create a top 10 list of the things you’ve googled for the most.
First download your data from your Google Search History. Yeah, creepy. Then install jq. Wait for the email from Google that your archive is ready, then download and unzip it. Open a terminal window in the Searches directory, and run this:
jq --raw-output '.event[].query.query_text' *.json \
  | sort | uniq -c | sort -rn | head -10
Here’s what I see for the 75,687 queries I’ve typed into Google since July 2005.
309 google analytics
130 hacker news
116 this is my jam
48 twitter api
44 google translate
These are (mostly) things that I hadn’t bothered to bookmark, but visited regularly. I suspect there is something more compelling and interesting that could be done with the data. A personal panopticon perhaps.
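If you want to poke at the data beyond a top 10, the same count can be sketched in Python. This assumes each JSON file holds a top-level `event` list whose items each have a `query.query_text` string; the function names are made up for the example.

```python
import json
from collections import Counter


def queries(paths):
    """Yield every query_text found in the given search history files."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumed layout: {"event": [{"query": {"query_text": "..."}}, ...]}
        for event in data.get("event", []):
            text = event.get("query", {}).get("query_text")
            if text:
                yield text


def top_queries(paths, n=10):
    """Return the n most frequent queries as (query, count) pairs."""
    return Counter(queries(paths)).most_common(n)
```

Calling `top_queries(glob.glob("*.json"))` from the Searches directory (after `import glob`) should give roughly what the jq pipeline prints, and the `Counter` is a handy starting point for slicing the data by other fields.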
Oh, and I’d delete the archive from your Google Drive after you’ve downloaded it. If you ever grant other apps the ability to read from your drive they could read your search history. Actually maybe this whole exercise is fraught with peril. You should just ignore it.