For #DayOfDH yesterday I created a quick video about some data normalization work I have been doing using Wikidata entities. I may write more about this work later, but the short version is that I have a bunch of spreadsheets with names in them (authors) in a variety of formats and transliterations, which I need to collapse into a unique identifier so that I can provide a unified display of the data per unique author. So for example, my spreadsheets have information for Fyodor Dostoyevsky using the following variants:
- Dostoeieffsky, Feodor
- Dostoevski, F. M.
- Dostoevski, Fedor
- Dostoevski, Feodor Mikailovitch
- Dostoevsky, Fiodor Mihailovich
- Dostoevsky, Fyodor
- Dostoevsky, Fyodor Michailovitch
- Dostoieffsky, Feodor
- Dostoievski, Feodor Mikhailovitch
- Dostoievski, Feodore M.
- Dostoievski, Thedor Mikhailovitch
- Dostoievsky, Feodor Mikhailovitch
- Dostoievsky, Fyodor
- Dostojevski, Feodor
- Dostoyefsky, Theodor Mikhailovitch
- Dostoyevski, Feodor
- Dostoyevsky, Fyodor
- Dostoyevsky, F. M.
- Dostoyevsky, Feodor Michailovitch
- Dostoyevsky, Feodor Mikhailovich
So, obviously, I wanted to normalize these. But I also wanted to link each name to an identifier that could be useful for obtaining other information, such as an image of the author, a description of their work, and possibly links to works by the author. I’m going to try to map the authors to Wikidata, largely because Wikidata links to other places like the Virtual International Authority File and Freebase, and because there are images on Wikimedia Commons and nice descriptive text for the people. As an example, here is the Wikidata page for Dostoyevsky.
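To make the goal concrete, here is a minimal Python sketch of collapsing name variants onto one identifier and grouping rows by it. The `author` and `title` column names and the sample rows are made up for illustration, and `Q991` is what I believe Dostoyevsky's Wikidata ID to be, so treat it as an assumption to verify.

```python
from collections import defaultdict

# Assumption: Q991 is the Wikidata ID for Dostoyevsky; check before relying on it.
VARIANT_TO_ENTITY = {
    "Dostoevsky, Fyodor": "Q991",
    "Dostoievski, Feodore M.": "Q991",
    "Dostoyevsky, F. M.": "Q991",
}


def group_by_entity(rows, variant_to_entity):
    """Group spreadsheet rows by entity ID; unmatched names land under None."""
    grouped = defaultdict(list)
    for row in rows:
        entity_id = variant_to_entity.get(row["author"].strip())
        grouped[entity_id].append(row)
    return dict(grouped)


# Hypothetical rows standing in for the spreadsheets.
rows = [
    {"author": "Dostoevsky, Fyodor", "title": "Crime and Punishment"},
    {"author": "Dostoyevsky, F. M.", "title": "The Idiot"},
    {"author": "Unknown, Author", "title": "Untitled"},
]
grouped = group_by_entity(rows, VARIANT_TO_ENTITY)
```

Rows that end up under `None` are the ones that still need to be matched by hand.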
To aid in this process I created a very simple command line tool and library called wikidata_suggest, which uses Wikidata’s suggest API to interactively match up a string of text to a Wikidata entity. If Wikidata doesn’t have any suggestions, the utility falls back to looking in a page of Google’s search results for a Wikipedia page, and will optionally let you use that text.
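For a rough idea of what asking Wikidata for suggestions can look like, here is a small standard-library sketch. It assumes the MediaWiki `wbsearchentities` action on wikidata.org; wikidata_suggest itself may well do this differently, and the function names and User-Agent string below are my own.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(text, language="en", limit=5):
    """Build a wbsearchentities query URL for the given text."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_candidates(payload):
    """Turn a decoded API response into (id, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def suggest(text):
    """Fetch candidates for text over the network (not run on import)."""
    request = urllib.request.Request(
        build_search_url(text),
        headers={"User-Agent": "name-matching-sketch/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_candidates(json.load(response))
```

Calling `suggest("Dostoevsky, Fyodor")` returns a list of candidates that a person can then accept or reject, which is the interactive part the tool handles.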
Soon after tweeting about the utility and the video I made about it, I heard from Alberto, who works on the NASA Astrophysics Data System and was interested in using wikidata_suggest to try to link up the Unified Astronomy Thesaurus to Wikidata.
Fortunately the UAT is made available as a SKOS RDF file. So I wrote a little proof of concept script named skos_wikidata.py that loads a SKOS file, walks through each skos:Concept and asks you to match the skos:prefLabel to Wikidata using wikidata_suggest. Here’s a quick video I made of what this process looks like:
I guess this is similar to what you might do in OpenRefine, but I wanted a bit more control over how the data was read in, modified and matched up. I’d be interested in your ideas on how to improve it if you have any.
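The walk over concepts can be sketched with nothing but the standard library. This is not skos_wikidata.py, just an illustration: it assumes the SKOS file is RDF/XML with explicit `skos:Concept` elements (files that use `rdf:Description` plus `rdf:type` would need a real RDF parser), and the sample concepts and example.org URIs are invented.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def concept_labels(rdf_xml):
    """Yield (concept URI, prefLabel) pairs from an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    about = "{%s}about" % NS["rdf"]
    for concept in root.iter("{%s}Concept" % NS["skos"]):
        label = concept.find("skos:prefLabel", NS)
        if label is not None and label.text:
            yield concept.get(about), label.text.strip()


# Invented sample data, not taken from the UAT.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/concept/1">
    <skos:prefLabel xml:lang="en">Black holes</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/concept/2">
    <skos:prefLabel xml:lang="en">Neutron stars</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

pairs = list(concept_labels(SAMPLE))
```

Each label in `pairs` is the string you would then hand to the matching step.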
It’s kind of funny how Day of Digital Humanities quickly morphed into Day of Astrophysics…
Here is how you can use your Google Search History and jq to create a top 10 list of the things you’ve googled for the most.
First request a download of your data from your Google Search History. Yeah, creepy. While you wait for the email from Google saying that your archive is ready, install jq. Then download the archive and unzip it, open a terminal window in the Searches directory, and run this:
jq --raw-output '.event[].query.query_text' *.json \
| sort | uniq -c | sort -rn | head -10
Here’s what I see for the 75,687 queries I’ve typed into Google since July 2005.
309 google analytics
130 hacker news
116 this is my jam
48 twitter api
44 google translate
These are (mostly) things that I hadn’t bothered to bookmark, but visited regularly. I suspect there is something more compelling and interesting that could be done with the data. A personal panopticon perhaps.
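If you don't have jq handy, the same tally can be sketched in Python. This assumes each JSON file in the archive holds an `event` list whose items carry `query.query_text`; that layout is my reading of the export, so adjust the keys if yours differs. The `top_queries` name is mine.

```python
import glob
import json
import os
from collections import Counter


def top_queries(directory, n=10):
    """Count query_text values across every *.json file in directory."""
    counts = Counter()
    for path in glob.glob(os.path.join(directory, "*.json")):
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumed layout: {"event": [{"query": {"query_text": "..."}}]}
        for event in data.get("event", []):
            text = event.get("query", {}).get("query_text")
            if text:
                counts[text] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    for text, count in top_queries("."):
        print(f"{count:6d} {text}")
```

Run it from inside the Searches directory to get the same kind of count-then-query listing.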
Oh, and I’d delete the archive from your Google Drive after you’ve downloaded it. If you ever grant other apps the ability to read from your drive they could read your search history. Actually maybe this whole exercise is fraught with peril. You should just ignore it.
Back in 1999 I was a relatively happy Emacs user, and was beginning work at a startup where I was one of the first employees after the founders. Like many startups, in addition to owning the company, the founders were hackers, and were routinely working on the servers. When I asked if Emacs could be installed on one of the machines I was told to learn Vi … which I proceeded to do. I needed the job.
Here I am 15 years later, and am finally starting to use Sublime Text 3 a bit more in my work. I may not be a cool kid anymore, but I can still pretend to be one, eh? The Vintageous plugin lets my fingers feel like they are in Vim, while being able to take advantage of other packages for editing Markdown, interacting with Git and the lovely eye-pleasing themes that are available. I still feel a bit dirty because unlike Vim, Sublime is not open source; but at the same time it does feel good to support a small software publisher who is doing good work. Maybe I’ll end up switching back to Vim and supporting it.
Anyway, as a Python developer one thing I immediately wanted to be able to do was to use my project’s VirtualEnv during development, and to run the test suite from inside Sublime. The Virtualenv package makes creating, activating, deactivating and deleting a virtualenv a snap. But I couldn’t seem to get the build to work properly with the virtualenv, even after setting the Build System to Python - Virtualenv.
After what felt like a lot of googling around (it was probably just 20 minutes), I discovered in the Project documentation that I could save my Project, then go to Project -> Edit Project and add a build_systems stanza like this:
"shell_cmd": "/Users/ed/.virtualenvs/curio/bin/python setup.py test"
Notice how the shell_cmd is using the Python executable in my VirtualEnv? After saving that I was able to go into Tools -> Build System and set the build system to Test, which matches the name of the build system you added in the JSON. Now a command-B will run my test suite with the VirtualEnv.
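To show the overall shape of the project settings, here is a small Python sketch that adds a build system named Test to a settings dictionary and prints the resulting JSON. The `folders` entry and the `add_test_build_system` helper are illustrative assumptions; only the `shell_cmd` value and the Test name come from the steps above. Real .sublime-project files can contain comments, so editing them by hand in Sublime is probably still the simpler route.

```python
import json


def add_test_build_system(project, python_path, name="Test"):
    """Return a copy of project with a build system that runs the test suite."""
    updated = dict(project)
    # Drop any existing build system with the same name, then append ours.
    systems = [
        system
        for system in project.get("build_systems", [])
        if system.get("name") != name
    ]
    systems.append({"name": name, "shell_cmd": f"{python_path} setup.py test"})
    updated["build_systems"] = systems
    return updated


# Hypothetical starting point for a project file.
project = {"folders": [{"path": "."}]}
updated = add_test_build_system(
    project, "/Users/ed/.virtualenvs/curio/bin/python"
)
print(json.dumps(updated, indent=2))
```

The printed `build_systems` list is what would sit alongside `folders` in the project file.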
I guess it would be nice if the VirtualEnv plugin for Sublime did something to make this easier. But rather than go down that rabbit hole I decided to write it down here for the benefit of my future self (and perhaps you).
If you know of a better way to do this please let me know.