For #DayOfDH yesterday I created a quick video about some data normalization work I have been doing using Wikidata entities. I may write more about this work later, but the short version is that I have a bunch of spreadsheets containing author names in a variety of formats and transliterations, which I need to collapse into a single unique identifier per author so that I can provide a unified display of the data for each author. For example, my spreadsheets have information for Fyodor Dostoyevsky using the following variants:
- Dostoeieffsky, Feodor
- Dostoevski, F. M.
- Dostoevski, Fedor
- Dostoevski, Feodor Mikailovitch
- Dostoevsky, Fiodor Mihailovich
- Dostoevsky, Fyodor
- Dostoevsky, Fyodor Michailovitch
- Dostoieffsky, Feodor
- Dostoievski, Feodor Mikhailovitch
- Dostoievski, Feodore M.
- Dostoievski, Thedor Mikhailovitch
- Dostoievsky, Feodor Mikhailovitch
- Dostoievsky, Fyodor
- Dostojevski, Feodor
- Dostoyefsky, Theodor Mikhailovitch
- Dostoyevski, Feodor
- Dostoyevsky, Fyodor
- Dostoyevsky, F. M.
- Dostoyevsky, Feodor Michailovitch
- Dostoyevsky, Feodor Mikhailovich
So, obviously, I wanted to normalize these. But I also wanted to link each name to an identifier that could be useful for obtaining other information, such as an image of the author, a description of their work, and possibly links to works by the author. I’m going to try to map the authors to Wikidata, largely because Wikidata links out to other places like the Virtual International Authority File and Freebase, and because there are images on Wikimedia Commons and nice descriptive text for the people. As an example, here is the Wikidata page for Dostoyevsky.
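A minimal sketch of what collapsing the variants onto one identifier might look like once the matching is done. The lookup table, the row layout, and the function name are all hypothetical, and Q991 is used on the understanding that it is Dostoyevsky's Wikidata ID, so check it before relying on it.

```python
from collections import defaultdict

# Hypothetical lookup table: name variant -> Wikidata entity ID.
# Q991 is used here as Dostoyevsky's ID; verify it before relying on it.
VARIANT_TO_ENTITY = {
    "Dostoevski, F. M.": "Q991",
    "Dostoievsky, Fyodor": "Q991",
    "Dostoyevsky, Fyodor": "Q991",
}


def group_by_entity(rows, lookup):
    """Group spreadsheet rows by entity ID; unmatched names go under None."""
    grouped = defaultdict(list)
    for row in rows:
        name = row["author"].strip()
        grouped[lookup.get(name)].append(row)
    return dict(grouped)


# Made-up rows standing in for the spreadsheet data.
rows = [
    {"author": "Dostoevski, F. M.", "title": "Crime and Punishment"},
    {"author": "Dostoyevsky, Fyodor", "title": "The Idiot"},
    {"author": "Unknown, Author", "title": "Untitled"},
]
grouped = group_by_entity(rows, VARIANT_TO_ENTITY)
```

Keeping unmatched names under `None` makes it easy to see which variants still need a human decision.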
To aid in this process I created a very simple command line tool and library called wikidata_suggest, which uses Wikidata’s suggest API to interactively match a string of text to a Wikidata entity. If Wikidata doesn’t have any suggestions, the utility falls back to looking in a page of Google’s search results for a Wikipedia page, and will optionally let you use that text.
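For a rough idea of the lookup step, here is a sketch that queries Wikidata's `wbsearchentities` API action, which is assumed here to be the suggest API in question. It is not the wikidata_suggest code itself: the function names, the User-Agent string, and the choice to return simple tuples are illustrative, and the Google fallback is left out.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_suggest_url(text, language="en", limit=5):
    """Build a wbsearchentities query URL for a string of text."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_suggestions(payload):
    """Reduce the decoded JSON response to (id, label, description) tuples."""
    return [
        (item["id"], item.get("label", ""), item.get("description", ""))
        for item in payload.get("search", [])
    ]


def suggest(text):
    """Fetch suggestions over the network (hypothetical helper)."""
    request = urllib.request.Request(
        build_suggest_url(text),
        headers={"User-Agent": "wikidata-suggest-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_suggestions(json.load(response))
```

Calling `suggest("Dostoevski, Fedor")` would return a list of candidates to choose from interactively; an empty list is the point at which a fallback search would kick in.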
Soon after tweeting about the utility and the video I made about it, I heard from Alberto Accomazzi, who works on the NASA Astrophysics Data System and was interested in using wikidata_suggest to try to link up the Unified Astronomy Thesaurus to Wikidata.
Fortunately the UAT is made available as a SKOS RDF file. So I wrote a little proof-of-concept script named skos_wikidata.py that loads a SKOS file, walks through each skos:Concept and asks you to match the skos:prefLabel to Wikidata using wikidata_suggest. I also made a quick video of what this process looks like.
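The walk over the concepts can be sketched roughly as follows. This is not skos_wikidata.py itself: it assumes the SKOS file is serialized as RDF/XML with each skos:Concept written as a top-level element, uses only the standard library, and takes a hypothetical `matcher` callable in place of the interactive prompt. The sample concepts and example.org URIs are made up. A real RDF library such as rdflib would cope with more serializations.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pref_labels(rdf_xml, language="en"):
    """Yield (concept URI, prefLabel) for each top-level skos:Concept."""
    root = ET.fromstring(rdf_xml)
    for concept in root.iterfind("skos:Concept", NS):
        uri = concept.get(RDF_ABOUT)
        for label in concept.iterfind("skos:prefLabel", NS):
            # Labels without an xml:lang attribute are accepted as-is.
            if label.get(XML_LANG, language) == language:
                yield uri, (label.text or "").strip()


def match_concepts(rdf_xml, matcher):
    """Map concept URIs to whatever ID the matcher returns for the label.

    `matcher` is a hypothetical callable: label -> entity ID or None.
    """
    matches = {}
    for uri, label in iter_pref_labels(rdf_xml):
        entity_id = matcher(label)
        if entity_id:
            matches[uri] = entity_id
    return matches


# Made-up SKOS fragment for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/concept/1">
    <skos:prefLabel xml:lang="en">Black holes</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/concept/2">
    <skos:prefLabel xml:lang="en">Galaxies</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""
```

Passing the matcher in as a function keeps the traversal separate from the matching, so the same loop works with an interactive prompt, a cached lookup, or a test stub.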
I guess this is similar to what you might do in OpenRefine, but I wanted a bit more control over how the data was read in, modified and matched up. I’d be interested in your ideas on how to improve it if you have any.
It’s kind of funny how Day of Digital Humanities quickly morphed into Day of Astrophysics…