Étudier

TL;DR - étudier is a command line utility for saving an article’s network of citing literature from Google Scholar as a GEXF file for viewing in Gephi, and as a (currently very bare bones) D3 network visualization for viewing in a web browser.

Periodically I discover a new piece of research I really like, and I want to know who else is citing it to see how it fits into the larger research landscape. Ever since Eugene Garfield dreamed up the citation index, we’ve been able to trace citation links forwards, to see who is citing an article. Of course a side effect of this has been the contentious field of scientometrics, in which the impact of research is purportedly measured. Fifty years later Google Scholar offers this same functionality by scraping scholarly publication metadata from publisher and repository websites, and building a database that lets you see what research is citing something with the click of a link.

It’s kind of poignant that Google provide this functionality because their own PageRank algorithm for ranking web search results was directly influenced by Garfield, who died last year.

Got Data?

I recently went looking around for techniques to collect this network of close relationships around an article from Google Scholar. Ideally Scholar would have an API that provided structured metadata for research publications. But Google does not make an API available, probably because of the privileged access it receives from publishers like Elsevier, who have a vested interest in there not being a Google Scholar API.

I briefly looked at the Web of Science API, which has an endpoint for looking up citing articles. However, it appears that access to the API is not guaranteed even if you are lucky enough to belong to an institution that pays for access to Web of Science. Furthermore, API access is tiered (lite or premium) and tracing the links between articles requires premium access. Lastly, the endpoint uses SOAP, which is a little bit painful to use. I can understand why full access might be limited to business partners, but I was really just interested in looking at the immediate network of relationships around an article, not in getting some global picture of citations.

Then I ran across Jimmy Tidey’s excellent Scraping Google Scholar to write your PhD literature chapter, which describes his tool Bibnet, which does exactly what I wanted: it collects the data around a particular publication from Google Scholar. I did manage to get it installed locally, but couldn’t quite get it to work. It involves running a local web application in combination with a Chrome extension. Tidey also made the application available as a service at whocites.com, but I kind of got cold feet after seeing the invalid SSL certificate. Also the application seemed to do a lot more than I wanted: it has a backend database where citations are stored. I just wanted to get the network of citations out and move on to visualizing them and doing the reading.

Étudier

So, somewhat predictably, I decided to write my own tool. étudier is a command line utility written in Python that uses Selenium and requests-html to automatically drive a browser to collect a citation graph around a particular Scholar citation. The resulting network is written out as a Gephi file and a D3 visualization using networkx. The D3 visualization could use some work, so if you add style to it please submit a pull request.
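As a rough illustration of that last step, here is a minimal sketch of how a citation graph might be assembled with networkx and saved as GEXF. It is not étudier's actual code: the `build_graph` helper, the shape of the publication records, and the edge direction (the citing publication points at the cited one) are all assumptions made for the example.

```python
import networkx as nx


def build_graph(publications, citations):
    """Build a directed citation graph.

    publications: iterable of dicts with at least an "id" key (assumed shape)
    citations: iterable of (citing_id, cited_id) pairs (assumed direction)
    """
    graph = nx.DiGraph()
    for pub in publications:
        # Every field other than the id becomes a node attribute
        attrs = {key: value for key, value in pub.items() if key != "id"}
        graph.add_node(pub["id"], **attrs)
    for citing_id, cited_id in citations:
        graph.add_edge(citing_id, cited_id)
    return graph


if __name__ == "__main__":
    # Placeholder records, not real Scholar data
    pubs = [
        {"id": "a", "title": "Placeholder cited article"},
        {"id": "b", "title": "Placeholder citing article"},
    ]
    graph = build_graph(pubs, [("b", "a")])
    nx.write_gexf(graph, "output.gexf")  # file that Gephi can open
```

With edges pointing from citing to cited, a publication's in-degree is the number of collected publications that cite it, which is handy later when filtering in Gephi.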

To use etudier.py you first need to find a citation you are interested in on Google Scholar, and click on its Cited By link. For example here is the Cited By URL for Sherry Ortner’s influential Theory in Anthropology since the Sixties:

https://scholar.google.com/scholar?cluster=17950649785549691519&hl=en&as_sdt=20000005&sciodt=0,21
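These URLs are long and easy to truncate when copying. A small sketch, using only Python's standard library, for pulling the query parameters apart to check that the `cluster` value survived the copy and paste. The `cluster_id` helper is a made-up name for this example and is not part of étudier.

```python
from urllib.parse import urlparse, parse_qs


def cluster_id(cited_by_url):
    """Return the value of the `cluster` query parameter, or None if absent."""
    query = parse_qs(urlparse(cited_by_url).query)
    values = query.get("cluster")
    return values[0] if values else None


url = ("https://scholar.google.com/scholar?cluster=17950649785549691519"
       "&hl=en&as_sdt=20000005&sciodt=0,21")
print(cluster_id(url))  # 17950649785549691519
```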

With this URL in hand you can run etudier.py:

etudier.py 'https://scholar.google.com/scholar?cluster=17950649785549691519&hl=en&as_sdt=20000005&sciodt=0,21'

This will collect the ten citations on the page, and then examine each one to see what cites them. If you want you can collect more than the first page using the --pages option:

etudier.py --pages 2 'https://scholar.google.com/scholar?cluster=17950649785549691519&hl=en&as_sdt=20000005&sciodt=0,21'

And you can also collect the research that cites the research that cites your article by using the --depth option:

etudier.py --depth 2 'https://scholar.google.com/scholar?cluster=17950649785549691519&hl=en&as_sdt=20000005&sciodt=0,21'

It’s unlikely you’ll want to use a --depth greater than 2, because even examining only the first page of results will build a network of up to 1000 citations, which starts to get difficult to visualize.
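To get a feel for how quickly this grows, here is a back-of-the-envelope sketch. It assumes ten results per page, that every page is full, and that no publication shows up twice. It counts "rounds" of following Cited By links, starting with the page you pass in. How a given --depth value maps onto rounds is not something this sketch tries to settle, and `max_network_size` is a name made up for the example.

```python
def max_network_size(rounds, pages=1, per_page=10):
    """Upper bound on publications found after `rounds` rounds of
    following Cited By links, assuming full pages and no overlap."""
    fanout = pages * per_page
    return sum(fanout ** level for level in range(1, rounds + 1))


for rounds in (1, 2, 3):
    print(rounds, max_network_size(rounds))
# 1 10
# 2 110
# 3 1110
```

The third round alone can add up to 1000 publications. Real networks come out smaller because many publications cite each other and some have fewer than ten citing works.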

If you are wondering why it uses a non-headless browser, it’s because Google is quite protective of this data and will routinely ask you to solve a CAPTCHA (identifying street signs, cars, etc. in photos). étudier will allow you to complete these tasks when they occur and then will continue on its way collecting data.

D3

The network is written out as an HTML file that uses D3 to visualize the data. Here you can see a network generated for the Ortner article I mentioned above.

I’m hovering over an article that is at the center of a cluster of citations so I can see its title: Situated Learning: Legitimate peripheral participation by Lave and Wenger. Clicking on the nodes should open a new page and bring your browser to what Google Scholar thinks is the webpage for the publication.

Gephi

Hopefully the D3 visualization can be improved a bit, but given that the resulting networks can be quite different, it might be difficult to find a one-size-fits-all solution. So the data is also saved as a GEXF file that can be loaded into Gephi, where it can be massaged. Here is a visualization I made of 693 publications that I collected for the Ortner article with --depth 2.

When I opened the Gephi file it looked like a giant hairball, which is not unusual. Describing how to use Gephi is a bit beyond the scope of this post, but there are lots of videos on YouTube for working with Gephi. I bookmarked a few that I found particularly useful.

In the example above I filtered the network to only include articles that had 10 or more citations (in-degree >= 10). I then ran community detection, colored the nodes based on their community, and applied a Force Atlas layout to arrange the nodes. Finally I made the node size relative to the number of inbound citations. With a little finagling you can hover over the nodes to see what they are.
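If you would rather trim the network before it ever reaches Gephi, the in-degree filter is easy to approximate in plain Python. This is only a sketch: it assumes the edges are available as (citing, cited) pairs, and `filter_by_in_degree` is a hypothetical helper, not something étudier provides. Gephi's own filter may treat edge cases differently.

```python
from collections import Counter


def filter_by_in_degree(edges, minimum=10):
    """Keep nodes cited at least `minimum` times, plus the edges between them.

    edges: iterable of (citing_id, cited_id) pairs (assumed direction)
    """
    edges = list(edges)
    # In-degree is the number of times a node appears as the cited end of an edge
    in_degree = Counter(cited for _, cited in edges)
    keep = {node for node, count in in_degree.items() if count >= minimum}
    kept_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return keep, kept_edges
```

Note that the in-degree is computed on the full network before filtering, so a kept node may end up with fewer than `minimum` inbound edges in the trimmed graph.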

Search Results

You can also visualize generic search results. For example, I ran this search for cscw memory to find papers from CSCW that mention the word memory. I collected the first three pages of results and generated this graph of citations, which lets me see the clusters of results:

etudier.py --pages 3 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=cscw+memory&btnG='

This particular graph was constructed in a similar way to the Gephi graph above except that the nodes are weighted by the total number of times they are cited in the literature. In theory these little snapshots can help guide reading when investigating a new domain.
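If you want to script several of these searches, the search URL can be assembled with Python's standard library rather than copied from the browser. This sketch simply reproduces the parameters seen in the URL above. What `as_sdt` and `btnG` mean to Google Scholar is not documented here, so treat them as opaque values, and note that `scholar_search_url` is a name invented for the example.

```python
from urllib.parse import urlencode


def scholar_search_url(query, hl="en", as_sdt="0,21"):
    """Build a Google Scholar search URL with the same parameters as above."""
    params = {"hl": hl, "as_sdt": as_sdt, "q": query, "btnG": ""}
    return "https://scholar.google.com/scholar?" + urlencode(params)


print(scholar_search_url("cscw memory"))
# https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=cscw+memory&btnG=
```

Remember to quote the resulting URL when passing it to etudier.py on the command line, since the shell treats & specially.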

The Metadata

If you peek inside output.html you’ll see the JSON metadata, which is also present in the GEXF file. Here’s an example:

{
    "id": "17379254502391608120",
    "url": "https://dl.acm.org/citation.cfm?id=358992",
    "title": "FieldWise: a mobile knowledge management architecture",
    "authors": "H Fagrell, K Forsberg, J Sanneblad",
    "year": "2000",
    "cited_by": 102,
    "cited_by_url": "https://scholar.google.com/scholar?cites=17379254502391608120&as_sdt=20000005&sciodt=0,21&hl=en"
}

You can use this metadata in the visualization, as I did above when I wanted to vary the size of the node in Gephi using the number of times a publication was cited: cited_by.
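The same field is useful outside Gephi too. As a sketch, assuming you have already extracted the records into a JSON list (how you get them out of output.html is left open here), you could rank publications by cited_by to decide what to read first. `most_cited` is a hypothetical helper, and only the first record below comes from the example above. The other two are made-up placeholders so there is something to sort.

```python
import json


def most_cited(records, limit=5):
    """Return up to `limit` records, most cited first.

    Records without a cited_by value are treated as having zero citations.
    """
    ranked = sorted(records, key=lambda r: r.get("cited_by", 0), reverse=True)
    return ranked[:limit]


records = json.loads("""
[
  {"id": "17379254502391608120",
   "title": "FieldWise: a mobile knowledge management architecture",
   "cited_by": 102},
  {"id": "placeholder-1", "title": "Placeholder A", "cited_by": 7},
  {"id": "placeholder-2", "title": "Placeholder B"}
]
""")

for record in most_cited(records, limit=2):
    print(record["cited_by"], record["title"])
# 102 FieldWise: a mobile knowledge management architecture
# 7 Placeholder A
```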

So, if you are looking for some help doing a literature review and have a chance to try out étudier, I’d be interested to hear how it works for you. As with many scraping applications it is quite brittle, and if Google changes its HTML markup it’s liable to break. So please file an issue on GitHub if you notice that this has happened.