Back in 2018 I wrote a small Python program called étudier which scraped citation data out of Google Scholar and presented the network as a GEXF file for use in Gephi. It also wrote the network data out as an HTML file that included a very rudimentary D3 visualization.

At the time I left a note in the README saying that I was looking for ideas on how to improve the D3 visualization. A year later I actually got a pull request from Thomas Anderson who had used another D3 visualization to modify the one that was hard coded into étudier.

Thomas’ pull request had some good ideas, like zooming and text labels, which were missing from the first version. But it also seemed to be hacked together quickly to serve a particular need, so it took me some time to get around to disentangling a few things, pruning unused code, adding some additional features, and merging it in. A few too many years later I got around to finishing it and the new D3 visualization is now out in v0.1.0.

Once you pip install --upgrade etudier you can collect a citation network either from Google Scholar search results, or by examining the citations to a particular publication. For example here’s how I collected the network for Sherry Ortner’s Theory in Anthropology since the Sixties:

$ etudier ',21&hl=en'

I’ve embedded the resulting output.html here using an <iframe>.

A few things to note about the visualization:

  • The size of each node is relative to the total number of citations to the publication (not just the ones that are visible in the graph).
  • The color of each node indicates which cluster it belongs to (generated with networkx).
  • The directed arrow indicates which publication is citing which.
  • Hovering over a node reveals its title and the titles of other nodes it is immediately connected to.
  • Click and hold on a node to foreground just connected nodes.
  • Double-click on a node to open tab with the publication in it.
  • Drag nodes around to make the network easier to read.
  • Zoom in on the image to examine particular parts of the network.

This is actually a pretty simple graph, since I ran it with étudier’s defaults, which are to collect just one level of citations and just one page of results at each level. This means étudier will look at the 10 citations on the page you supply, and will click into each publication using the cited by link to find the first 10 citations that cite it. Effectively the defaults will pull in up to 100 citations.

It’s worth reflecting on how important Google’s relevance ranking is in shaping the visualization since only some of the citations can be crawled, and the ones that are ranked higher have a higher chance of getting picked up.

You can get more using the --depth and --pages command line options. Just be careful especially with --depth since it can exponentially increase the number of results. Here’s what the same Ortner graph looks like with --depth 2:

You probably will need to zoom out to see the whole thing. Obviously this is pushing the limits of what you can do with D3 without customizing things more. I think the GEXF and GraphML files will probably be most useful if you’ve collected a really large network and want to control how it looks and prune things.

Tucked away in the HTML file is a JSON representation of the network, which could be repurposed for other things I guess. If you get a chance to use étudier please let me know. It might be fun to create a little gallery in the repository. It also could be useful to create a tool that loads the citation data into Zotero, or adds it to WikiCite. But that’s for another day.