OpenAlex is a database of metadata about scholarly publishing that just had a beta release yesterday. It replaces the discontinued Microsoft Academic Graph (MAG), and is made available by the OurResearch project as a 500 GB dataset of tabular (TSV) files, that appears to be exported from an Amazon Redshift database.
500 GB is a lot to download. I guess it could be significantly improved by compressing the data first. Fortunately OurResearch are planning on making the data available via an API, which should make it easier to work with for most tasks. But having the bulk data available is very useful for data integration, and getting a picture of the dataset as a whole. It’s a few years old now but Herrmannova & Knoth (2016) has a pretty good analysis of the metadata fields used in the original MAG dataset, especially how they compare to similar sources.
I’ve never looked at MAG before, but after glancing at the list of tables I thought it could be interesting to take a quick look at the URLs, since it’s a bit more manageable at 44 GB, and can be fetched easily from AWS:
aws s3 cp --request-payer s3://openalex/data_dump_v1/2021-10-11/mag/PaperUrls.txt ~/Data/OpenAlex/PaperUrls.txt
The table has the following columns:
Just eyeballing the data it appears that most columns are sparsely populated except for the first four. The original MAG dataset was built from Microsoft’s crawl of the web, which then used machine learning techniques to extract the citation data (Wang et al., 2020). Of course the web is a big place, so I thought it could be interesting to see what domains are present in the data. These domains tell an indirect story about how Microsoft crawled the web, and provide a picture of academic publishing on the web.
Once you download it
wc -l shows that there are
448,714,897 rows in PaperUrls.txt. Unless you are using Spark or
something you probably don’t want to pull all that into memory. Over in
notebook I simply read in the data line by line, extracted the
domain, and counted them. tldextract is pretty
handy for getting the registered domain:
>>> import tldextract >>> u = tldextract.extract('https://pubmed.ncbi.nlm.nih.gov/2195785/') >>> print(u.registered_domain) nih.gov
This found 243,726 domains, of which the top 25 account for over half. Below is a chart of how these top 25 break down. You can click on “Other” to toggle it off/on to get more of a view.
I’m not sure if there are any big surprises here. The prominence of nih.gov and europepmc.org point to the significant influence of biological sciences and government. It’s also interesting to see harvard.edu edging out major publishers like Wiley, IEEE, Taylor & Francis, and Sage. The domain counts dataset is available here if you want to take a look yourself.