OpenAlex Domains
OpenAlex is a database of metadata about scholarly publishing that just had a beta release yesterday. It replaces the discontinued Microsoft Academic Graph (MAG), and is made available by the OurResearch project as a 500 GB dataset of tabular (TSV) files that appears to have been exported from an Amazon Redshift database.
500 GB is a lot to download. I suspect the size could be reduced significantly by compressing the data first. Fortunately OurResearch are planning to make the data available via an API, which should make it easier to work with for most tasks. But having the bulk data available is very useful for data integration, and for getting a picture of the dataset as a whole. It's a few years old now, but Herrmannova & Knoth (2016) give a pretty good analysis of the metadata fields used in the original MAG dataset, especially how they compare to similar sources.
I've never looked at MAG before, but after glancing at the list of tables I thought it could be interesting to take a quick look at the URLs, since the PaperUrls table is a bit more manageable at 44 GB, and can be fetched easily from AWS:
aws s3 cp --request-payer s3://openalex/data_dump_v1/2021-10-11/mag/PaperUrls.txt ~/Data/OpenAlex/PaperUrls.txt
The table has the following columns:
- PaperId
- SourceType
- SourceUrl
- LanguageCode
- UrlForLandingPage
- UrlForPdf
- HostType
- Version
- License
- RepositoryInstitution
- OaiPmhId
Just eyeballing the data, it appears that most columns are sparsely populated except for the first four. The original MAG dataset was built from Microsoft's crawl of the web, using machine learning techniques to extract the citation data (Wang et al., 2020). Of course the web is a big place, so I thought it could be interesting to see what domains are present in the data. These domains tell an indirect story about how Microsoft crawled the web, and provide a picture of academic publishing on the web.
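If you want to go beyond eyeballing, something like the sketch below can measure how sparsely populated each column actually is. It's a minimal example that samples rows from the TSV and reports the fraction of non-empty values per column; the column order follows the list above, and the sample size is an arbitrary choice to keep it fast.

```python
import csv
from collections import Counter

# Column names from the MAG PaperUrls schema, in the order listed above
COLUMNS = [
    "PaperId", "SourceType", "SourceUrl", "LanguageCode",
    "UrlForLandingPage", "UrlForPdf", "HostType", "Version",
    "License", "RepositoryInstitution", "OaiPmhId",
]

def column_fill_rates(path, limit=1_000_000):
    """Return the fraction of non-empty values per column in a sample of rows."""
    filled = Counter()
    total = 0
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            total += 1
            for name, value in zip(COLUMNS, row):
                if value:
                    filled[name] += 1
            if total >= limit:
                break
    return {name: filled[name] / total for name in COLUMNS}
```

Reading with `csv.reader` rather than slurping the whole file keeps memory flat, which matters for a 44 GB table.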
Once you download it, wc -l shows that there are 448,714,897 rows in PaperUrls.txt. Unless you are using Spark or something, you probably don't want to pull all that into memory. Over in this notebook I simply read in the data line by line, extracted the domain, and counted them. tldextract is pretty handy for getting the registered domain:
>>> import tldextract
>>> u = tldextract.extract('https://pubmed.ncbi.nlm.nih.gov/2195785/')
>>> print(u.registered_domain)
nih.gov
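The whole line-by-line counting pass can be sketched roughly as follows. This is a minimal version under a couple of assumptions: SourceUrl is the third tab-separated field, and the `extract` callable is pluggable. By default it takes the full hostname with the standard library's urllib.parse; to collapse subdomains like pubmed.ncbi.nlm.nih.gov down to nih.gov, pass in tldextract's registered-domain extraction as shown in the snippet above.

```python
from collections import Counter
from urllib.parse import urlparse

def count_domains(path, extract=None):
    """Tally domains from the SourceUrl column (third field) of PaperUrls.txt.

    `extract` maps a URL string to a domain string. The default uses the full
    hostname via urllib.parse; swap in tldextract to get registered domains.
    """
    if extract is None:
        extract = lambda url: urlparse(url).hostname or ""
    counts = Counter()
    with open(path) as f:
        for line in f:  # stream one row at a time; never load the full table
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 3 and cols[2]:
                domain = extract(cols[2])
                if domain:
                    counts[domain] += 1
    return counts
```

From there, `counts.most_common(25)` gives the top domains directly.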
This found 243,726 domains, of which the top 25 account for over half. Below is a chart of how these top 25 break down. You can click on "Other" to toggle it off/on to get more of a view.
I'm not sure if there are any big surprises here. The prominence of nih.gov and europepmc.org points to the significant influence of the biological sciences and government. It's also interesting to see harvard.edu edging out major publishers like Wiley, IEEE, Taylor & Francis, and Sage. The domain counts dataset is available here if you want to take a look yourself.