License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: April 12, 2024
Last Modified: May 24, 2024
Protocol Integer ID: 100556
Funders Acknowledgement:
University of Bologna
Grant ID: https://ror.org/01111rn36
Abstract
We present a step-by-step methodology for tracking the coverage of the IRIS dataset in the OpenCitations Corpus.
The methodology filters the original IRIS dataset and the OpenCitations Meta and Index dumps, producing two novel datasets that are then queried to answer the five research questions.
Guidelines
To allow complete reproducibility of the protocol, links to the data used are provided here.
If you want to run the optional step of matching the id-less entities to Meta by their titles, we also recommend creating a .env file in the root of the folder and placing your OpenCitations API key in it (you can obtain one here), like so:
OC_APIKEY="<YOUR_API_KEY>"
Make sure you have Python 3.x installed on your computer.
To correctly execute the provided scripts, you must also install the required libraries listed in requirements.txt. You can do so by running the following command:
pip install -r requirements.txt
IRIS Dataset Preparation
Make sure that the IRIS dump is present in the work directory.
Download the IRIS dump and place it in a 'data/' folder in your work directory. There is no need to unzip the archive.
Dataset
UNIBO IRIS bibliographic data dump, dated 14 March 2024
This step will create a version of the OpenCitations Meta dump that is transformed and filtered according to the elements in the IRIS dump. This new dataset is stored in a parquet format that makes it lean and fast to query.
If you want to skip the creation of the dataset, you can download the final dataset here
The purpose of this step is to read each CSV file in the Meta dump and process it by applying the following operations:
1. Select the ['id', 'title', 'type'] columns.
2. Extract the omid, along with the doi, isbn and pmid if present, from the 'id' column through a regex pattern search. Each of these four identifiers is placed in its own new column.
3. Create a new 'id' column by combining the 'doi', 'isbn', and 'pmid' columns, preferring the first non-null value.
4. Drop the 'doi', 'isbn', and 'pmid' columns.
5. Remove null values from the new 'id' column.
6. Perform an inner join with the dois_isbns_pmids dataframe.
Note
The dois_isbns_pmids dataframe is created before the manipulation of the first Meta dump file.
It contains all valid DOI, ISBN and PMID present in the IRIS dump, along with their iris_id identifier.
7. Write the resulting dataframe to a .parquet file.
You can perform this step by using the following command:
After the program has finished processing all the files, an 'iris_in_meta' folder should have appeared in 'data/'.
The dataset should have the following shape:
15m
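The per-file transformation above can be sketched in plain Python. This is an illustration only: the real script presumably operates on dataframes, and the regex patterns used to unpack Meta's 'id' field are assumptions.

```python
import re

# Assumed patterns for the identifiers packed into Meta's 'id' field.
PATTERNS = {
    "omid": re.compile(r"omid:br/(\d+)"),
    "doi": re.compile(r"doi:(\S+)"),
    "isbn": re.compile(r"isbn:(\S+)"),
    "pmid": re.compile(r"pmid:(\d+)"),
}

def process_meta_rows(rows, dois_isbns_pmids):
    """Apply steps 1-7 to the rows of one Meta CSV file.

    rows: dicts with at least 'id', 'title' and 'type' keys.
    dois_isbns_pmids: dict mapping each DOI/ISBN/PMID in IRIS to its iris_id.
    """
    out = []
    for row in rows:
        # Step 2: extract omid, doi, isbn and pmid into their own fields.
        found = {name: (m.group(1) if (m := pat.search(row["id"])) else None)
                 for name, pat in PATTERNS.items()}
        # Step 3: the new 'id' is the first non-null of doi, isbn, pmid.
        new_id = found["doi"] or found["isbn"] or found["pmid"]
        if new_id is None:
            continue  # step 5: drop rows with no usable identifier
        if new_id not in dois_isbns_pmids:
            continue  # step 6: inner join with the IRIS identifiers
        out.append({"id": new_id, "title": row["title"], "type": row["type"],
                    "omid": found["omid"], "iris_id": dois_isbns_pmids[new_id]})
    return out  # step 7 would write this out as a .parquet file
```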
It is also possible to attempt to enrich the Iris in Meta dataset with the identifiers of the elements in the IRIS dump that have no DOI, ISBN, or PMID. This is done by querying the OpenCitations Meta SPARQL endpoint to search for each entity by its title.
In our tests, this optional step retrieved 150 additional entities.
This is an optional step. It has not been performed in the presented state of our research, as its results can vary and could lead to reproducibility inconsistencies.
We report it here only for the sake of completeness.
WARNING: this will take around 3 hours to complete.
3h
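As an illustration, a title lookup against the Meta SPARQL endpoint might look like the sketch below. The exact query shape, the use of dcterms:title, and the authorization header are assumptions; the real script may match titles differently.

```python
import json
import os
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://opencitations.net/meta/sparql"

def build_title_query(title: str) -> str:
    # Exact-title lookup; the real script may use a looser matching strategy.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/> "
        f'SELECT ?br WHERE {{ ?br dcterms:title "{escaped}" }} LIMIT 1'
    )

def find_br_by_title(title):
    params = urllib.parse.urlencode({"query": build_title_query(title)})
    req = urllib.request.Request(
        f"{SPARQL_ENDPOINT}?{params}",
        headers={"Accept": "application/sparql-results+json",
                 # API key read from the .env file described in the Guidelines
                 "authorization": os.getenv("OC_APIKEY", "")},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        bindings = json.load(resp)["results"]["bindings"]
    return bindings[0]["br"]["value"] if bindings else None
```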
Create iris_in_index dataset
2h 30m
This step will create a version of the OpenCitations Index dump that is transformed and filtered according to the elements in the Iris in Meta dataset. This new dataset is also stored in a parquet format.
If you want to skip the creation of the dataset, you can download the final dataset here
WARNING: this will take around 1.5 hours to complete.
Expected result
After the program has finished processing all the files, an 'iris_in_index' folder should have appeared in 'data/'.
The dataset should have the following shape:
2h 30m
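The filtering step can be sketched as follows. This is plain Python for illustration: the 'citing' and 'cited' column names follow the Index dump, and the real script presumably streams the dump files into dataframes.

```python
def filter_index_rows(rows, iris_omids):
    """Keep citations involving at least one IRIS publication.

    rows: dicts with 'citing' and 'cited' OMIDs from the Index dump.
    iris_omids: set of OMIDs present in the Iris in Meta dataset.
    """
    return [row for row in rows
            if row["citing"] in iris_omids or row["cited"] in iris_omids]
```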
Research Question answering
Research Question answering
Each substep below explains how one of the research questions mentioned in the abstract of this protocol is answered.
You can run the code for a specific RQ by specifying it in the command used to run the script. You can also answer all research questions at once by not passing a specific one to the script, like so:
Command
python3 answer_research_questions.py
RQ1: What is the coverage of the publications available in IRIS, that strictly concern research conducted within the University of Bologna, in OpenCitations Meta?
This research question is answered by simply computing the length of the Iris in Meta dataframe.
You can run the code to answer this research question using the following command:
Command
python3 answer_research_questions.py -rq 1
Expected result
117764
RQ2: What are the types of publications that are better covered in the portion of OpenCitations Meta covered by IRIS?
This research question is answered by grouping the Iris in Meta dataframe by the 'type' column and counting the rows in each group.
You can run the code to answer this research question using the following command:
Command
python3 answer_research_questions.py -rq 2
Expected result
│ journal article ┆ 104539 │
│ proceedings article ┆ 5608 │
│ book chapter ┆ 4482 │
│ book ┆ 1482 │
│ no type ┆ 1450 │
│ … ┆ … │
│ dataset ┆ 6 │
│ dissertation ┆ 2 │
│ series ┆ 1 │
│ computer program ┆ 1 │
│ book series ┆ 1 │
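The grouping can be sketched with a simple counter. This is plain Python for illustration; mapping missing types to 'no type' is an assumption based on the output above.

```python
from collections import Counter

def count_by_type(rows):
    # Group the Iris in Meta rows by 'type' and count, most frequent first.
    return Counter(row.get("type") or "no type" for row in rows).most_common()
```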
RQ3: What is the amount of citations (according to OpenCitations Index) the IRIS publications included in OpenCitations Meta are involved in (as citing entity and as cited entity)?
This research question is answered by simply computing the length of the Iris in Index dataframe.
You can run the code to answer this research question using the following command:
Command
python3 answer_research_questions.py -rq 3
Expected result
7859226
RQ4: How many of these citations come from and go to publications that are not included in IRIS?
This research question is answered by filtering the 'citing' and 'cited' columns of the Iris in Index dataset, removing from each all rows that contain elements of the aforementioned omids_list. The lengths of the two resulting dataframes then give the final answer.
You can run the code to answer this research question using the following command:
Command
python3 answer_research_questions.py -rq 4
Expected result
│ citing  ┆ cited   │
╞═════════╪═════════╡
│ 3562668 ┆ 3950823 │
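The RQ4 logic can be sketched like so (plain Python for illustration; omids_list is assumed to hold the OMIDs of the Iris in Meta entities):

```python
def count_external_citations(rows, omids_list):
    """Count citations coming from / going to publications outside IRIS."""
    omids = set(omids_list)  # set membership tests are O(1)
    citing = sum(1 for row in rows if row["citing"] not in omids)
    cited = sum(1 for row in rows if row["cited"] not in omids)
    return {"citing": citing, "cited": cited}
```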
RQ5: How many of these citations involve publications in IRIS as both citing and cited entities?
This research question is answered by filtering the Iris in Index dataset to keep only the rows in which elements from the aforementioned omids_list are present in both the 'citing' and the 'cited' columns. The length of the resulting dataframe is then computed to get the final answer.
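Following the RQ's "both citing and cited" condition, the filtering can be sketched as (plain Python for illustration, with omids_list again assumed to hold the Iris in Meta OMIDs):

```python
def count_internal_citations(rows, omids_list):
    """Count citations whose citing AND cited entities are both in IRIS."""
    omids = set(omids_list)
    return sum(1 for row in rows
               if row["citing"] in omids and row["cited"] in omids)
```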
You can run the code to answer this research question using the following command: