Dec 14, 2022

Coverage of DOAJ journals' citations through OpenCitations - Protocol V.5

  • University of Bologna
Protocol Citation: Constance Dami, Alessandro Bertozzi, Chiara Manca, Umut Kucuk 2022. Coverage of DOAJ journals' citations through OpenCitations - Protocol. protocols.io https://dx.doi.org/10.17504/protocols.io.n92ldz598v5b/v5. Version created by Constance Dami
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: July 13, 2022
Last Modified: December 14, 2022
Protocol Integer ID: 66635
Keywords: citations, OpenCitations, DOAJ, open access, journals, open science
Disclaimer
This protocol refers to research carried out for the 2021/22 Open Science course of the University of Bologna.
Abstract
This is the protocol for our research on the coverage of DOAJ journals' citations through OpenCitations.

Our goal is to find out:
  • the coverage of articles from open access journals in DOAJ as citing and cited articles,
  • how many citations DOAJ journals receive and make, and how many of these citations involve open access articles as both citing and cited entities,
  • whether there are trends over time in the availability of citations involving articles published in DOAJ journals.

Our research focuses exclusively on DOAJ journals, using OpenCitations as a tool. Previous research has been carried out on open citations using COCI (Heibi, Peroni & Shotton 2019) and on DOAJ journals' citations (Saadat and Shabani 2012), paving the ground for our present analysis.

After careful consideration of the best way to retrieve data from DOAJ and OpenCitations, we opted for downloading the public data dumps. Using the API led to prohibitively long running times, and the same problem arose with the SPARQL endpoint of OpenCitations.


Minimal Bibliography

Björk, B.-C.; Kanto-Karvonen, S.; Harviainen, J.T. "How Frequently Are Articles in Predatory Open Access Journals Cited." Publications, 8, 17. (2020) https://doi.org/10.3390/publications8020017

Heibi, I.; Peroni, S.; Shotton, D. "Crowdsourcing open citations with CROCI -- An analysis of the current status of open citations, and a proposal" arXiv:1902.02534 (2019) https://doi.org/10.48550/arXiv.1902.02534

Pandita, R., & Singh, S. "A Study of Distribution and Growth of Open Access Research Journals Across the World." Publishing Research Quarterly, 38(1) (2022): 131–149. https://doi.org/10.1007/s12109-022-09860-x

Saadat, R., A. Shabani. "Investigating the citations received by journals of Directory of Open Access Journals from ISI Web of Science’s articles." International Journal of Information Science and Management (IJISM) 9.1 (2012): 57-74.

Solomon, D. J., Laakso, M., Björk, B.-C. "A longitudinal comparison of citation rates and growth among open access journals", Journal of Informetrics, 7, 3 (2013): 642-650. https://doi.org/10.1016/j.joi.2013.03.008.




Materials
This protocol uses the following Python libraries: tarfile, pandas, json, pickle, datetime, zipfile, and plotly.

The GitHub repository for our research software, including all the Python code mentioned in this protocol, is available here.

We used the data dump of DOAJ articles of May 01, 2022 and the data dump of DOAJ journals of May 07, 2022. The most recent ones can be found on the DOAJ website.

For Open Citations data, we used the COCI dump of March 2022. This dump, as well as the most recent one, is available on the OpenCitations website.
Before start
Make sure to have Python 3.9 installed on your device.
All the dependencies of the script can be installed using the requirements.txt file stored in the GitHub repository.

Computer technical specifications:
CPU: Intel(R) Core(TM) i7-9750H @ 2.60 GHz
RAM: 20.0 GB (19.9 GB usable) @ 2666 MHz

Data Gathering: DOAJ
Collecting data from DOAJ: we download data about journals and articles from the DOAJ website, then refine it by excluding all the information we are not interested in.
Expected result
  • doi.json, a dictionary with each journal as key and, as value, some information including the list of all the DOIs of the articles published in that journal,
  • articles_without_dois.json, containing all the articles excluded from our research because they lack a DOI,
  • dois_articles_journals.pickle, a dictionary with each article DOI as key and the unique identifier of the journal publishing it as value.


We download the data dumps from DOAJ in .tar.gz format.
Dataset
DOAJ articles public data dump

Dataset
DOAJ journals public data dump
Both datasets contain metadata that is not useful for our research, so we need to filter only the necessary data.
From the DOAJ journals dump, we create a unique key for each journal by concatenating its issn and eissn, having as values: the issn (if present), the eissn (if present), the title of the journal, the subject of the journal, and the list of all the articles' DOIs.

After opening the tarfile containing the data, for every journal, we extract only the information about issn and eissn, first making sure that there is always at least one of the two for each record in the dump:
for journal in p:
    try:
        if journal["bibjson"]["pissn"]:
            journal_issn = journal["bibjson"]["pissn"]
    except KeyError:
        journal_issn = ""
    try:
        if journal["bibjson"]["eissn"]:
            journal_eissn = journal["bibjson"]["eissn"]
    except KeyError:
        journal_eissn = ""
We then add to the set of journals our unique identifier "issn+eissn"
    key_dict = f"{journal_issn}{journal_eissn}"
    journals.add(key_dict)

We extract the data of the articles: we open the file with tarfile, then for each article, we collect the information about issn and eissn of the journal publishing it, as well as the DOI of the article:

for article in p:
    for el in article["bibjson"]["identifier"]:
        if el["type"] == "pissn":
            journal_issn = el["id"]
        if el["type"] == "eissn":
            journal_eissn = el["id"]
        if el["type"] == "doi" or el["type"] == "DOI":
            try:
                art_doi = el["id"]
            except KeyError:
                art_doi = ""

If the article doesn't have any DOI registered, we add it to a list that we will store separately.


Otherwise, we handle cases where the issn and eissn have been wrongly registered in the articles dump by aligning the data with the set of journals created previously.

if art_doi == "":
    art_without_doi.append(article)
else:
    journal_title = article["bibjson"]["journal"]["title"]

    key_dict = f"{journal_issn}{journal_eissn}"

    # if the issn and/or eissn from the articles dump don't match the journals dump
    if key_dict not in journals:

        # if there is only the issn registered: align with the journals metadata
        if journal_issn in journals:
            key_dict = journal_issn
        # if there is only the eissn registered: align with the journals metadata
        elif journal_eissn in journals:
            key_dict = journal_eissn
        else:
            for issn in journals:
                if journal_issn != "" and journal_issn in issn:
                    key_dict = issn
                    break
                elif journal_eissn != "" and journal_eissn in issn:
                    key_dict = issn
                    break
We collect the subject of the journal.
journal_subject = article["bibjson"]["subject"]

Once all the information is collected, we add it to our final JSON, creating a new key if it doesn't exist or appending the DOI to the journal's list.
if key_dict in doi_json:
    doi_json[key_dict]["dois"].append(art_doi)
else:
    doi_json[key_dict] = {"title": journal_title, "pissn": journal_issn,
                          "eissn": journal_eissn, "dois": [art_doi],
                          "subject": journal_subject}

An example of an element in the final file:
{
    "1779-627X1779-6288": {
        "title": "International Journal for Simulation and Multidisciplinary Design Optimization",
        "pissn": "1779-627X",
        "eissn": "1779-6288",
        "subject": [
            {"code": "T55.4-60.8", "scheme": "LCC", "term": "Industrial engineering. Management engineering"},
            {"code": "T11.95-12.5", "scheme": "LCC", "term": "Industrial directories"}
        ],
        "dois": [
            "10.1051/ijsmdo:2008025",
            "10.1051/smdo/2019012",
            "10.1051/smdo/2020004",
            "10.1051/smdo/2020001",
            "10.1051/smdo/2016003",
            ...
        ]
    },
    ...
}

Expected result
  • doi.json
  • articles_without_dois.json

We create a file containing a dictionary with all the DOIs of DOAJ articles as keys and the "issn+eissn" identifier of the journal that published them as values, to simplify the next steps.
Expected result
dois_articles_journals.pickle
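The inversion described above can be sketched as follows; the doi_json sample and variable names are illustrative assumptions, not the repository's exact code:

```python
import pickle

# illustrative sample of the doi.json structure built in the previous step
doi_json = {
    "1779-627X1779-6288": {
        "title": "International Journal for Simulation and Multidisciplinary Design Optimization",
        "dois": ["10.1051/smdo/2019012", "10.1051/smdo/2020004"],
    },
}

# invert the mapping: article DOI -> "issn+eissn" journal identifier
dois_articles_journals = {}
for journal_key, journal in doi_json.items():
    for doi in journal["dois"]:
        dois_articles_journals[doi] = journal_key

# serialize with pickle for fast loading in the next steps
with open("dois_articles_journals.pickle", "wb") as f:
    pickle.dump(dois_articles_journals, f)
```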

Data Gathering: OpenCitations
Collecting and filtering data from OpenCitations: we take the data from the download section of the OpenCitations website and then refine it using the files obtained in the previous step.
Dataset
COCI March 2022 Dump

Expected result
by_journal.json: a file containing all the information extracted from OpenCitations about DOAJ journals, divided by year and journal. Inside the file, the researcher can find these fields:

  • A group of fields describing the selected journal.
  • The code of the journal (obtained by concatenating the journal's ISSN and EISSN).
  • The year to which all citation metrics belong.
  • The number of citations received.
  • The number of citations done.
  • The ratio between citations done and received.
  • The number of citations received from other DOAJ journals.
  • The number of citations done to other DOAJ journals.
  • The ratios between citations done to and received from DOAJ journals.

Expected result
normal.json: a file containing all the information extracted from OpenCitations about DOAJ journals, divided only by year. Inside the file, the researcher can find these fields:

  • The year to which all citation metrics belong.
  • The number of citations received.
  • The number of citations done.
  • The ratio between citations done and received.
  • The number of self-citations made by DOAJ inside OpenCitations.
  • The ratio between self-citations and the total citations received and done by DOAJ.

Expected result
errors.json: a file containing all the errors encountered during the computations. Inside the file, the researcher can find these fields:

  • errors about records that don't have any specified date (null dates).
  • errors about records that have impossible dates (wrong dates).
  • errors about articles that don't have any specified DOI.
  • errors about OpenCitations records that don't have any DOI in the citing or cited fields.

Expected result
DOAJ_metrics.json: a file containing metrics about DOAJ and OpenCitations, obtained from the computations. Inside the file, the researcher can find these fields:

  • Number of journals with DOIs.
  • Number of articles processed during the computations.
  • Number of used DOIs: all DOIs (without repetition) used for the add-journal operation in the second pipeline step.
  • Number of repeated DOIs: all DOIs repeated inside the same journal or in another one.
  • Number of accepted DOIs: all articles (with repetition) that have both a defined journal and a defined DOI.

Filter Open Citations
We iterate over all the records of the OpenCitations dump, keeping those that have at least one DOI in either the citing or cited column. For each directory:

1. We unpack all the zipped files into a temporary folder and iterate over all the unzipped CSV files:
for csv in iterator:

2. We split the CSV file into two dataframes. From each dataframe we delete all the records that have a null value in the citing or cited column:
df_cited, df_null_cited = csv_manager.delete_null_values(df, 'cited')

df_citing, df_null_citing = csv_manager.delete_null_values(df, 'citing')

3. For each dataframe, we filter all records that have a DOAJ DOI in either the citing or the cited column:
df_cited = csv_manager.refine(df_cited, ['oci', 'creation', 'cited'], 'cited', data_json)

df_citing = csv_manager.refine(df_citing, ['oci', 'creation', 'citing'], 'citing', data_json)
4. We add the journal name that matches the DOI in the citing or cited column. Additionally, we add a column for both the cited side (isDOAJ_cited) and the citing side (isDOAJ_citing), to identify which DOI of each record belongs to DOAJ (only the one in the cited column, only the one in the citing column, or the DOIs in both columns):
df_cited = csv_manager.add_journal(df_cited, "cited", data_json)

df_citing = csv_manager.add_journal(df_citing, "citing", data_json)
5. We merge the two dataframes into a single one with an outer join.
df_result = df_citing.merge(df_cited, how='outer').reset_index(drop=True).convert_dtypes().drop_duplicates()
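A minimal sketch of what the csv_manager helpers used in steps 2 and 3 might look like; the behaviour is assumed from the descriptions above, and the actual implementations in the repository may differ:

```python
import pandas as pd

def delete_null_values(df, column):
    # split df into rows that have a value in `column` and rows where it is null
    null_mask = df[column].isna()
    return df[~null_mask], df[null_mask]

def refine(df, columns, column, data_json):
    # keep only `columns`, and only the rows whose DOI in `column` belongs to
    # a DOAJ journal (data_json maps article DOIs to journal identifiers)
    return df.loc[df[column].isin(set(data_json)), columns].reset_index(drop=True)

df = pd.DataFrame({"oci": ["1-2", "3-4"],
                   "creation": ["2020-01", "2021-06"],
                   "cited": ["10.1051/smdo/2019012", None]})
df_cited, df_null_cited = delete_null_values(df, "cited")
df_cited = refine(df_cited, ["oci", "creation", "cited"], "cited",
                  {"10.1051/smdo/2019012": "1779-627X1779-6288"})
```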

Expected result
null_citing: a directory containing the records that have a null value in the citing column.

Expected result
null_cited: a directory containing the records that have a null value in the cited column.

Expected result
filtered: a directory containing all the files filtered on both the citing and cited columns, which contain at least one DOI from the DOAJ journals dump.

Group By Open Citations results
We iterate over each file of the filtered directory and for each one:

1. We transform the creation column into a date format:
df = csv_manager.add_year(df, 'creation')
2. We save and discard all the records that don't have any creation date or have a date later than 2024:
df, df_null, df_wrong = csv_manager.save_errors(df, name_file)
3. We split the main dataframe into two sub-dataframes: one for the group-by with only the year (df_normal); another for the group-by with both year and journal (df_by_journal).
df_normal = csv_manager.groupBy_year(df)

df_by_journal = csv_manager.groupBy_year_and_journal(df)
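The year-only grouping can be sketched as below, assuming boolean isDOAJ_cited and isDOAJ_citing columns produced by the filtering step (an illustration, not the repository's exact code):

```python
import pandas as pd

def groupBy_year(df):
    # per year, count how many citations have a DOAJ article on the cited
    # side and how many have one on the citing side
    return (df.groupby("year")
              .agg(cited=("isDOAJ_cited", "sum"),
                   citing=("isDOAJ_citing", "sum"))
              .reset_index())

df = pd.DataFrame({"year": [2020, 2020, 2021],
                   "isDOAJ_cited": [True, False, True],
                   "isDOAJ_citing": [False, True, True]})
out = groupBy_year(df)
```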

Expected result
normal: a directory where each file matches a file in the filtered directory. Each file inside this directory is a grouped version of the corresponding filtered one.
These files list the following fields:
  • year
  • number of citations received by DOAJ inside Open Citations (cited)
  • number of citations done by DOAJ inside Open Citations (citing)
  • number of citations done to itself by DOAJ inside Open Citations (self-citations)

Expected result
by_journal: a directory where each file matches a file in the filtered directory. Each file inside this directory is a grouped version of the corresponding filtered one. These files list the following fields:
  • year
  • code of the journal (ISSN + EISSN)
  • number of citations received by the DOAJ journal inside Open Citations (cited)
  • number of citations done by the DOAJ journal inside Open Citations (citing)
  • number of citations done to itself by the DOAJ journal inside Open Citations (self-citations)
  • number of citations done by the DOAJ journal to another DOAJ journal inside Open Citations (citations to DOAJ)
  • number of citations received by the DOAJ journal from another DOAJ journal inside Open Citations (cited by DOAJ)

Expected result
  • null_dates: a directory containing all the records that have a null date in the creation column.
  • wrong_dates: a directory containing all the records that have a wrong date (>= 2025) in the creation column.

Concatenate all results
We concatenate, using the pandas library, all the files in the normal directory and in the by_journal directory, to summarize all values in two dataframes:
df_normal = csv_manager.concat_csv_normal(all_csv_normal)

df_by_journal = csv_manager.concat_csv_journal(all_csv_byJournal)
We add to the df_by_journal the group of fields extracted from DOAJ for each journal, which adds useful information about the journal:
df_by_journal = csv_manager.add_to_journals_DOAJ_descriptions(df_by_journal, df_journals_description)
Finally, we concatenate all error files into one single file:
df_null_dates = csv_manager.concat_csv(all_csv_null_dates)

df_wrong_dates = csv_manager.concat_csv(all_csv_wrong_dates)

df_null_citing = csv_manager.concat_csv(all_csv_null_citing)

df_null_cited = csv_manager.concat_csv(all_csv_null_cited)

df_articles_without_dois = pd.read_json(all_articles_without_dois, orient='records')

df_errors = pd.DataFrame({
    'type_of_error': ['null_dates', 'wrong_dates', 'null_citing',
                      'null_cited', 'articles_without_dois'],
    'count': [sum(df_null_dates['oci']), sum(df_wrong_dates['oci']),
              len(df_null_citing), len(df_null_cited),
              len(df_articles_without_dois)]})

Expected result
normal.json: a file where each record lists the following fields:

  • year
  • cited: total number of citations received by DOAJ in Open Citations
  • citing: total number of citations done by DOAJ in Open Citations
  • self_citation: total number of citations done by DOAJ to itself (citations whose citing and cited articles are both published in DOAJ journals)

Expected result
by_journal.json: a file where each record lists the following fields:

  • year
  • journal
  • cited: total number of citations received by the journal in Open Citations
  • citing: total number of citations done by the journal in Open Citations
  • self_citation: total number of citations done by the journal to itself
  • citations_to_DOAJ: total number of citations done by a DOAJ journal to another DOAJ journal
  • cited_by_DOAJ: total number of citations received by a DOAJ journal from another DOAJ journal

Expected result
errors.json: a file that summarizes all the errors found during the previous computation:

  • null_dates
  • wrong_dates
  • null_citing
  • null_cited
  • articles_without_dois

Add Ratios to the final results
1. We add ratios to the normal.json:
normal_json = csv_manager.make_ratio(normal_json)
2. We add ratios to the by_journal.json:
by_journal_json = csv_manager.make_ratio_journal(by_journal_json)
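A sketch of how make_ratio could derive the new fields; the formulas below are assumptions for illustration, and the repository's exact definitions may differ:

```python
def make_ratio(records):
    # add percentage and ratio fields to each yearly record
    for rec in records:
        total = rec["citing"] + rec["cited"]
        rec["citing_cited_ratio"] = rec["citing"] / rec["cited"] if rec["cited"] else None
        rec["citing_cited_pcent"] = round(100 * rec["citing"] / total, 2) if total else None
        rec["self_citation_ratio"] = rec["self_citation"] / total if total else None
        rec["self_citation_pcent"] = round(100 * rec["self_citation"] / total, 2) if total else None
    return records

normal_json = [{"year": 2020, "citing": 150, "cited": 50, "self_citation": 20}]
normal_json = make_ratio(normal_json)
```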

Expected result
by_journal.json: the same file as the previous one, with the following ratio metrics added:
  • citing_cited_pcent
  • citations_to_DOAJ_pcent
  • cited_by_DOAJ_pcent
  • self_citation_pcent
  • citing_cited_ratio
  • citations_to_DOAJ_ratio
  • cited_by_DOAJ_ratio
  • self_citation_ratio

Expected result
normal.json: the same file as the previous one, with the following ratio metrics added:
  • citing_cited_pcent
  • self_citation_pcent
  • citing_cited_ratio
  • self_citation_ratio

Add useful metrics
We add the following metrics to a JSON file, in order to provide a summary of useful research information about the DOIs processed from DOAJ.
Expected result
DOAJ_metrics.json: a file where the researcher can find some information about the DOIs processed from DOAJ:
  • Number of journals with DOIs.
  • Number of articles processed during the computations.
  • Number of used DOIs: all DOIs (without repetition) used for the add-journal operation in the second pipeline step.
  • Number of repeated DOIs: all DOIs repeated inside the same journal or in another one.
  • Number of accepted DOIs: all articles (with repetition) that have both a defined journal and a defined DOI.
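These counts could be assembled along the following lines; the doi_json sample and metric names are illustrative assumptions, not the repository's exact code:

```python
# illustrative doi.json sample: one DOI ("10.1/a") is repeated
doi_json = {
    "1111-22223333-4444": {"dois": ["10.1/a", "10.1/b", "10.1/a"]},
    "5555-6666": {"dois": ["10.1/c"]},
}

# flatten every journal's DOI list, then deduplicate
all_dois = [doi for journal in doi_json.values() for doi in journal["dois"]]
used_dois = set(all_dois)
metrics = {
    "journals_with_dois": len(doi_json),
    "processed_articles": len(all_dois),
    "used_dois": len(used_dois),
    "repeated_dois": len(all_dois) - len(used_dois),
}
```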

Data Visualization
We visualize our results in line, bar and scatter graphs using the plotly Python library.
We load our JSON data from the queried folder into pandas DataFrames.

import pandas as pd
import plotly.express as px

final_df_years = pd.read_json('../../queried/final_output/normal.json')
final_df_journal = pd.read_json('../../queried/final_output/by_journal.json')
errors = pd.read_json('../../queried/final_output/errors.json')

We query the final_df_journal data frame to find the biggest DOAJ journal in terms of number of citations, references, citations to DOAJ journals, and citations from DOAJ journals.

group_journals = final_df_journal.groupby('title')[['cited', 'citing', 'citations_to_DOAJ', 'cited_by_DOAJ']].sum()

group_journals.idxmax()

Expected result
cited PLoS ONE
citing PLoS ONE
citations_to_DOAJ PLoS ONE
cited_by_DOAJ PLoS ONE
We create the final_df_journal_1 data frame with the result of the query.
final_df_journal_1 = final_df_journal[final_df_journal['title']==group_journals['citing'].idxmax()]

To have a better understanding of our data, we examine the most recurring subjects among DOAJ journals.
final_df_journal['subject'] = final_df_journal['subject'].apply(lambda x: [y['term'] for y in x])

# some journals have several subjects, so we separate them to be able to plot them
exploded = final_df_journal.explode('subject')
# group journals by subject and year
grouped_df = exploded.groupby(['subject', 'year']).size().reset_index(name="num_journals")
grouped_df = grouped_df.sort_values(['num_journals'], ascending=False)
# select the 20 most recurring subjects and limit the results to the 21st century
most_journals_by_subject = grouped_df.drop_duplicates(subset=['subject']).head(20)['subject'].tolist()
grouped_df = grouped_df.loc[grouped_df['year'] > 1999].sort_values(['year', 'num_journals'], ascending=False)
We represent it with a line plot.
Expected result
Timeline of the 20 most recurring subjects among DOAJ journals

In order to examine the citations made by journals overall, regardless of the year, we group the journals by title and sum the relevant columns.
group_journals = final_df_journal.groupby(['title'], as_index=False).agg({'dois_count':'first', 'subject':'first','cited':'sum', 'citing':'sum', 'self_citation':'sum', 'citations_to_DOAJ':'sum', 'cited_by_DOAJ':'sum'})
We then use bar plots to visualize the citation data about DOAJ journals.
Expected result
  • Bar plot of the 30 DOAJ journals doing the most citations.
  • Bar plot of the 30 DOAJ journals doing the most citations to DOAJ journals.
  • Bar plot of the 30 DOAJ journals getting cited the most.
  • Bar plot of the 30 DOAJ journals getting cited the most by DOAJ journals.

We repeat step 3.3 with scatter plots, including information about the number of articles per journal.

Expected result
  • Scatter plot of the 30 DOAJ journals doing the most citations, with size by number of articles.
  • Scatter plot of the 30 DOAJ journals doing the most citations to DOAJ journals, with size by number of articles.
  • Scatter plot of the 30 DOAJ journals getting cited the most, with size by number of articles.
  • Scatter plot of the 30 DOAJ journals getting cited the most by DOAJ journals, with size by number of articles.

We examine the journals doing the most self-citations by year, using a line plot.
self_citations_df = final_df_journal.sort_values(["self_citation"], ascending=False)
list_journals = self_citations_df.drop_duplicates(['journal']).head(20)['title'].tolist()
most_self_cit = self_citations_df.loc[self_citations_df['title'].isin(list_journals)]
most_self_cit = most_self_cit[most_self_cit.year > 1999].sort_values(['year'])

Expected result
Timeline of journals doing the most self-citations since 2000.

To better compare the citing and cited counts of DOAJ journals over the last 20 years, we create bar plots that stack the two amounts in the same column.
Expected result
  • Timeline of comparison between citing and cited of all DOAJ journals in the last 20 years,
  • Timeline of comparison between citations and references of the biggest DOAJ journal in the last 20 years,
  • Timeline of comparison between the number of citations and references from the biggest DOAJ journal to DOAJ journals in the last 20 years,
  • Timeline of comparison between the percentage of citations and references from the biggest DOAJ journal to DOAJ journals in the last 20 years.

We use a bar plot to visualize the timeline, over the last 20 years, of the number of citations involving DOAJ journals as both citing and cited entities, and their percentage of all citations.
Expected result
  • Timeline of the number of citations both coming from and going to DOAJ journals in the last 20 years,
  • Timeline of the percentage of citations both coming from and going to DOAJ journals in the last 20 years,
  • Timeline of the percentage of citations going to DOAJ journals from the biggest DOAJ journal in the last 20 years.


We use a bar plot to show the number of errors we encountered in the project divided by category.

Expected result
Types of errors and their count.

Publishing data
We publish the resulting JSON files on Zenodo and in our GitHub repository (queried folder).