PROTOCOL – Availability of Open Access Metadata from Open Journals – A case study in DOAJ and Crossref

Davide Brembilla; Chiara Catizone; Giulia Venditti

May 24, 2022

Version 4


DOI

dx.doi.org/10.17504/protocols.io.kxygxz7ywv8j/v4

¹Alma Mater Studiorum - Università di Bologna

Open Science

Davide Brembilla

DOI: dx.doi.org/10.17504/protocols.io.kxygxz7ywv8j/v4

Protocol Citation: Davide Brembilla, Chiara Catizone, Giulia Venditti 2022. PROTOCOL – Availability of Open Access Metadata from Open Journals – A case study in DOAJ and Crossref. protocols.io https://dx.doi.org/10.17504/protocols.io.kxygxz7ywv8j/v4
Version created by Davide Brembilla

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it’s working

Created: May 21, 2022

Last Modified: May 24, 2022

Protocol Integer ID: 62985

Keywords: Open Science

Abstract

This protocol describes our research on the availability in Crossref of metadata from the open journals listed in DOAJ.
The goal is to find out how many articles from DOAJ journals are available on Crossref, whether their metadata is available, and the origin of their references' DOIs. This will give us a clearer picture of the state of Open Access research.

Our research involves the articles from Open Access journals in DOAJ and their data on Crossref. We examined the availability of these articles' reference lists, the presence of IDs such as DOIs, and the entities responsible for their specification. This analysis was carried out using the DOAJ article metadata dump, which was then modified and enriched through DOI requests to Crossref's API. The data will be further analysed to verify the distribution of the provenance of the articles' DOIs.

Guidelines

This methodology cleans the batches of articles downloadable from the DOAJ website and populates them with information from Crossref's API.
All the software needed to perform this methodology can be found in its GitHub repository. All the software is distributed under the MIT licence.
To describe the input and output data, we produced a Data Management Plan.
The main software is in multithread_populating.py, while main.py is the script used to launch the query from the command line. To populate, we also used the batches_cleaner.py and journals_cleaner.py scripts. At the end, we also used the populator.py script to add the missing details and the data_viz.ipynb notebook to create visualisations.
If you want to query Crossref through a DOI, you can create a populateJson instance and query the API with just the DOI. If you are interested in more complete access to the API, you can find here a more complete list of libraries that have a wider scope; another useful piece of software, which inspired the one used here, is oc_graphenricher.

Materials

All the software needed to perform this methodology can be found in its GitHub repository. All the software is distributed under the MIT licence.
This methodology cleans the batches of articles downloadable from the DOAJ website and populates them with information from Crossref's API.

Safety warnings

Querying Crossref may be a long process, whose duration depends on the power of your PC and on your internet connection.

Before start

All the software needed to perform this methodology can be found in its GitHub repository. In order to use it, you will need Python 3 and the libraries listed in the requirements.txt file. To install them, you can use one of the following commands:
pip install -r requirements.txt
pip3 install -r requirements.txt

Data Gathering

Download DOAJ articles metadata

To start our research, we first focused on retrieving and cleaning DOAJ articles' data, as we identified it as the starting point from which we could look for an answer to our research questions.

We downloaded the DOAJ public data dump containing article metadata in tar.gz format and extracted the files it contains.

The dump is structured as a single directory of the form doaj_article_data_[date generated], which lists 75 files with names of the form article_batch_[number].json. Each file contains up to 100,000 records, for a total size of 270Mb.

To decrease the size of each batch, we filtered the keys of each record to keep only the information useful for our research:

DOI of the article
year of publication
ISSNs of the journal it belongs to

To do so we used the batches_cleaner.py software.


#Scripts to execute:
py -m batches_cleaner doaj_article_data_[date generated] #windows
python3 -m batches_cleaner doaj_article_data_[date generated] #macos


Result example
{
    "doi": {
        "year": "XXXX",
        "issns": [
            "XXXX-XXXX",
            "XXXX-XXXX"
        ]
    },
...
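
The filtering step above can be sketched as follows. This is a minimal illustration, not the code of batches_cleaner.py: the function names clean_record and clean_batch are hypothetical, and the layout of a DOAJ record (a "bibjson" object holding an "identifier" list, a "year" and a "journal" with "issns") is an assumption about the dump format that should be checked against the downloaded files.

```python
def clean_record(record):
    """Return (doi, {"year": ..., "issns": [...]}) or None if no DOI is found.

    Assumes the (hypothetical) record layout described in the lead-in.
    """
    bibjson = record.get("bibjson", {})
    doi = None
    for identifier in bibjson.get("identifier", []):
        if (identifier.get("type") or "").lower() == "doi" and identifier.get("id"):
            doi = identifier["id"]
            break
    if doi is None:
        return None
    return doi, {
        "year": bibjson.get("year", ""),
        "issns": list(bibjson.get("journal", {}).get("issns", [])),
    }


def clean_batch(records):
    """Build the {doi: {"year", "issns"}} mapping shown in the result example."""
    cleaned = {}
    for record in records:
        result = clean_record(record)
        if result is not None:
            doi, info = result
            cleaned[doi] = info
    return cleaned
```

Records without a DOI are skipped in this sketch, since the DOI is the key of the cleaned mapping.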

Collecting Crossref data

To better answer our research questions, the cleaned dump had to be enriched with information retrieved from the Crossref REST API. To do so we used the main.py script, which launches the multithread_populating.py software; both are developed in Python.

All methods developed for managing Crossref requests are listed as follows:

To run the whole software we launched from the shell the following command:

py -m main [path to files] #windows 
python3 -m main [path to files] #macos

The result of populating will be saved in /temp/completed

populate method.

The populate method iterates over all files in the directory doaj_article_data_[date generated], checking whether they are temporary files. In the end it produces a JSON file named batch.json in the output directory.

The _read_json method iterates over the articles stored in article_batch_[number].json.
Here ISSNs are used as index keys for the DOIs, grouping them under the journal they belong to.

If the response status code is not equal to 200, the DOI is not present in Crossref, so we add these key-value pairs:
"crossref": 0
"reference": 0

Otherwise we read the response message and add these key-value pairs:
"crossref": 1
"reference": 0

Then, if reference values are present in the response, we go through the reference list to read the information about each referenced article. If the DOI is present, we add this key-value pair to our record:
"doi": "article-doi"
"doi": "not-specified" (if not present)

Then we look for the entity responsible for the DOI specification, which Crossref saves under the key "doi-asserted-by". Consequently, we add one of these key-value pairs to the reference information:
"doi-asserted-by": "publisher"
"doi-asserted-by": "crossref"
"doi-asserted-by": "not-specified" (if not present)
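
The rules above can be sketched as a small function. It is an illustration under stated assumptions, not the code in multithread_populating.py: summarise_response is a hypothetical name, it takes the HTTP status code and the "message" object of the Crossref response, and it assumes that each item of the "reference" list carries its DOI under the key "DOI". It also assumes that "reference" is set to 1 when a reference list is present; the "reference-list" key is purely illustrative.

```python
def summarise_response(status_code, message=None):
    """Turn a Crossref response into the key-value pairs described above."""
    if status_code != 200:
        # The DOI was not found on Crossref
        return {"crossref": 0, "reference": 0}

    record = {"crossref": 1, "reference": 0}
    references = (message or {}).get("reference", [])
    if references:
        record["reference"] = 1  # assumption: 1 marks a deposited reference list
        record["reference-list"] = [  # hypothetical key name
            {
                "doi": ref.get("DOI", "not-specified"),
                "doi-asserted-by": ref.get("doi-asserted-by", "not-specified"),
            }
            for ref in references
        ]
    return record
```

Keeping this logic separate from the HTTP request makes it possible to check it on hand-written responses without contacting the API.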

When the file is finally populated, a temporary copy is saved in temp/completed to save time in case we need to restart the process.

For each article DOI, we sent a request through the query_crossref method.
As the process could encounter errors, we coded our software so that it saves a temporary file in the directory /temp. This precaution prevents us from losing any progress we made, as the whole process was already long to run.


query_crossref method launches requests to Crossref REST API in the format:

 https://api.crossref.org/works/{doi}

The method uses requests_cache to cache responses and backoff to retry after a ReadTimeout error, in order to avoid getting blocked and to speed up the process.

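import backoff
import requests
import requests_cache

# Methods of the class that queries Crossref (excerpt)
def __init__(self) -> None:
    # Cache responses locally so that repeated requests are not sent again
    requests_cache.install_cache('multithread_cache')
    self.api = "https://api.crossref.org/works/"

# Retry with exponential backoff on read timeouts, up to 20 attempts
@backoff.on_exception(backoff.expo, requests.exceptions.ReadTimeout, max_tries=20)
def query_crossref(self, doi):
    query = self.api + doi
    req = requests.get(query, timeout=60)
    return req, doi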
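
Since the requests are independent of each other, they can be sent from several threads. The sketch below shows one way to do this with the standard library; it is not taken from multithread_populating.py. The function query_all, its fetch parameter and the number of workers are hypothetical; fetch stands for any callable that takes a DOI and returns a (response, doi) pair, as query_crossref does.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_all(dois, fetch, max_workers=8):
    """Call fetch(doi) for every DOI in parallel and map each DOI to its status code.

    fetch must return a (response, doi) pair where response has a
    status_code attribute (hypothetical interface, mirroring query_crossref).
    """
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, doi) for doi in dois]
        for future in as_completed(futures):
            response, doi = future.result()
            statuses[doi] = response.status_code
    return statuses
```

Passing the query_crossref method of an instance as fetch would send the real requests; a stub can be passed instead to try the function without network access.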

To produce files compatible with data analysis libraries, we used the stats script, which creates the CSV tables that are used to create the final pickle file. Each row of a table contains the information of one article.
To launch it from the terminal, the command is:
py -m stats path #windows
python3 -m stats path #macos

As we recognised that some valuable information was missing from DOAJ's cleaned articles, a third researcher added the additional information coming from DOAJ's journals data and double-checked the work already done on the cleaned articles files.

The data added in this step is:
Country of provenance of the journal in ISO alpha-2 format
Subject field in the LCC classification

py -m cleaner doaj_article_data_[date generated] #windows
python3 -m cleaner doaj_article_data_[date generated] #macos


Result example
{
    "XXXX-XXXX": {
        "code": "XXXX", #eissn or pissn
        "country": "XX", #country code alpha 2
        "subject": [
            {
                "code": "X",
                "scheme": "XXX",
                "term": "some_term"
            },
            {
                "code": "X",
                "scheme": "XXX",
                "term": "some_term"
            },
            {
                "code": "X",
                "scheme": "XXX",
                "term": "some_term"
            }
        ]
    },
...

Safety information
When dealing with subject data, we started from the subject codes provided by Crossref, selecting the first item in the subject array, item ['subject']['code'], as the value for this additional column.

Example node:

"subject": [{"code": "L", "scheme": "LCC", "term": "Education"}]

Populate the final dataset with populator.py.
This script uses information from the journals and adds it to the articles.
With this script, we added the information about the country of origin and the subject of the journal.

Command to launch it from the terminal:

py -m populator path

Results template:

['issn', 'doi', 'doi-num', 'on-crossref', 'reference', 'asserted-by-cr', 'asserted-by-pub', 'ref-undefined', 'ref-num', 'year', 'country', 'subject']
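
A minimal sketch of this step, assuming that the journal data has the shape of the journals result example above (an ISSN mapped to a "country" and a "subject" list) and that each article row is a dictionary with an "issn" key. The function name add_journal_info is hypothetical and the code is not taken from populator.py; taking the code of the first subject follows the note on subject data above.

```python
def add_journal_info(article, journals):
    """Return a copy of the article row with 'country' and 'subject' filled in.

    journals maps an ISSN to {"country": ..., "subject": [{"code": ...}, ...]}.
    Missing information is left as None.
    """
    row = dict(article)
    journal = journals.get(row.get("issn"), {})
    subjects = journal.get("subject") or []
    row["country"] = journal.get("country")
    # Keep only the code of the first subject in the list
    row["subject"] = subjects[0]["code"] if subjects else None
    return row
```

Returning a copy leaves the original row untouched, which makes it easier to re-run the step after fixing the journal data.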

Finally, we created a pickle file to have the dataframe in one place simplifying the following processes.

Data Analysis & Visualisation

Starting from our final dataframe, [20220521_la_chouffe_aggregated_data_v_0_0_1.pkl], we loaded it in Python and used the following commands to get statistical data on the df columns, such as the means and counts of the numerical values:

df.describe() #for describing the whole df
-----------
df[df["column-name"] == value].describe() #for describing a subset of the df

In our research we used multiple measures, repeated across the different scopes of the research, to address the following questions:

RQ1: How many articles published in the open access journals in DOAJ are included in Crossref?
RQ2: How many of these articles include information about their reference lists?
RQ3: How many references have a DOI specified?
RQ4: How many of these DOIs have been specified by the publishers? And how many by Crossref?

After printing the description results for the whole dataset and for its subgroup having 'reference' == 1, we decided to first look for a general description of the variables involved in answering each of our research questions.

Here, we selected the first kind of data visualization to look at, to get a general understanding of what is actually happening in the OA field with regard to our study objectives.

Plotly is the Python library used at this stage.

To answer RQ1, we present the percentage of articles also present on Crossref over the total number of articles in DOAJ.
The data used is:
totdoi = len(working) #tot number of dois over our dataframe
oncross = working['on-crossref'].sum() # tot DOIS also on Crossref
notOn = totdoi - oncross # tot DOIS not on crossref

To answer RQ2, we plotted the percentage of DOAJ articles on Crossref having a reference list over the total number of DOAJ articles present there.
The data used is:
noRef = working['on-crossref'].sum() - working['reference'].sum() # DOAJ articles on Crossref without references
totdoi = working['reference'].sum() # tot DOAJ articles on Crossref having references
 
To answer RQ3, we plotted the percentage of references having a DOI specified over the total number of references.
The data used is:
ref_defined = working['ref-num']-working['ref-undefined'] # ref defined by someone
ref_defined = ref_defined.sum() # tot references w/ DOI defined by someone
ref_undefined = working['ref-undefined'].sum() # tot references w/out DOI

To answer RQ4, we present the percentages of references of DOAJ articles having: a DOI asserted by Crossref, a DOI asserted by the publisher, no DOI specified at all.
The data used is:

ass_cross = working['asserted-by-cr'].sum() #tot num of DOIs asserted by Crossref
ass_pub = working['asserted-by-pub'].sum()  #tot num of DOIs asserted by publishers
und = working['ref-undefined'].sum()  #tot num of references without a DOI
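
As an illustration, the counts for the four questions can be turned into the percentages shown in the pie charts. The sketch below uses a tiny made-up dataframe with the column names of the results template; the values are invented for the example and are not results of the study, and the helper percentage is hypothetical.

```python
import pandas as pd

# Toy data: three articles, invented values
working = pd.DataFrame({
    "on-crossref": [1, 1, 0],
    "reference": [1, 0, 0],
    "ref-num": [10, 0, 0],
    "ref-undefined": [4, 0, 0],
    "asserted-by-cr": [2, 0, 0],
    "asserted-by-pub": [4, 0, 0],
})


def percentage(part, total):
    """Percentage of part over total, 0.0 when total is zero."""
    return 100.0 * float(part) / float(total) if total else 0.0


on_cr = working["on-crossref"].sum()
with_ref = working["reference"].sum()
tot_ref = working["ref-num"].sum()

rq1 = percentage(on_cr, len(working))  # articles on Crossref
rq2 = percentage(with_ref, on_cr)      # of those, articles with a reference list
rq3 = percentage(tot_ref - working["ref-undefined"].sum(), tot_ref)  # references with a DOI
rq4_cr = percentage(working["asserted-by-cr"].sum(), tot_ref)    # DOIs asserted by Crossref
rq4_pub = percentage(working["asserted-by-pub"].sum(), tot_ref)  # DOIs asserted by publishers
```

Guarding against a zero total avoids a division error for subsets that contain no articles or no references.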

After setting these counts as the values we wanted to focus on, we made a pie chart for each question, following the templates you can find in our notebook on Data Analysis.
Note
Here we took a break from our code.
This hands-off time was invested in making some observations on the new insights into our data that we got thanks to data viz!

We recommend you do the same here!

Further Inquiry

As we moved towards a more granular description of our data, we started introducing additional variables such as country and subject (i.e., the research field). In doing so we also introduced new kinds of visualizations, to be selected later depending on the relevance of the information we could infer from the evidence in our data.

We start by re-iterating all of our research questions.

The first variable introduced in this new iteration is the subject of the OA journal the articles belong to; then we also introduce the country, and finally we explore the trend over time of the percentage of DOAJ DOIs on Crossref.


RQ1 

    1. Box-plot

This visualization was selected to show the distribution of the DOAJ articles registered on Crossref that we presented in the first pie chart.
     
First we grouped the frame by year and summed the numerical values in it:

frame2 = frame2.groupby('year').sum()


Y axis: percentage of dois present on Crossref 

frame2['perc_cr'] = (frame2['on-crossref']/frame2['doi-num'])*100 # Y axis data (floats)
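
A small self-contained sketch of this computation on invented data. It assumes that each row of the dataframe carries the 'year', 'doi-num' and 'on-crossref' columns of the results template; the numbers are made up for the example.

```python
import pandas as pd

frame2 = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "doi-num": [10, 30, 50],
    "on-crossref": [5, 25, 40],
})

# Sum the numerical columns per year, then derive the percentage
frame2 = frame2.groupby("year").sum()
frame2["perc_cr"] = (frame2["on-crossref"] / frame2["doi-num"]) * 100

# After groupby, 'year' is the index; reset it to use the year as a column
frame2 = frame2.reset_index()
```

Computing the percentage after the sum gives the share of DOIs per year, which differs from averaging per-row percentages when the rows have different sizes.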


    2. Scatter-plot by subject
 
In this second  graph we grouped the working df  by 'subject' and summed the numerical values in it

frame1 = frame1.groupby('subject').sum()

Safety information
To make the subject labels more straightforward for readers, we replaced the array of tick labels on the X axis with a new one holding, at the position of each label, the name of the corresponding subject according to the Library of Congress Classification (LCC).
 

X axis: subjects of DOAJ  journals
Y axis: percentage of dois present on Crossref 
Color scale: tot number of dois in the subject
Bubble size: tot number of references

frame1['subject'] # Y axis data (categorical)
frame1['perc_cr'] = (frame1['on-crossref']/frame1['doi-num'])*100 # X axis data (floats)
frame1['doi-num'] # Color scale data (int)
frame1['ref-num'] # Bubble size data (int)

    3. Scatter-plot by country

In this scatterplot we grouped the working df  by 'country' and summed the numerical values in it.

#import libraries that change the country codes (alpha-2) into country and continent names
import country_converter as coco
import pycountry_convert as pc

frame3 = frame3.groupby('country').sum()
frame3['country-name'] = coco.convert(names=frame3.index, to="name")

We add the continent names column

continent_name = []  # continent name of each country, in index order
for x in frame3.index:
    country_continent_code = pc.country_alpha2_to_continent_code(x)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    continent_name.append(country_continent_name)
frame3['continent'] = continent_name

X axis: countries of DOAJ  journals
Y axis: percentage of dois present on Crossref 
Color scale: continent 

frame3['country-name'] # Y axis data (categorical)
frame3['perc_cr'] = (frame3['on-crossref']/frame3['doi-num'])*100 # X axis data (floats)
frame3['continent'] # Color scale data (categorical)

    3b. Bar-chart – focus on countries with lowest percentages of DOIs on Crossref

In this bar chart we grouped the working df by 'country' and summed the numerical values in it.
Then, we kept only the rows (i.e. countries) having a percentage of DOIs on Crossref greater than 0 and lower than 80%.
 
frame4 = frame4[(frame4.perc_cr < 80) & (frame4.perc_cr > 0)]

X axis: countries of DOAJ journals
Y axis: percentage of DOIs present on Crossref
Color scale: continent

frame4['country-name'] # Y axis data (categorical)
frame4['perc_cr'] = (frame4['on-crossref']/frame4['doi-num'])*100 # X axis data (floats)
frame4['continent'] # Color scale data (categorical)

    4. Line-chart – by years

In this chart we filtered out the years before 1950 and after 2022, then grouped the working df by 'year' and summed the numerical values in it.

frame7 = frame7[(frame7.year >= 1950)&(frame7.year < 2022)]
frame7 = frame7.groupby('year').sum()

X axis: pub year of DOAJ  articles
Y axis: percentage of dois present on Crossref 

frame7['year'] # X axis data (ordinal)
frame7['perc_cr'] = (frame7['on-crossref']/frame7['doi-num'])*100 # Y axis data (floats)


    4b. Scatterplot  – focus on the last 2 decades

In this chart we grouped the working df by 'year', summed the numerical values in it and further filtered out the years outside the past 22 years.

frame8 = frame8[(frame8.year >= 2000)&(frame8.year < 2022)]


X axis: pub year of DOAJ  articles
Y axis: percentage of dois present on Crossref
Size: tot number of DOIs on DOAJ 

frame8['year'] # X axis data (ordinal)
frame8['perc_cr'] = (frame8['on-crossref']/frame8['doi-num'])*100 # Y axis data (floats)
frame8['doi-num'] # Size data (int)

Safety information
Additionally, we tried plotting some map visualizations to better explore differences in the use of Crossref among countries, but they provided no valuable information in addition to what we already had.

Remember: LESS IS MORE!


RQ2 

    1. Box-plot

This visualization was selected to show the distribution of the articles having metadata on references over the total number of DOAJ articles listed on Crossref.
   
First we filtered out articles having year < 1950 or > 2022, then grouped the frame by year and summed the numerical values in it.

frame2a = frame2a[(frame2a.year >= 1950)&(frame2a.year < 2022)]
frame2a = frame2a.groupby('year').sum()


Y axis: percentage of dois present on Crossref having a reference list

frame2a['perc_ref'] = (frame2a['reference']/frame2a['on-crossref'])*100 # Y axis data (floats)

    2. Bar-chart – by subject

In this chart we present the data frame divided into subjects, each described by the percentage of its articles on Crossref having a reference list, and by the total count of references.

We grouped the working df  by 'subject' and summed the numerical values in it.

frame9 = frame.groupby('subject').sum()

X axis: subjects of DOAJ  journals
Y axis: percentage of references present on Crossref 
Color scale: Tot number of references

frame9['subject'] # X axis data (categorical)
frame9['perc_ref']= (frame9['reference']/frame9['on-crossref'])*100 # Y axis data (floats)
frame9['ref-num'] # Color scale data (int)

    3. Bar-chart – by country/continent

In this chart we present the data frame divided into countries, each described by the percentage of its articles on Crossref having a reference list, and by the total count of references.

We grouped the working df by 'country' and summed the numerical values in it.


frame13a = frame13a.groupby('country').sum()
frame13a['perc_ref'] = (frame13a['reference']/frame13a['on-crossref'])*100
frame13a['country-name'] = coco.convert(names=frame13a.index, to="name")


X axis: countries of DOAJ journals
Y axis: percentage of references present on Crossref
Color scale: Tot number of references

frame13a['country-name'] # X axis data (categorical)
frame13a['perc_ref'] = (frame13a['reference']/frame13a['on-crossref'])*100 # Y axis data (floats)
frame13a['ref-num'] # Color scale data (int)

    3b. Scatter-map – by country/continent

As we deal with geographical data, we also decided to exploit the scatter-map visualizations provided by plotly to highlight, across the globe, the countries that were doing best.

First we group the working frame by country, compute the reference percentages as above, change our ISO alpha-2 codes into ISO alpha-3 and finally add the continent.


frame14 = frame14.groupby('country').sum()
frame14['perc_ref'] = (frame14['reference']/frame14['on-crossref'])*100
frame14['iso_alpha'] = coco.convert(names=frame14.index, to='ISO3')


Locations: countries of DOAJ  as iso-alpha 3 codes
Color: percentage of references present on Crossref 
Size: percentage references on Crossref


frame14['iso_alpha'] # Locations data (categorical)
frame14['continent'] # Color scale data (categorical)
frame14['perc_ref'] = (frame14['reference']/frame14['on-crossref'])*100 # Size data (floats)


    4. Line-chart – by year

In this chart we grouped the working df by 'year', summed the numerical values in it and further filtered out the years outside the range 1950-2022.
frame16 = frame16[(frame16.year >= 1950)&(frame16.year < 2022)]
frame16 = frame16.groupby('year').sum()

X axis: pub year of DOAJ  articles
Y axis:   percentage references on Crossref
frame16['year'] # X data (ordinal)
frame16['perc_ref'] = (frame16['reference']/frame16['on-crossref'])*100 # Y data (floats)


    4b. Scatterplot  – focus on the last 2 decades

In this chart we grouped the working df by 'year', summed the numerical values in it and further filtered out the years outside the past 22 years.
frame15 = frame15[(frame15.year >= 2000)&(frame15.year < 2022)]
frame15 = frame15.groupby('year').sum()

X axis: pub year of DOAJ articles
Y axis: percentage references on Crossref
Size: tot number of DOAJ articles having reference lists on Crossref
frame15['year'] # X axis data (ordinal)
frame15['perc_ref'] = (frame15['reference']/frame15['on-crossref'])*100 # Y axis data (floats)
frame15['ref-num'] # Size data (int)


    5.  Line-chart – by year

This visualization has 2 traces, with the purpose of comparing the percentages of DOAJ DOIs and of their reference lists on Crossref, again over the year span 1950-2022.
frame18 = frame18[(frame18.year >= 1950)&(frame18.year < 2022)]
frame18 = frame18.groupby('year').sum()

X axis: pub year of DOAJ  articles
Y axis:   percentage dois and reference lists on Crossref
frame18['perc_ref'] = (frame18['reference']/frame18['on-crossref'])*100
frame18['perc_cr'] = (frame18['on-crossref']/frame18['doi-num'])*100

frame18['year'] # X axis data (ordinal)
frame18['perc_ref'] # Y axis first trace data (floats)
frame18['perc_cr'] # Y axis second trace data (floats)



RQ3

    1. Stacked bar-chart – by subject

This visualization, also with two traces, compares across subjects the number of references having a DOI specified with the number of those not having one.
frame20 = frame20.groupby('subject').sum()

We calculate the needed percentages 
frame20['perc_ref_nodoi'] = (frame20['ref-undefined']/frame20['ref-num'])*100
frame20['perc_ref_doi'] = 100 - frame20['perc_ref_nodoi']

X axis: subjects of DOAJ  articles
Y axis:   percentage of DOIs specified and percentage of DOIs not specified
frame20['subject'] # X axis data (categorical)
frame20['perc_ref_nodoi'] # Y axis first trace data (floats)
frame20['perc_ref_doi'] # Y axis second trace data (floats)


    2. Scatter-map 

In this visualization we exploit two different traces to make a scatter map with concentric bubbles.

First we group the working frame by country, compute the reference percentages as above and change our ISO alpha-2 codes into ISO alpha-3.
frame22['iso_alpha'] = coco.convert(names=frame22.index, to='ISO3')
frame22['country-name'] = coco.convert(names=frame22.index, to='name')
frame22['perc_ref_nodoi'] = (frame22['ref-undefined']/frame22['ref-num'])*100
frame22['perc_ref_doi'] = 100 - frame22['perc_ref_nodoi']

Locations: countries of DOAJ  as iso-alpha 3 codes
Color: percentage of references with DOIs on Crossref and  percentage of references without DOIs on Crossref 
Size: percentage references on Crossref
frame22['iso_alpha'] # Locations data (categorical)
frame22['perc_ref_nodoi'] # Color first trace data (float)
frame22['perc_ref_doi'] # Color second trace data (float)


Finally we merged the two maps together and obtained the final one with concentric bubbles.


    3.  Histogram – by year 

With this visualization we compare the trends of references with and without DOIs over the time range 1950-2022
frame24 = frame24[(frame24.year >= 1950)&(frame24.year < 2022)]
frame24 = frame24.groupby('year').sum()

X axis: years of DOAJ  articles
Y axis and color:   percentage of DOIs specified and percentage of DOIs not specified
frame24['year'] # X axis data (ordinal)
frame24['perc_ref_nodoi'] # Y axis and Color first trace data (float)
frame24['perc_ref_doi'] # Y axis and Color second trace data (float)


 
RQ4

  1. Stacked bar-chart – by subject

This visualization, with three traces, compares the number of references whose DOI was specified by Crossref, by the publisher, or not specified at all, focusing on the subjects.
frame26 = frame26.groupby('subject').sum()

We calculate the needed percentages 
frame26['perc_asserted_cr'] = (frame26['asserted-by-cr']/frame26['ref-num'])*100
frame26['perc_asserted_pub'] = (frame26['asserted-by-pub']/frame26['ref-num'])*100
frame26['perc_ref_nodoi'] = (frame26['ref-undefined']/frame26['ref-num'])*100

X axis: subjects of DOAJ  articles
Y axis:   percentage of DOIs specified by crossref, publisher or not specified at all
frame26['subject'] # X axis data (categorical)
frame26['perc_asserted_cr'] # Y axis first trace data (floats)
frame26['perc_asserted_pub'] # Y axis second trace data (floats)
frame26['perc_ref_nodoi'] # Y axis third trace data (floats)
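
The three percentages can be checked on a small invented example. The sketch assumes that every reference falls into exactly one of the three groups (DOI asserted by Crossref, DOI asserted by the publisher, no DOI), so that the percentages of each subject add up to 100; the data is made up for the example.

```python
import pandas as pd

frame26 = pd.DataFrame({
    "subject": ["L", "L", "Q"],
    "ref-num": [10, 10, 40],
    "asserted-by-cr": [2, 4, 10],
    "asserted-by-pub": [5, 5, 20],
    "ref-undefined": [3, 1, 10],
})

# Sum the counts per subject, then derive one percentage per group
frame26 = frame26.groupby("subject").sum()
for source, target in [
    ("asserted-by-cr", "perc_asserted_cr"),
    ("asserted-by-pub", "perc_asserted_pub"),
    ("ref-undefined", "perc_ref_nodoi"),
]:
    frame26[target] = (frame26[source] / frame26["ref-num"]) * 100

# Sanity check: the three shares of each subject should sum to 100
total = frame26[["perc_asserted_cr", "perc_asserted_pub", "perc_ref_nodoi"]].sum(axis=1)
```

If the totals differ from 100 on the real data, some references belong to none or to more than one of the groups, and the stacked bars will not fill the whole axis.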


  2. Stacked bar-chart – by country

This visualization, with three traces, compares the number of references whose DOI was specified by Crossref, by the publisher, or not specified at all, focusing on the country.
frame27 = frame27.groupby('country').sum()

We calculate the needed percentages
frame27['perc_asserted_cr'] = (frame27['asserted-by-cr']/frame27['ref-num'])*100
frame27['perc_asserted_pub'] = (frame27['asserted-by-pub']/frame27['ref-num'])*100
frame27['perc_ref_nodoi'] = (frame27['ref-undefined']/frame27['ref-num'])*100

X axis: country of DOAJ articles
Y axis: percentage of DOIs specified by Crossref, by the publisher or not specified at all
frame27['country'] # X axis data (categorical)
frame27['perc_asserted_cr'] # Y axis first trace data (floats)
frame27['perc_asserted_pub'] # Y axis second trace data (floats)
frame27['perc_ref_nodoi'] # Y axis third trace data (floats)


    3.  Histogram – by year 

With this visualization we compare the trends of the number of references whose DOI was specified by Crossref, by the publisher, or not specified at all, over the time range 1950-2022
frame28 = frame28[(frame28.year >= 1950)&(frame28.year < 2022)]
frame28 = frame28.groupby('year').sum()

X axis: years of DOAJ  articles
Y axis and color:   percentage of references having specified DOIs by crossref, publisher or not specified at all 

frame28['year'] # X axis data (ordinal)
frame28['perc_asserted_cr'] # Y axis first trace data (floats)
frame28['perc_asserted_pub'] # Y axis second trace data (floats)
frame28['perc_ref_nodoi'] # Y axis third trace data (floats)

Publishing Data

Software, dataset, and metadata publication


Metadata and datasets have been published in compressed formats on Zenodo; the software can be found on GitHub at this address: https://github.com/open-sci/2021-2022-la-chouffe-code and on Zenodo.
20220521_la_chouffe_clean_articles_v_0_0_1.tar.gz and 20220521_la_chouffe_articles_populated_v_0_0_1.tar are both released under a permissive licence, CC0.
20220521_la_chouffe_aggregated_data_v_0_0_1.pkl has a Creative Commons ShareAlike 4.0 licence.
The software created has an MIT licence.
