During the initial setup, the "argparse" module handles instructions passed directly from the command line: the path of the ZIP file, where to save the output CSV files, and how many files to handle at once (a single chunk of the dump typically comprises around 1600 .json.gz files). This configuration ensures readiness for the subsequent steps. Below is a list of the accepted arguments, followed by a sketch of the parser setup:
"zip_filename": the path to the ZIP file containing compressed JSON files.
"output_filenames": one or more filenames for the output CSV files.
"batch_size": number of files to process concurrently in each batch (default is 10).
"max_files": maximum number of files to process.
"max_workers": number of worker threads for concurrent processing (default is 2).
PeerExtractor() class initialization:
After the setup, the PeerExtractor class is initialized. This class takes the following parameters (a minimal sketch of the constructor follows the list):
"zip_filename": the path to the ZIP file containing compressed JSON files.
"batch_size": the number of files to process concurrently in each batch, with a default value of 10.
"max_workers": the maximum number of threads for concurrent processing, with a default value of 2.
With the setup complete, the script begins working through the ZIP file. The main extraction method opens the ZIP file, identifies the relevant files, and divides them into manageable batches of the user-specified size. It oversees the entire process, ensuring smooth operation from start to finish.
Here's a detailed list of the arguments it takes:
- "csv_writer": an instance of the CSVWriter function used for writing CSV files. This logic was implemented primarily for testing purposes because, on average, every 1600 files need CSV conversion.
- If "max_files" is specified, the function limits the number of files processed, mainly for testing purposes. After that, the function processes files in batches using the "process_batch" method.
When dealing with a large number of files, the best choice is often to handle them in smaller groups. The batching method divides the task into smaller parts that the script can handle more efficiently, without overloading the system, which matters especially on less powerful machines. Here's a detailed list of the arguments it takes:
- "n": the size of each batch.
This generator method splits a large list into smaller chunks, or batches, each of size n. It iterates over the input list and yields successive slices of it, allowing for batch processing. As noted, this is useful in scenarios where processing the entire list at once would be inefficient or consume too much memory. A minimal sketch follows:
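```python
def batch(lst, n):
    # Yield successive slices of size n (the last one may be shorter);
    # the function name is an assumption.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
```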
Once the files are grouped together, it's time to start working on them. This method concurrently processes a given batch of files using a thread pool managed by the ThreadPoolExecutor class from "concurrent.futures". For each file in the batch, a separate thread handles the file processing, which involves reading and extracting peer review items. The results from each thread are collected and combined into a single list. Running threads concurrently allows multiple files to be processed simultaneously, significantly reducing the overall processing time compared to sequential processing; the sketch below illustrates the idea.
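A minimal sketch, assuming a "process_file" callable that implements the per-file step described next:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(zip_file, batch, process_file, max_workers=2):
    # Submit one task per file, then merge the per-file item lists.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_file, zip_file, name) for name in batch]
        for future in futures:
            results.extend(future.result())
    return results
```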
Next, this method reads each compressed file contained in the larger ZIP file and hands it over for further processing, ensuring each file is handled properly. Here's a detailed list of the arguments it takes (a minimal sketch follows the list):
- "zip_file": the opened ZIP file object.
- "file_to_process": the file name to process.
"process_json_data" method:
The purpose of this method is essentially to decompress and decode the JSON data. It then attempts to parse the resulting string as JSON, handling potential parsing errors by logging any problematic data. Once parsed, the method checks the structure of the JSON data to determine whether it contains a dictionary with an "items" key or is a list directly. This check was implemented because the structure of the data differs between peer review information and non-peer review information. Finally, the method extracts the peer review items by filtering the relevant JSON data. A minimal sketch follows:
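```python
import gzip
import json
import logging

def process_json_data(compressed_bytes):
    # Gunzip and decode the member, then parse it, tolerating bad input.
    try:
        data = json.loads(gzip.decompress(compressed_bytes).decode("utf-8"))
    except (OSError, json.JSONDecodeError):
        logging.warning("Skipping a file containing unparsable JSON data")
        return []
    # The dumps are not uniform: some files wrap the records in a dict
    # under an "items" key, others are a bare list of records.
    items = data.get("items", []) if isinstance(data, dict) else data
    # Keep only the peer review records.
    return [item for item in items if item.get("type") == "peer-review"]
```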
OciProcess() class initialization:
This class is responsible for preparing the necessary tools and data for converting DOIs into OCIs (Open Citation Identifiers). Here's a detailed explanation of the initialization process and the arguments it takes:
- "lookup_csv": the CSV file with character-to-code mappings (defaults to LOOKUP_CSV).
- "crossref_code": the prefix for CrossRef DOIs (defaults to CROSSREF_CODE).
During initialization, the class first reads the CSV file specified by the "lookup_csv" argument to set up a dictionary that maps characters found in DOIs to specific codes. Next, it sets the CrossRef prefix using the "crossref_code" argument, which is later prepended during the conversion of DOIs to OCIs. This initialization ensures that the "OciProcess" instance is equipped with the necessary data to carry out the conversion, making subsequent citation processing tasks more efficient. A minimal sketch follows:
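A sketch of the constructor, under the assumption that the defaults look as follows (in COCI, the supplier prefix used for Crossref OCIs is "020"):

```python
LOOKUP_CSV = "lookup.csv"  # assumed default path of the mapping file
CROSSREF_CODE = "020"      # OCI supplier prefix used for Crossref in COCI

class OciProcess:
    def __init__(self, lookup_csv=LOOKUP_CSV, crossref_code=CROSSREF_CODE):
        self.lookup_csv = lookup_csv
        self.lookup_dic = {}    # character -> code mappings
        self.lookup_code = -1   # highest code assigned so far
        self.init_lookup_dic(lookup_csv)  # populate the dictionary (see below)
        self.crossref_code = crossref_code
```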
"init_lookup_dic" method:
This method is responsible for setting up the lookup dictionary by reading data from a CSV file and populating it with character-to-code mappings. This is an essential part of the preparation process, ensuring the dictionary is ready for later use. Here's a detailed list of the arguments it takes:
- "lookup_csv": the path to the CSV file containing character-to-code mappings.
The method uses csv.DictReader() to read the CSV file, which allows for easy lookup of the mappings, and it initializes the "lookup_code" to the highest code found in the CSV, so that any new codes can be assigned sequentially without conflicts. In this way, the "lookup_dic" dictionary is correctly populated with the existing mappings and prepared to handle new characters as needed. A sketch follows, assuming the COCI column names "c" and "code":
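```python
import csv

def init_lookup_dic(self, lookup_csv):
    # Load every existing character-to-code mapping and remember the
    # highest code seen, so new characters get the next free code.
    with open(lookup_csv, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            self.lookup_dic[row["c"]] = row["code"]
            self.lookup_code = max(self.lookup_code, int(row["code"]))
```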
"calc_next_lookup_code" method:
This method calculates the next available code for the lookup dictionary, ensuring that each code is unique and assigned sequentially, and it handles the transitions between numeric ranges once a range is exhausted. This systematic approach prevents duplicate codes and maintains a logical order. The method takes no arguments besides the instance itself; a sketch of the logic, following the COCI implementation this class derives from, is shown below:
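```python
def calc_next_lookup_code(self):
    # Codes simply increment within a range; when the remainder reaches 89
    # the code jumps to the next, wider range (e.g. 89 -> 900, 989 -> 9900).
    remainder = self.lookup_code % 100
    next_code = self.lookup_code + 1
    if remainder == 89:
        next_code = next_code * 10
    self.lookup_code = next_code
```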
"update_lookup" method:
The purpose of this method is to check whether a character, specified by "c", is already in the dictionary and, if not, to add it with a new code, thereby updating the lookup dictionary with a new character-to-code mapping when needed. Here's a detailed list of the arguments it takes:
- "c": the character to look up and, if missing, to add.
The method calculates the next available code by calling "calc_next_lookup_code" and appends the new mapping to the CSV file using "write_txtblock_on_csv". This ensures that the lookup dictionary remains up to date and consistent with the stored CSV file; a sketch follows.
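```python
def update_lookup(self, c):
    # Only characters without a code yet trigger an update.
    if c not in self.lookup_dic:
        self.calc_next_lookup_code()
        code = str(self.lookup_code)
        self.lookup_dic[c] = code
        # Persist the new mapping so later runs see a consistent table
        # (the exact CSV quoting is an assumption).
        self.write_txtblock_on_csv(self.lookup_csv, '\n"%s","%s"' % (c, code))
```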
"write_txtblock_on_csv" method:
This method ensures data is stored in the right place for later use by appending a text block to a CSV file and creating directories if necessary. Here's a detailed list of the arguments it takes:
- "csv_path": the path to the CSV file to append to.
- the block of text to append.
The method uses "os.makedirs" to ensure the directory exists before writing the file, and it handles potential errors related to directory creation, raising exceptions if issues occur. By writing the block to the file specified by "csv_path" and creating any missing directories, it helps maintain persistent storage of the lookup dictionary. A minimal sketch follows:
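```python
import os

def write_txtblock_on_csv(self, csv_path, block_txt):
    # Create any missing parent directories, then append the block;
    # os.makedirs raises an exception if the directory cannot be created.
    directory = os.path.dirname(csv_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(csv_path, "a", encoding="utf-8") as f:
        f.write(block_txt)
```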
"convert_doi_to_ci" method:
Now comes the main task of this group of methods: converting the data. This method takes a DOI and turns it into a different format, ready for use, by converting a DOI string to an OCI string through the lookup dictionary. Here's a detailed list of the arguments it takes:
- "doi_str": the DOI string to convert.
By calling "match_str_to_lookup", the method translates each character of "doi_str" into its corresponding code using the lookup dictionary, and it prepends the CrossRef prefix to the translated string, producing a standardized OCI. This conversion is crucial for uniquely identifying and processing DOIs in a consistent format; a sketch follows.
"match_str_to_lookup" method:
This method checks the dictionary and assigns a code to each character, making sure everything is translated correctly; it converts a substring of the DOI into its corresponding code sequence. Here's a detailed list of the arguments it takes:
- the substring of the DOI to translate.
The flow of the method is the following: it translates the substring into its corresponding code sequence using the lookup dictionary, and if a character is encountered for the first time, it updates the dictionary to include the new character-to-code mapping. By systematically translating each character, it generates a code sequence that uniquely represents the DOI substring, ensuring that DOIs can be consistently mapped to OCI strings for citation identification and processing purposes. A sketch follows:
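A sketch mirroring the COCI implementation, in which the constant "10." DOI prefix is skipped before translation:

```python
def match_str_to_lookup(self, str_val):
    # Skip the constant "10." DOI prefix, then map every remaining
    # character to its code, registering unseen characters on the fly.
    ci_str = ""
    for c in str_val[3:]:
        if c not in self.lookup_dic:
            self.update_lookup(c)
        ci_str += str(self.lookup_dic[c])
    return ci_str
```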
"CSVWriter" class initialization
Before writing data into the CSV files, there is a setup phase that prepares the CSVWriter class, like getting the right tools ready before starting work. The class is designed to handle the output filenames: if a single filename is provided, it is stored as a one-element list; otherwise the filenames are kept as they are. This ensures that the class can handle both single and multiple output files efficiently. Here's a detailed list of the arguments it takes:
- "output_filenames": one or more filenames for the output CSV files.
Once everything is set up, it's time to write data into the CSV files. This method opens each output file and writes the data into it, creating one row per peer review item and repeating the process until all data is written. It is also responsible for creating a unique identifier for each peer review relationship using the Open Citation Identifier (OCI) format, so that each relationship between citing and cited entities has a distinct identifier; this helps in organizing and referencing the peer review data accurately. A hypothetical sketch of the writing step follows.
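A hypothetical sketch, assuming the items have already been reduced to (citing DOI, cited DOI) pairs; the function name and column names are assumptions:

```python
import csv

def write_rows(output_filename, doi_pairs, oci_process):
    with open(output_filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["oci", "citing_doi", "cited_doi"])
        for citing_doi, cited_doi in doi_pairs:
            # An OCI joins the two encoded DOIs with a dash.
            oci = (oci_process.convert_doi_to_ci(citing_doi) + "-" +
                   oci_process.convert_doi_to_ci(cited_doi))
            writer.writerow([oci, citing_doi, cited_doi])
```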
"remove_duplicates" method:
After writing the data, there is a cleanup step. This method ensures that there are no duplicate entries in the CSV files: it reads a CSV file, removes any duplicate rows based on the OCI (Open Citation Identifier), and saves the cleaned-up data into a new file. In short, it guarantees that the final CSV files contain unique peer review items, avoiding any redundancy or repetition. A sketch follows, assuming the OCI column is named "oci":
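```python
import csv

def remove_duplicates(input_csv, output_csv):
    # Keep the first occurrence of each OCI and drop all later ones.
    seen = set()
    with open(input_csv, newline="", encoding="utf-8") as fin, \
         open(output_csv, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["oci"] not in seen:
                seen.add(row["oci"])
                writer.writerow(row)
```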
Main function (argument parsing and execution flow):
The main function is responsible for parsing command-line arguments, initializing necessary components, and coordinating the entire processing workflow. Here’s a breakdown of its flow:
In essence, the main function acts as the captain steering the ship: it listens for commands, prepares the necessary tools (CSVWriter and PeerExtractor), and initiates the processing workflow. This involves processing the JSON files, writing peer review data to the CSV files, and removing duplicates from the final output. By orchestrating these tasks, the main function ensures a smooth and organized execution of the entire process. During development it was chosen to save two CSV files in the peer review extraction phase: one that is left unaltered and one that is filtered of duplicates as described above. This choice was made so that the final user can safely decide which file to use for his or her purposes. A high-level sketch of the orchestration follows.
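A hypothetical sketch of the flow; the method and helper names are assumptions built on the sketches above:

```python
def main():
    args = parse_args()  # the argparse setup sketched at the beginning
    csv_writer = CSVWriter(args.output_filenames)
    extractor = PeerExtractor(args.zip_filename,
                              batch_size=args.batch_size,
                              max_workers=args.max_workers)
    # Extract the peer review items and write them out (method name assumed).
    extractor.extract_files(csv_writer, max_files=args.max_files)
    # Keep the raw file and additionally produce a deduplicated copy.
    for filename in args.output_filenames:
        remove_duplicates(filename, filename.replace(".csv", "_deduplicated.csv"))
```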
The extraction of the items related to non-peer-review entries follows the exact same logic, of course with some modifications: mainly, the "type" key of interest in the JSON files at this stage is no longer "peer-review" but everything else. The second and last modification is that everything regarding OCI creation and its related implementation is absent from the "NonPeerExtractor.py" software, because there was no need to calculate the OCI.
It is necessary to mention that the "OciProcess" class and all its methods were taken from the COCI workflow contained in this GitHub page.