May 11, 2022

Bioinformatic workflow for NGS data control

  • 1 AIDS Reference Laboratory, Department of Clinical Microbiology, University Hospital of Liege, 4000 Liege, Belgium
Protocol Citation: Khalid El Moussaoui 2022. Bioinformatic workflow for NGS data control. protocols.io https://dx.doi.org/10.17504/protocols.io.8epv59bnjg1b/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: In development
We are still developing and optimizing this protocol
Created: May 11, 2022
Last Modified: May 11, 2022
Protocol Integer ID: 62429
Disclaimer
DISCLAIMER – FOR INFORMATIONAL PURPOSES ONLY; USE AT YOUR OWN RISK

The protocol content here is for informational purposes only and does not constitute legal, medical, clinical, or safety advice, or otherwise; content added to protocols.io is not peer reviewed and may not have undergone a formal approval of any kind. Information presented in this protocol should not substitute for independent professional judgment, advice, diagnosis, or treatment. Any action you take or refrain from taking using or relying upon the information presented here is strictly at your own risk. You agree that neither the Company nor any of the authors, contributors, administrators, or anyone else associated with protocols.io, can be held responsible for your use of the information contained in or linked to this protocol or any of our Sites/Apps and Services.
Abstract
Workflow for data integrity and quality control of high-throughput sequencing data generated on an Illumina NovaSeq 6000. The analyses are performed on macOS Monterey 12.3.1 running on an ARM-based Apple Silicon processor. This workflow assumes that the user directory (~/) is structured as described in the "work environment configuration" protocol. To avoid error messages, please follow that protocol and set up your computer before starting.
Activation of the environment
Open a terminal window.
Software: Terminal (OS: macOS Monterey 12.3.1; developer: Apple Inc.)

Activate the previously created QC_env environment by typing the following command in the terminal:
Command
conda activate QC_env

Data integrity check
This step assumes that the .gz archive downloaded from the GIGA servers has been unzipped under ~/fastq_files, that the original_md5.txt file is stored under ~/md5, and that the previously created Python and R scripts are stored under ~/KE_utilities. Type the following command in the terminal to recompute the MD5 hash of each file and store the results in a new file under ~/md5:
Command
md5 ~/fastq_files/* > ~/md5/recomputed_md5.txt
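
For machines where the macOS md5 command is not available, the same listing could be produced in Python. The following is a minimal sketch, not part of the original protocol: the function names md5_of_file and write_md5_listing are hypothetical, and the output deliberately mimics the "MD5 (path) = digest" line format printed by the macOS md5 command.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks
    so that large FASTQ archives are never loaded fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_listing(input_dir, output_file):
    """Write one 'MD5 (path) = digest' line per regular file in input_dir.

    Hypothetical helper: the line format imitates the macOS md5 command.
    Returns the list of lines that were written.
    """
    input_dir = Path(input_dir).expanduser()
    output_file = Path(output_file).expanduser()
    lines = []
    for path in sorted(input_dir.iterdir()):
        if path.is_file():
            lines.append(f"MD5 ({path}) = {md5_of_file(path)}")
    output_file.write_text("\n".join(lines) + "\n")
    return lines
```

With the directory layout assumed by this protocol, the call would be write_md5_listing("~/fastq_files", "~/md5/recomputed_md5.txt").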

After generating the ~/md5/recomputed_md5.txt file, type the following command in the terminal to launch the Python script that performs the data integrity check:
Command
python3 ~/KE_utilities/data_integrity_checker.py
When prompted, enter the path to the original_md5.txt file and then the path to the recomputed_md5.txt file:
Command
*************** DATA INTEGRITY CHECKER ***************

Please enter the path to original_md5.txt : /users/khalid/md5/original_md5.txt
Please enter the path to recomputed_md5.txt : /users/khalid/md5/recomputed_md5.txt

------------------------------------------------------
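
The data_integrity_checker.py script itself is not reproduced in this protocol. The following is a minimal sketch of how such a comparison might work, not the author's implementation. It assumes that recomputed_md5.txt contains lines in the macOS md5 format ("MD5 (path) = digest") and that original_md5.txt uses either that format or the md5sum format ("digest  filename"). Files are matched by base name. All function names are hypothetical.

```python
import re
from pathlib import Path

# "MD5 (/path/to/file) = <digest>", as printed by the macOS md5 command
BSD_LINE = re.compile(r"^MD5 \((?P<name>.+)\) = (?P<digest>[0-9a-fA-F]{32})$")
# "<digest>  file", as printed by md5sum (assumed format of original_md5.txt)
GNU_LINE = re.compile(r"^(?P<digest>[0-9a-fA-F]{32})\s+\*?(?P<name>.+)$")


def parse_md5_lines(lines):
    """Map file base name -> lowercase digest; unrecognized lines are skipped."""
    digests = {}
    for line in lines:
        line = line.strip()
        match = BSD_LINE.match(line) or GNU_LINE.match(line)
        if match:
            digests[Path(match["name"]).name] = match["digest"].lower()
    return digests


def compare_md5(original, recomputed):
    """Return (matching, mismatched, missing) lists of file names.

    'missing' lists files present in the original listing but absent
    from the recomputed one.
    """
    matching, mismatched, missing = [], [], []
    for name, digest in sorted(original.items()):
        if name not in recomputed:
            missing.append(name)
        elif recomputed[name] == digest:
            matching.append(name)
        else:
            mismatched.append(name)
    return matching, mismatched, missing
```

Each file would be read with something like parse_md5_lines(open(path).read().splitlines()). Any entry in the mismatched or missing lists indicates that the corresponding file should be downloaded again before continuing.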

Run fastQC
Start the FastQC analysis on every file in the ~/fastq_files directory. The shell expands the "*" wildcard to all entries of that directory; it does not descend into subdirectories. The --outdir option specifies the output directory for the reports generated by FastQC; this directory must already exist, as FastQC does not create it. FastQC generates an individual .html report for each input file.
Command
fastqc version : v. 0.11.9
fastqc ~/fastq_files/* --outdir ~/fastqc_reports/

A generated report can be opened by typing the following command in the terminal:
Command
open ~/fastqc_reports/KE0xx_R1_fastqc.html
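
Besides the .html report, FastQC by default also writes a .zip archive per input file, which contains a tab-separated summary.txt listing a PASS, WARN or FAIL status for each analysis module. When many samples are processed, these summaries can be screened without opening every report. The following is a minimal sketch under that assumption; the function names are hypothetical and are not part of FastQC.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip."""
    statuses = {}
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt sits in a top-level folder named after the report
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        for line in archive.read(member).decode("utf-8").splitlines():
            fields = line.split("\t")  # status, module, input file name
            if len(fields) >= 2:
                statuses[fields[1]] = fields[0]
    return statuses


def flagged_modules(zip_path, levels=("WARN", "FAIL")):
    """Return the sorted names of modules whose status is in levels."""
    summary = read_fastqc_summary(zip_path)
    return sorted(module for module, status in summary.items() if status in levels)
```

For example, flagged_modules("KE001_R1_fastqc.zip") would list the modules that deserve a closer look in the corresponding .html report; KE001 is a placeholder sample name.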

Run multiQC
To summarize the reports generated by FastQC in a single report, run MultiQC by typing the following command in the terminal:
Command
multiqc version : v. 1.12
multiqc ~/fastqc_reports --outdir ~/multiqc_report

The generated report can be opened by typing the following command in the terminal:
Command
open ~/multiqc_report/multiqc_report.html

Filter reads with fastp
The reads can be filtered automatically with fastp. Launch the program, specify the two .fastq.gz files (R1 and R2) as input with -i and -I, and specify the name and location of the two processed files with -o and -O. The -h option sets the path of the HTML report. Passing an empty string to -j is intended to skip the creation of the JSON report. The -R option sets the title of the generated HTML report. The command is split over several lines for readability; each line except the last must end with a backslash so that the shell reads it as a single command.
Command
fastp version : v. 0.23.2
fastp -i ~/fastq_files/KE0xx_R1.fastq.gz \
  -I ~/fastq_files/KE0xx_R2.fastq.gz \
  -o ~/fastp/cleaned_fastq_files/KE0xx_R1_clean.fastq.gz \
  -O ~/fastp/cleaned_fastq_files/KE0xx_R2_clean.fastq.gz \
  -h ~/fastp/fastp_reports/KE0xx_fastp_report.html \
  -j "" \
  -R "Fastp report : KE0xx"
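
When several samples have to be filtered, typing this command once per sample becomes error-prone. The following is a minimal sketch, not part of the original protocol, that builds the same fastp argument list for each sample. It assumes the KE0xx_R1.fastq.gz / KE0xx_R2.fastq.gz naming convention and the directory layout used above; the function names and the sample ID KE001 are hypothetical.

```python
from pathlib import Path


def find_paired_samples(fastq_dir):
    """Return sample IDs that have both an _R1 and an _R2 file in fastq_dir."""
    fastq_dir = Path(fastq_dir).expanduser()
    samples = []
    for r1 in sorted(fastq_dir.glob("*_R1.fastq.gz")):
        sample_id = r1.name[: -len("_R1.fastq.gz")]
        if (fastq_dir / f"{sample_id}_R2.fastq.gz").is_file():
            samples.append(sample_id)
    return samples


def build_fastp_command(sample_id, home="~"):
    """Return the fastp argument list for one paired-end sample,
    using the same options as the command shown in this protocol."""
    home = Path(home).expanduser()
    raw = home / "fastq_files"
    clean = home / "fastp" / "cleaned_fastq_files"
    reports = home / "fastp" / "fastp_reports"
    return [
        "fastp",
        "-i", str(raw / f"{sample_id}_R1.fastq.gz"),
        "-I", str(raw / f"{sample_id}_R2.fastq.gz"),
        "-o", str(clean / f"{sample_id}_R1_clean.fastq.gz"),
        "-O", str(clean / f"{sample_id}_R2_clean.fastq.gz"),
        "-h", str(reports / f"{sample_id}_fastp_report.html"),
        "-j", "",  # empty JSON path, as in the protocol's command
        "-R", f"Fastp report : {sample_id}",
    ]
```

Each argument list can then be executed with subprocess.run(build_fastp_command(sample_id), check=True) inside a loop over find_paired_samples("~/fastq_files"). Passing the arguments as a list avoids shell quoting problems with the report title.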