
Complete WGS data processing in HPC environment

This protocol is a draft, published without a DOI.
  • Institute of Bioorganic Chemistry, Polish Academy of Sciences
Open access
Protocol Citation: Ireneusz Stolarek, Michał Zeńczak, Małgorzata Marcinkowska-Swojak, Magdalena Rakoczy, Luiza Handschuh, Natalia Koralewska, Marek Figlerowicz 2023. Complete WGS data processing in HPC environment. protocols.io https://protocols.io/view/complete-wgs-data-processing-in-hpc-environment-czyex7te
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: September 14, 2023
Last Modified: December 26, 2023
Protocol Integer ID: 87782
Abstract
In the rapidly evolving landscape of genomics research, the advent of whole genome sequencing (WGS) has ushered in an era of unprecedented data volumes, necessitating robust and efficient computational strategies for data processing. This protocol delineates a comprehensive and optimized workflow specifically tailored for the intricate demands of WGS data analysis on High-Performance Computing (HPC) infrastructure, leveraging the Slurm scheduler for efficient resource management.
The protocol begins with meticulous data preprocessing, encompassing quality control, trimming, and adapter removal, while concurrently harnessing the parallel processing capabilities of HPC clusters. The subsequent alignment of preprocessed reads to a reference genome is performed utilizing parallelized tools, ensuring computational efficiency and minimizing analysis time. Variant calling, a pivotal step in WGS analysis, is seamlessly integrated, and the protocol underscores the exploitation of HPC resources for parallelized execution, essential for managing the computational intensity of this step.
Quality control metrics, variant annotation, and prioritization are addressed with a focus on HPC-specific considerations, providing a robust foundation for downstream analysis.
Emphasizing troubleshooting strategies, best practices, and recommendations, this protocol equips researchers with the tools necessary to navigate the complexities of WGS data processing on HPC infrastructure. By providing a structured and adaptable framework, this protocol serves as a valuable resource for genomics researchers aiming to harness the power of high-performance computing in the analysis of whole genome sequencing data.
Pre-Demultiplexing sequencing run QC
The first data obtained from an NGS run are not the familiar FASTQ files. They consist of raw, pixel-based intensity measurements taken by the instrument and stored as .bcl files, which are generated during the run by the Real-Time Analysis (RTA) software of the Illumina sequencing systems.


Software
Sequencing Analysis Viewer
* The run specifications that should be reached are listed on the Illumina website for each sequencing system
Install Sequencing Analysis Viewer
Perform the pre-demultiplexing sequencing run QC in SAV


Note
The input data for SAV are stored in:
- the InterOp folder
- the RunInfo.xml file
- the RunParameters.xml file

We recommend sanity checks on the signal intensity separation for each channel, the error rate (calculated as a percentage from the PhiX alignment), the Q30 fraction, and the fraction of clusters passing filter, to identify potential problems with a suboptimal run.
The Analysis tab

Expected result: screenshot of the SAV Analysis tab.

The Summary tab

Expected result: screenshot of the SAV Summary tab.

Software stack installation for HPC data processing
To process NGS data from a WGS experiment in an HPC environment, we recommend installing the necessary software through conda (or by other means).

Install conda
Software
miniconda

Download the Miniconda Linux installer:

Command
wget
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
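
After downloading, the installer can be run in batch mode. A minimal sketch, assuming the default ~/miniconda3 install prefix is acceptable:

Command
install miniconda
# -b runs the installer non-interactively and accepts the license; -p sets the install prefix
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
# initialize conda for bash so that environment activation works in new login shells
"$HOME/miniconda3/bin/conda" init bash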

Computational step
Install mamba
* mamba is an optional add-on that makes installation of the downstream software much more efficient. We highly recommend installing it.
* We recommend installing mamba in the base environment, as sketched below
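
A minimal sketch of the installation (the conda-forge channel is our assumption; follow your site's channel policy):

Command
install mamba
conda install -n base -c conda-forge mamba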
Optional
Create conda virtual environment through mamba

Command
create virtual env
mamba create -n wgs

Computational step
Activate conda virtual environment
* When working in scheduler mode (sbatch), remember to activate the virtual environment on the login (access) node before submitting the job to the Slurm scheduler
* When working in interactive mode (srun), activate the virtual environment after being allocated a computational node (srun --pty /bin/bash)
Command
mamba activate wgs

Computational step
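A short sketch of activation in both access modes (the job script name is a placeholder):

# scheduler mode: activate on the login node, then submit
mamba activate wgs
sbatch map_RUNID_SAMPLE_NAME.sl

# interactive mode: first request a computational node, then activate
srun --pty /bin/bash
mamba activate wgs
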
Critical
Install basic software into virtual environment

* Remember to conduct software installation while on the login (access) node, and after activating the selected virtual environment (conda activate <virtual_env_name>)
* Installing software while on a computational node might revert any changes to the virtual environment once you exit the node
Illumina data demultiplexing software

Software
bcl2fastq

Command
mamba install -c dranew bcl2fastq

NGS read alignment software

Software
bwa

Command
mamba install -c bioconda bwa

Pre and Post-Alignment QC software

Software
multiqc

Command
mamba install -c bioconda multiqc

Pre-alignment data filtering software


Software
AdapterRemoval


Command
mamba install -c bioconda adapterremoval

Post-Alignment data processing software


Software
picard

Command
mamba install -c bioconda picard

Software
samtools

Command
mamba install -c bioconda samtools

Variant Calling software

Software
GATK4

Command
mamba install -c bioconda gatk4

Create scheduling scripts
To fully automate HPC data processing, prepare the scheduling scripts below. Each script reads a list file passed as its first argument, copies a Slurm template (e.g. Demultiplex.sl), substitutes placeholder tokens (RUNID, FOLDER, SAMPLE_NAME, XXXX) with sed, and submits the generated script with sbatch.
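
A hypothetical layout of the list files consumed by these scripts (file names and values are examples only; columns are tab-separated):

# runs.txt - one run ID per line (demultiplexing script)
230512_A00123_0011_AHXXXXXXX

# samples.tsv - run ID, FASTQ folder, sample name (alignment script)
230512_A00123_0011_AHXXXXXXX	fastq_out	O8_0230_DNA205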

Demultiplexing scheduling script

#!/bin/bash
# Generate and submit one demultiplexing job per run ID listed in the input file
file=$1
while read -r run; do
echo -n "$run "
cp Demultiplex.sl "Demultiplex_${run}.sl"
sed -i "s/RUNID/${run}/g" "Demultiplex_${run}.sl"
sbatch "Demultiplex_${run}.sl"

done < "$file"

Alignment scheduling script

#!/bin/bash
# Generate and submit one alignment job per line of a tab-separated
# list file: run ID, FASTQ folder, sample name
file=$1
while IFS=$'\t' read -r runid folder sample; do

echo -n "$runid $folder $sample "

cp map.sl "map_${runid}_${sample}.sl"
sed -i "s/RUNID/${runid}/g" "map_${runid}_${sample}.sl"
# use '|' as the sed delimiter because the folder may contain '/'
sed -i "s|FOLDER|${folder}|g" "map_${runid}_${sample}.sl"
sed -i "s/SAMPLE_NAME/${sample}/g" "map_${runid}_${sample}.sl"
sbatch "map_${runid}_${sample}.sl"

done < "$file"

Genotyping scheduling script

#!/bin/bash
# Generate and submit one genotyping job per line of a tab-separated
# list file: run ID, sample name
file=$1
while IFS=$'\t' read -r runid sample; do

echo -n "$runid $sample "

cp genotype.sl "genotype_${runid}_${sample}.sl"
sed -i "s/RUNID/${runid}/g" "genotype_${runid}_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "genotype_${runid}_${sample}.sl"
sbatch "genotype_${runid}_${sample}.sl"

done < "$file"

Variant filtering scheduling script

#!/bin/bash
# Generate and submit one variant filtering job per line of a tab-separated
# list file: run ID, sample name
file=$1
while IFS=$'\t' read -r runid sample; do

echo -n "$runid $sample "

cp filter_variants.sl "filter_variants_${runid}_${sample}.sl"
sed -i "s/RUNID/${runid}/g" "filter_variants_${runid}_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "filter_variants_${runid}_${sample}.sl"
sbatch "filter_variants_${runid}_${sample}.sl"

done < "$file"

Variant annotation scheduling script

#!/bin/bash
# Generate and submit one annotation job per sample listed in the input file
file=$1
while read -r sample; do

echo -n "$sample "
cp vep.sl "vep_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "vep_${sample}.sl"
sbatch "vep_${sample}.sl"

done < "$file"

Variant prioritization scheduling script

#!/bin/bash
# Generate and submit one variant prioritization job per sample listed in the input file
file=$1
while read -r sample; do

echo -n "$sample "
cp vep_filter.sl "vep_filter_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "vep_filter_${sample}.sl"
sbatch "vep_filter_${sample}.sl"

done < "$file"

Create base workflow script
Here we introduce the scheme of a basic Slurm workflow script written in bash. Such a script automates all of the steps required to conduct the desired analyses.

The first part is the header. It contains the desired resources that will be allocated on a computing node for the scheduled job.


#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem=80gb
#SBATCH --time=168:00:00

The second part consists of all important placeholders and paths to the input and output data.


# set up $TMPDIR and JAVA options
export TMPDIR="/home/users/${USER}/path/${SLURM_JOBID}"
export _JAVA_OPTIONS=-Djava.io.tmpdir=/home/users/${USER}/path/${SLURM_JOBID}

# set up application settings
export SCR=${TMPDIR}

# set up input data directory
DIR_INPUT=/direct/path_to/input_data

# create $TMPDIR
mkdir -p ${TMPDIR}

# enter the $TMPDIR directory to conduct further computational steps in this directory

cd $TMPDIR

The third part consists of the workflow-specific computational commands.


echo "Hello world!"

* In the following sections we will modify only this third part of the workflow scripts.
Data Demultiplexing
In NGS data processing, demultiplexing is usually the first computational step.
Illumina-based data can be demultiplexed using the bcl2fastq software.


bcl2fastq -R $DIR_INPUT/illumina_data_catalog/ -p 26 -r 4 -w 4 --no-lane-splitting --ignore-missing-bcls
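
By default, bcl2fastq looks for SampleSheet.csv inside the run folder; the sheet and the output directory can also be given explicitly. A sketch with placeholder paths (-p, -r and -w set the numbers of processing, reading and writing threads, respectively):

bcl2fastq \
-R $DIR_INPUT/illumina_data_catalog/ \
--sample-sheet $DIR_INPUT/illumina_data_catalog/SampleSheet.csv \
--output-dir $TMPDIR/fastq_out \
-p 26 -r 4 -w 4 \
--no-lane-splitting \
--ignore-missing-bcls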


Sample sheet preparation

Expected result
[Header]
IEMFileVersion,5
Investigator Name,BM
Experiment Name,MOSAIC_seq00011
Date,2022-11-25
Workflow,GenerateFASTQ
Application,NovaSeq FASTQ Only
Instrument Type,NovaSeq
Assay,Illumina DNA Prep
Index Adapters,"IDT-Ilmn DNA-RNA UD Indexes SetB Tagmentation"
Description,MOSAIC_seq00011
Chemistry,Amplicon

[Reads]
151
151

[Settings]
Adapter,CTGTCTCTTATACACATCT

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
O8_0230_DNA205,O8_0230_DNA205,,,A04,UDP0121,AGAGAACCTA,UDP0121,TAGCATAACC,MOSAIC,MOSAIC00011
O8_0231_DNA206,O8_0231_DNA206,,,B04,UDP0122,GATATTGTGT,UDP0122,ACCGTGTGGT,MOSAIC,MOSAIC00011

Pre-Alignment QC
QC of the NGS data can be performed with a variety of tools. Here we present typical output of MultiQC, an interactive tool that aggregates QC reports:


multiqc fastq_data_dir/
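
Note that MultiQC aggregates reports produced by other QC tools rather than computing metrics itself. If no per-file reports exist yet, generate them first, e.g. with FastQC (assumed installed, e.g. via mamba install -c bioconda fastqc):

mkdir -p fastq_qc
fastqc -t 8 fastq_data_dir/*.fastq.gz -o fastq_qc/
multiqc fastq_qc/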




Expected result

Pre-alignment General Statistics (screenshot)

Pre-Alignment data filtering
Trim adapter sequences, ambiguous bases, and low-quality ends from the reads with AdapterRemoval:
AdapterRemoval \
--gzip \
--file1 read1.fastq.gz \
--file2 read2.fastq.gz \
--threads 24 \
--trimns \
--trimqualities \
--minquality 30 \
--minlength 25 \
--basename file_name

Alignment
Map the paired, adapter-trimmed reads (the .pair1/.pair2.truncated.gz files produced by AdapterRemoval)
bwa mem \
-t 25 \
-M \
$DIR_REF/Homo_sapiens_assembly38.fasta \
file_name.pair1.truncated.gz \
file_name.pair2.truncated.gz |
samtools sort -@ 27 -O BAM -o file_name.bam -


Indexing

samtools index file_name.bam

Removing duplicates

java -jar $PICARD MarkDuplicates \
I=file_name.bam \
O=file_name_rmdup.bam \
M=file_name_rmdupt.txt \
REMOVE_DUPLICATES=true \
ASSUME_SORTED=false \
VALIDATION_STRINGENCY=SILENT \
CREATE_INDEX=true

Adding read group information

java -jar $PICARD AddOrReplaceReadGroups \
I=file_name_rmdup.bam \
O=file_name_rmdup_RG.bam \
VALIDATION_STRINGENCY=SILENT \
SORT_ORDER=coordinate \
RGID=file_name \
RGLB=file_name \
RGPL=illumina \
RGPU=file_name \
RGSM=file_name \
CREATE_INDEX=true

Optimizing HPC resources
To optimize resource allocation for human WGS data processing, we first need to establish the minimal hardware requirements. The basic components are the number of CPUs used, the RAM usage, and the computation time. All three parameters are interrelated: the faster we want to finish the alignment step, the more CPUs we need to allocate, which in turn requires more RAM.
As a rule of thumb, we aim to finish the alignment within 24-36 hours (this mostly depends on the available hardware). Given that, the resources most optimal for parallelizing the computation are, per computational node:
- 24 CPUs
- 32 GB RAM


#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --mem=32gb
#SBATCH --time=168:00:00
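
To verify that the requested allocation matches actual usage, Slurm accounting can report per-job time and memory; a sketch (availability of seff depends on the cluster configuration):

# elapsed time, peak memory and CPU allocation of a finished job
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,AllocCPUS
# or the per-job efficiency summary, if seff is installed
seff <jobid>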

Note
Data storage requirements

Typical data sizes for a human WGS run with 24 multiplexed samples are:
- Pre-demultiplexing run folder (raw .bcl data) - 1.2 TB
- Raw fastq.gz - 50 GB x 2 reads (paired-end run)
- BAM - 80 GB
- VCF - 3 GB

Post-Alignment QC

Expected result

Post-Alignment General Statistics (screenshot)

Alignment Scores - percent of aligned reads (screenshot)

Additionally, as a sanity check, we advise calculating the average genome coverage for each sample:

# collect alignment statistics
samtools stats O4_2640_FAC_5668_S6_assembly38_rmdup_RG.bam > O4_2640_FAC_5668_S6_assembly38_rmdup_RG.txt

# average coverage across all reference positions (-a includes zero-coverage sites)
whole_genome_avg_cov=$(samtools depth -a O4_2640_FAC_5668_S6_assembly38_rmdup_RG.bam | awk '{sum+=$3}END{print sum/NR}')
# average coverage across covered (non-zero) positions only
non_zero_avg_cov=$(samtools depth O4_2640_FAC_5668_S6_assembly38_rmdup_RG.bam | awk '{sum+=$3}END{print sum/NR}')

echo "ID whole_genome_avg_cov non_zero_avg_cov" > O4_2640_FAC_5668_S6_cov.txt
echo O4_2640_FAC_5668_S6 $whole_genome_avg_cov $non_zero_avg_cov >> O4_2640_FAC_5668_S6_cov.txt
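
The same check can be scripted over all deduplicated BAM files in a directory; a sketch assuming the *_assembly38_rmdup_RG.bam naming used above:

echo "ID whole_genome_avg_cov non_zero_avg_cov" > coverage_summary.txt
for bam in *_assembly38_rmdup_RG.bam; do
id=${bam%_assembly38_rmdup_RG.bam}
wg=$(samtools depth -a "$bam" | awk '{sum+=$3}END{print sum/NR}')
nz=$(samtools depth "$bam" | awk '{sum+=$3}END{print sum/NR}')
echo "$id $wg $nz" >> coverage_summary.txt
done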

Variant Calling
Build the base quality score recalibration (BQSR) model from known polymorphic sites
gatk --java-options "-Xmx80g" BaseRecalibrator \
-I XXXX_assembly38_rmdup_RG.bam \
-R $GATK_resources/Homo_sapiens_assembly38.fasta \
--known-sites $GATK_resources/1000G_omni2.5.hg38.vcf.gz \
--known-sites $GATK_resources/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--known-sites $GATK_resources/hapmap_3.3.hg38.vcf.gz \
--known-sites $GATK_resources/Homo_sapiens_assembly38.known_indels.vcf.gz \
--known-sites $GATK_resources/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites $GATK_resources/Homo_sapiens_assembly38.dbsnp138.vcf \
-O recal_data.table
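
Optionally, the recalibration model can be inspected before it is applied; GATK4's AnalyzeCovariates renders the covariates into a PDF report (R with the required plotting packages is assumed to be available on the node):

gatk AnalyzeCovariates \
-bqsr recal_data.table \
-plots recal_data_plots.pdf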


Apply the BQSR model
gatk --java-options "-Xmx80g" ApplyBQSR \
-I XXXX_assembly38_rmdup_RG.bam \
-R $GATK_resources/Homo_sapiens_assembly38.fasta \
--bqsr-recal-file recal_data.table \
-O XXXX_assembly38_rmdup_RG_BSQR.bam

Indexing

samtools index XXXX_assembly38_rmdup_RG_BSQR.bam

Variant calling
gatk --java-options "-Xmx80g" HaplotypeCaller \
-I XXXX_assembly38_rmdup_RG_BSQR.bam \
-R $GATK_resources/Homo_sapiens_assembly38.fasta \
-O XXXX_assembly38_raw.vcf.gz

Variant filtering

SNP Variant recalibration

gatk --java-options "-Xmx24g -Xms24g" VariantRecalibrator \
-R $GATK_resources/Homo_sapiens_assembly38.fasta \
-V XXXX_assembly38_raw.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
--max-gaussians 6 \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
-mode SNP \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 $GATK_resources/hapmap_3.3.hg38.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 $GATK_resources/1000G_omni2.5.hg38.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 $GATK_resources/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $GATK_resources/Homo_sapiens_assembly38.dbsnp138.vcf \
-O XXXX_VSQR_snps.recal \
--tranches-file XXXX_VSQR_snps.tranches \
--rscript-file XXXX_VSQR_snps.plots.R

Apply SNP VQSR

gatk --java-options "-Xmx5g -Xms5g" ApplyVQSR \
-V XXXX_assembly38_raw.vcf.gz \
--recal-file XXXX_VSQR_snps.recal \
--tranches-file XXXX_VSQR_snps.tranches \
--truth-sensitivity-filter-level 99.5 \
--create-output-variant-index true \
-mode SNP \
-O XXXX_snp_recalibrated.vcf.gz


Indel Variant recalibration

gatk --java-options "-Xmx24g -Xms24g" VariantRecalibrator \
-R $GATK_resources/Homo_sapiens_assembly38.fasta \
-V XXXX_snp_recalibrated.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
-an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP \
-mode INDEL \
--max-gaussians 4 \
-resource:mills,known=false,training=true,truth=true,prior=12 $GATK_resources/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-resource:axiomPoly,known=false,training=true,truth=false,prior=10 $GATK_resources/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2 $GATK_resources/Homo_sapiens_assembly38.dbsnp138.vcf \
-O XXXX_VSQR_indels.recal \
--tranches-file XXXX_VSQR_indels.tranches

Apply indel VQSR

gatk --java-options "-Xmx5g -Xms5g" ApplyVQSR \
-V XXXX_snp_recalibrated.vcf.gz \
--recal-file XXXX_VSQR_indels.recal \
--tranches-file XXXX_VSQR_indels.tranches \
--truth-sensitivity-filter-level 99.5 \
--create-output-variant-index true \
-mode INDEL \
-O XXXX_snp_indel_recalibrated.vcf.gz

Prioritize Variants

gatk --java-options "-Xmx32g" SelectVariants \
-R $GATK_resources/Homo_sapiens_assembly38.fasta \
-V XXXX_snp_indel_recalibrated.vcf.gz \
-O XXXX_snp_indel_recalibrated_PASS.vcf.gz \
--exclude-filtered true \
--exclude-non-variants true \
--remove-unused-alternates true \
--sample-name XXXX

Functional Annotation
Annotate variants with VEP

bgzip -d XXXX_snp_indel_recalibrated_PASS.vcf.gz

#VEP annotate VQSR-filtered WGS results
#Output VEP VCF
vep --cache \
--pick_allele \
--individual all \
--fork 8 \
--vcf \
--sift b \
--polyphen b \
--ccds \
--hgvs \
--symbol \
--canonical \
--af_1kg \
--max_af \
--offline \
--merged \
-fasta $GATK_resources/Homo_sapiens_assembly38.fasta \
-synonyms /home/users/irekmosaic/.vep/homo_sapiens_merged/108_GRCh38/chr_synonyms.txt \
-i XXXX_snp_indel_recalibrated_PASS.vcf \
-o XXXX_snp_indel_recalibrated_PASS_vep.vcf \
--force_overwrite


bgzip XXXX_snp_indel_recalibrated_PASS_vep.vcf

Filter variants


bgzip -d XXXX_snp_indel_recalibrated_PASS_vep.vcf.gz

filter_vep --format vcf \
-i XXXX_snp_indel_recalibrated_PASS_vep.vcf \
-o XXXX_snp_indel_recalibrated_PASS_vep_filtered.vcf \
--filter "((IMPACT is HIGH) or (IMPACT is MODERATE and (SIFT match deleterious or PolyPhen match damaging)))" \
--force_overwrite

bgzip XXXX_snp_indel_recalibrated_PASS_vep_filtered.vcf
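
Many downstream tools expect indexed VCFs; the bgzip-compressed output can be indexed with tabix (part of HTSlib, assumed installed e.g. via mamba install -c bioconda htslib):

tabix -p vcf XXXX_snp_indel_recalibrated_PASS_vep_filtered.vcf.gz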

Protocol references
Van der Auwera GA & O'Connor BD. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st Edition). O'Reilly Media.
Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, Shakir K, Thibault J, Chandran S, Whelan C, Lek M, Gabriel S, Daly MJ, Neale B, MacArthur DG, Banks E. (2017). Scaling accurate genetic variant discovery to tens of thousands of samples bioRxiv, 201178. DOI: 10.1101/201178
Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinformatics, 43:11.10.1-11.10.33. DOI: 10.1002/0471250953.bi1110s43.
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43:491-498. DOI: 10.1038/ng.806.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20:1297-303. DOI: 10.1101/gr.107524.110.
Li H, Durbin R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760. DOI: 10.1093/bioinformatics/btp324.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2):giab008. DOI: 10.1093/gigascience/giab008.
Schubert M, Lindgreen S, Orlando L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes, 9:88. DOI: 10.1186/s13104-016-1900-2.