May 24, 2023

Investigation and identification of somatic and germline variants for colorectal cancer exomes using the NGS pipeline: a computational analysis perspective

  • Department of Biotechnology, RV College of Engineering, Bangalore - 560059
Protocol Citation: Chandrashekar K, Anagha S Setlur, Vidya Niranjan 2023. Investigation and identification of somatic and germline variants for colorectal cancer exomes using the NGS pipeline: a computational analysis perspective. protocols.io https://dx.doi.org/10.17504/protocols.io.x54v9dqxqg3e/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: May 23, 2023
Last Modified: May 24, 2023
Protocol Integer ID: 82294
Keywords: Colorectal cancer exomes, quality controls, somatic & germline variants, annotation & post-processing, computational analysis
Disclaimer
This protocol was run with colorectal cancer exome datasets. However, the SRR-IDs of the datasets have not been mentioned to make this protocol universal. The steps may be followed for any cancer exome, with the intent of investigating the somatic and germline mutations.
Abstract
A thorough analysis of colorectal cancer exomes reveals potential mutations, such as single nucleotide polymorphisms, that can be beneficial for early detection of the disease. A comprehensive computational protocol that identifies, investigates and analyzes such variants therefore makes early disease detection considerably simpler. In brief, cancer exome datasets were retrieved from publicly available databases and subjected to quality control checks. The datasets that passed quality control were then aligned to the human reference genome. Somatic and germline variants were identified and called separately, with specific tools for each case: HaplotypeCaller was employed for germline variant identification, and Mutect2 for somatic variants. The identified variants were then normalized, annotated and post-processed using SnpEff, wANNOVAR and VEP. This protocol helps garner insights into the various alterations that may lead to colorectal cancer and suggests the most important identified genes as candidates for wet-lab experimentation.


Guidelines
The commands mentioned in the protocol can be run in the Linux terminal.
Sufficient storage space is recommended when running this protocol, as intermediate FASTQ, SAM and BAM files can be large.
Materials
Tools mentioned in the current protocol must be installed in the system before running the commands for the NGS pipeline. These include: SRAtoolkit, FastQC, MultiQC, Cutadapt, Bowtie2, Samtools, Picard, GATK and bcftools. wANNOVAR and VEP are available online, and SnpEff is command-line based.

Safety warnings
Nil
Ethics statement
The datasets used to arrive at this protocol were all from publicly available databases such as NCBI-SRA.
Before start
All the tools listed in the Materials section must be installed and checked for correct installation.
COLORECTAL CANCER EXOME RETRIEVAL
Retrieval of colorectal cancer exomes

This protocol was run with colorectal cancer exome datasets; however, the SRR-IDs of the datasets have not been mentioned, to keep this protocol universal. The NCBI-SRA database (https://www.ncbi.nlm.nih.gov/sra) was employed to retrieve the colorectal cancer exome datasets, and the SRR-IDs were noted down for further use. The exome datasets were selected based on a set of criteria: the library strategy must be whole-exome sequencing, and the exomes must be of genomic source. In the present protocol, all the selected sequences were of paired-end layout, with Illumina used for sequencing and acquiring the reads. Other criteria can be applied as desired, depending on the user's requirements. A minimal download sketch is given below.
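As a minimal sketch (assuming SRAtoolkit is installed and SRRID stands in for a real accession), the raw paired-end reads can be fetched and converted to FASTQ as follows:
Command
Retrieving raw reads with SRAtoolkit (illustrative sketch)
$ prefetch SRRID
$ fasterq-dump SRRID --split-files   # writes SRRID_1.fastq and SRRID_2.fastq for paired-end data
$ gzip SRRID_1.fastq SRRID_2.fastq   # compress to match the file names used below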
QUALITY CONTROL
Quality control assessments

Prior to calling variants, the quality of raw data was assessed using the below-mentioned steps. The codes used for running the quality checks are provided in the following sections.
Quality Check using FastQC and MultiQC (for multiple datasets)

FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC (https://multiqc.info/) were used in the present study to assess sequencing data quality, identify any possible issues with the raw sequence reads and generate informative reports. Mean sequence quality per read and per base, nucleotide content per base position, GC distribution, etc. were analyzed via FastQC. A cumulative report of all FastQC outcomes was obtained from MultiQC.

For multiple datasets, FastQC and MultiQC commands used were:
Command
FastQC
$ fastqc *.fastq.gz   # run FastQC on every FASTQ file in the working directory

Command
multiQC
$ multiqc .
Expected MultiQC heat map
Adapter sequence removal via Cutadapt

This tool identifies and eliminates adapter sequences from raw reads, improving the quality and accuracy of downstream analysis. All unwanted sequences were thus removed using Cutadapt (https://cutadapt.readthedocs.io/en/stable/) by running:
Command
Running Cutadapt
$ cutadapt -a <adapter-sequence> -o SRRID_cut.fastq.gz SRRID.fastq.gz
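Since the alignment step below expects trimmed paired files (SRRID_1_cut.fastq.gz and SRRID_2_cut.fastq.gz), a paired-end run can be sketched as follows, with the adapter placeholders replaced by the sequences flagged by FastQC:
Command
Paired-end Cutadapt (illustrative sketch)
$ cutadapt -a <adapter-read1> -A <adapter-read2> -o SRRID_1_cut.fastq.gz -p SRRID_2_cut.fastq.gz SRRID_1.fastq.gz SRRID_2.fastq.gz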

ALIGNMENT WITH HUMAN REFERENCE GENOME
Retrieval and indexing of human reference genome for analysis

Indexing is a necessary pre-processing step for the reference genome so that the aligner can quickly seek potential alignment locations for the reads. This is required for alignment with Bowtie2 (https://bowtie-bio.sourceforge.net/bowtie2/index.shtml), where short reads are aligned against the indexed reference genome. The following commands were run in the terminal to obtain Bowtie2 outputs.
Command
Unzipping the reference genome file
$ gunzip GRCh37_latest_genomic.fna.gz

Command
Indexing reference genome
$ bowtie2-build -f GRCh37_latest_genomic.fna hg19   # index the unzipped FASTA; "hg19" is the index basename

Alignment of indexed human reference genome against exome datasets

Once indexing is complete, alignment was performed using Bowtie2.
Command
Gapped alignment command with human reference genome (hg19 in this case)
$ bowtie2 -x hg19 -1 SRRID_1_cut.fastq.gz -2 SRRID_2_cut.fastq.gz -S SRRID.sam   # -x takes the index basename used at build time (hg19, hg38, etc.)
$ head SRRID.sam
Running Bowtie2 with the specified parameters allows alignment of paired-end sequencing reads to a reference genome, enabling downstream analysis such as variant calling, differential gene expression, or identification of genomic features specific to colorectal cancer.

SAM TO BAM CONVERSION
SAM (Sequence Alignment/Map) to BAM (Binary Alignment/Map) conversion

To convert the alignment output into a compact binary form that serves as appropriate input for the next steps, SAM to BAM conversion was performed with Samtools (http://www.htslib.org/). The code used was:
Command
SAM to BAM file conversion
$ samtools view -bS SRRID.sam > SRRID.bam

Sorting BAM

BAM sorting was performed using the command:
Command
BAM sorting
$ samtools sort SRRID.bam -o SRRID_sorted_reads.bam   # output name matches the MarkDuplicates input below
BAM sorting supports quick retrieval of alignments and yields a compact file for further analysis.
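As an optional sanity check at this point (a minimal sketch using a standard Samtools subcommand), overall alignment statistics can be inspected with:
Command
Quick alignment statistics (illustrative)
$ samtools flagstat SRRID_sorted_reads.bam   # prints totals and percentages of mapped and properly paired reads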
Identifying and marking duplicate reads in BAM file

Duplicate reads may arise during sequencing due to several factors, including technical biases, PCR amplification artifacts and library preparation protocols. These may skew downstream analyses and impact result accuracy. Marking duplicates with the “MarkDuplicates” function of the Picard (https://broadinstitute.github.io/picard/) tool is therefore essential to address this issue.
Command
Marking Duplicates
$ java -jar picard.jar MarkDuplicates INPUT=SRRID_sorted_reads.bam OUTPUT=SRRID_dedup_reads.bam METRICS_FILE=SRRID_metrics.txt

Sorting the duplicate-marked BAM file

To improve compression efficiency, another round of BAM sorting was performed on the duplicate-marked file.
Command
BAM sorting of the duplicate-marked file
$ samtools sort SRRID_dedup_reads.bam -o SRRID_sorted_dedup_reads.bam

Computing and collecting alignment summary metrics from BAM file

To obtain various statistics and metrics associated with alignment of sequencing reads to a reference genome, the “CollectAlignmentSummaryMetrics” function was used. The command used to run this function was:
Command
Collecting Alignment Summary Metrics
$ java -jar picard.jar CollectAlignmentSummaryMetrics R=ref.fa I=SRRID_sorted_dedup_reads.bam O=SRRID_alignment_metrics.txt
ref.fa: fasta sequence file of the human reference genome
Estimating and collecting the insert size distribution of paired-end reads from BAM file

The primary purpose of collecting insert sizes is to check the distribution and characteristics of DNA fragment sizes in a sequencing library. This enables collection and visualization of insert size metrics and distributions, which are useful for library quality scrutiny, read alignment, fragment size selection and quality control assessments in several genomic analyses. For this purpose, the “CollectInsertSizeMetrics” function was used in the present workflow. The distribution of the paired-end reads from BAM was analyzed using the command:
Command
Collecting Insert Size Metrics
$ java -jar picard.jar CollectInsertSizeMetrics INPUT=SRRID_sorted_dedup_reads.bam OUTPUT=SRRID_insert_metrics.txt HISTOGRAM_FILE=SRRID_insert_size_histogram.pdf

Calculating coverage depth at each position in a sorted and duplicate-marked BAM file

The “samtools depth” function was used to examine the depth of coverage at every position in the reference genome based on the aligned reads. Computing and estimating depth of coverage is important for assessing sequence data quality, variant calling, structural and gene expression analysis, and overall data exploration, all of which become much easier with the samtools depth function. Thus, to better understand and obtain insights into the coverage depth, this function was run via:

Command
Determining coverage depth of the sorted, duplicate-marked BAM
$ samtools depth -a SRRID_sorted_dedup_reads.bam > SRRID_depth_out.txt
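The output has three columns (chromosome, position, depth); as an illustrative sketch, a rough mean coverage can be computed from it with a one-liner such as:
Command
Mean coverage from the depth output (illustrative)
$ awk '{sum+=$3} END {if (NR>0) print "Mean depth =", sum/NR}' SRRID_depth_out.txt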

Add or modify read group information in BAM file

To assign or update read group information for the sequence reads in a BAM file, the AddOrReplaceReadGroups function was employed, via "picard.jar AddOrReplaceReadGroups". This step enables sample identification, data tracking, sequence data differentiation, data integrity checks and overall compatibility of the sequence reads, ensuring that the reads are properly annotated and organized and thereby facilitating meaningful data analyses.

This step was run using the following code in the terminal:
Command
Adding/modifying read group info
$ java -jar picard.jar AddOrReplaceReadGroups I=SRRID_sorted_dedup_reads.bam O=SRRID_output.bam RGID=4 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=20

Indexing BAM file- necessary before variant calling

Prior to variant calling, another indexing of the final BAM output file was carried out to allow faster retrieval of specific genomic regions, enhanced visualization, and to facilitate random access.
Command
Indexing BAM file
$ samtools index SRRID_output.bam
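Optionally, the final BAM can be screened for formatting problems before variant calling; a minimal sketch using Picard's ValidateSamFile is:
Command
Validating the final BAM (illustrative)
$ java -jar picard.jar ValidateSamFile I=SRRID_output.bam MODE=SUMMARY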

VARIANT CALLING - GERMLINE & SOMATIC
Calling of germline variants

The below sections detail preliminary steps followed for calling germline variants.
Mutations from the sequenced data were identified and processed, and the variants were called, using Picard (https://broadinstitute.github.io/picard/) and GATK (The Genome Analysis Toolkit, https://github.com/broadinstitute/gatk). The HaplotypeCaller function was used to call the germline variants.
The purpose of using the "HaplotypeCaller" function is to identify and call genetic variants, including single nucleotide variants (SNVs), insertions, deletions, and complex variants, based on the sequencing data and the provided reference genome. The following command was employed to call the variants:
Command
Germline variant calling
$ gatk HaplotypeCaller -R ref.fa -I SRRID_output.bam -O SRRID_raw_variants.vcf
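Note that GATK expects the reference FASTA to be accompanied by an index and a sequence dictionary; if these are missing, they can be generated beforehand (a standard preparation step, sketched here for completeness):
Command
Preparing the reference for GATK (illustrative)
$ samtools faidx ref.fa                     # creates ref.fa.fai
$ gatk CreateSequenceDictionary -R ref.fa   # creates ref.dict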

Selection of SNPs from raw variants

The "SelectVariants" function with the "-select-type SNP" option was utilized to filter and extract only the SNPs from the original VCF file. This step enabled filtering and data refining to focus on SNPs, simplified further analyses, optimized computational resources, and ensured compatibility with SNP-specific analysis tools or pipelines. For working with larger datasets, this function works best for SNP-centric analyses. Therefore, to select the SNPs, the code run was:
Command
Selection of variants from raw mutants
$ gatk SelectVariants -R ref.fa -V SRRID_raw_variants.vcf --select-type-to-include SNP -O SRRID_raw_snps.vcf

Selection of INDELs from raw variants

The "SelectVariants" function with the "--select-type-to-include INDEL" option was used to filter and extract only the INDELs from the original VCF file, refining the data to focus on INDELs. This step was run prior to variant filtration.
Command
INDELs selection from raw mutants
$ gatk SelectVariants -R ref.fa -V SRRID_raw_variants.vcf --select-type-to-include INDEL -O SRRID_raw_indels.vcf

Variant filtration

To remove potential false positives and retain only the high quality variants, a series of filters were applied to the raw SNPs using the following codes:
Command
Calling SNPs
$ gatk VariantFiltration -R ref.fa -V SRRID_raw_snps.vcf -O SRRID_filtered_snps.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 60.0" -filter-name "MQ_filter" -filter "MQ < 40.0" -filter-name "SOR_filter" -filter "SOR > 4.0" -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"

Command
Calling INDELs
$ gatk VariantFiltration -R ref.fa -V SRRID_raw_indels.vcf -O SRRID_filtered_indels.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 200.0" -filter-name "SOR_filter" -filter "SOR > 10.0"

Excluding filtered variants

To filter out the variants that do not meet specific quality criteria (as defined by the user), the “--exclude-filtered” option of “SelectVariants” was used. This removed variants that failed the filters and thereby enhanced data quality, prioritized high-confidence variants, decreased the number of false positives, improved compatibility between datasets and optimized computational resources. This is a crucial step in the data refinement process.
Command
Excluding SNPs
$ gatk SelectVariants --exclude-filtered -V SRRID_filtered_snps.vcf -O SRRID_bqsr_snps.vcf

Command
Excluding INDELs
$ gatk SelectVariants --exclude-filtered -V SRRID_filtered_indels.vcf -O SRRID_bqsr_indels.vcf


Recalibration of base quality

To augment the accuracy and reliability of the called variants, a base recalibration was carried out by correcting the systematic errors, addressing biases, and providing more accurate base quality scores. This step is essential in ensuring that the sequencing data quality is maintained appropriately.
Command
Base recalibration- Step 1
$ gatk BaseRecalibrator -R ref.fa -I SRRID_output.bam --known-sites SRRID_bqsr_snps.vcf --known-sites SRRID_bqsr_indels.vcf -O SRRID_recal_data.table

Command
Base recalibration- step 2
$ gatk ApplyBQSR -R ref.fa -I SRRID_output.bam -bqsr SRRID_recal_data.table -O SRRID_recal_reads.bam 

Command
Base recalibration- step 3
$ gatk BaseRecalibrator -R ref.fa -I SRRID_recal_reads.bam --known-sites SRRID_bqsr_snps.vcf --known-sites SRRID_bqsr_indels.vcf -O SRRID_post_recal_data.table

Command
Base recalibration- step 4
$ gatk AnalyzeCovariates -before SRRID_recal_data.table -after SRRID_post_recal_data.table -plots SRRID_recalibration_plots.pdf

Command
Base recalibration- step 5
$ gatk HaplotypeCaller -R ref.fa -I SRRID_recal_reads.bam -O SRRID_raw_variants_recal.vcf

Selection of variants from recalibrated VCF files

To filter and extract only the SNPs (and INDELs) from the recalibrated VCF files, the “SelectVariants” function was used with the “--select-type-to-include” option. This was useful since the data the authors were working with were recalibrated variant sets.
Command
SNP selection
$ gatk SelectVariants -R ref.fa -V SRRID_raw_variants_recal.vcf --select-type-to-include SNP -O SRRID_raw_snps_recal.vcf

Command
INDEL selection
$ gatk SelectVariants -R ref.fa -V SRRID_raw_variants_recal.vcf --select-type-to-include INDEL -O SRRID_raw_indels_recal.vcf

Variant filtration step

A series of filters were then applied to the raw SNPs to remove the potential false positives and retain high-quality variants. The following filters were applied:
Command
Filtration
$ gatk VariantFiltration -R ref.fa -V SRRID_raw_snps_recal.vcf -O SRRID_filtered_snps_final.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 60.0" -filter-name "MQ_filter" -filter "MQ < 40.0" -filter-name "SOR_filter" -filter "SOR > 4.0" -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
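After filtration, failing variants remain in the file with a non-PASS FILTER value. If only passing records are wanted downstream, they can be extracted with, for example (an illustrative sketch using bcftools):
Command
Extracting PASS variants (illustrative)
$ bcftools view -f PASS SRRID_filtered_snps_final.vcf > SRRID_pass_snps_final.vcf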

Calling of somatic variants

The below sections detail preliminary steps followed for calling somatic variants.
Tumor samples with matched normal ones

Somatic variants were called using Mutect2 (https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2), a tool within the GATK pipeline designed specifically for somatic variant calling.
The purpose of this step is to identify and call somatic variants, which are genetic variations specific to the tumor sample compared to the matched normal sample. This was first carried out via:
Command
Calling somatic variants
$ gatk Mutect2 -R ref.fa -I tumor.bam -I normal.bam -normal normal_sample_name --germline-resource af-only-gnomad.vcf.gz --panel-of-normals pon.vcf.gz -O somatic.vcf.gz
To identify and call somatic variants for more than two samples (i.e., variations specific to the tumor samples compared to their matched normal ones), the following command was used:
Command
Somatic variant calling for multiple samples
$ gatk Mutect2 -R reference.fa -I tumor1.bam -I tumor2.bam -I normal1.bam -I normal2.bam -normal normal1_sample_name -normal normal2_sample_name --germline-resource af-only-gnomad.vcf.gz --panel-of-normals pon.vcf.gz -O somatic.vcf.gz


Tumor only mode- without normal samples

To detect and call somatic variants for a tumor sample without a matched normal, tumor-only mode was used:
Command
Somatic variant calling for only tumor samples
$ gatk Mutect2 -R ref.fa -I sample.bam -O single_sample.vcf.gz
Additionally, to call somatic variants while applying filters depending on the population frequencies and a panel of normals, the following codes were run:
Command
Somatic variant calling with filters
$ gatk Mutect2 -R ref.fa -I sample.bam --germline-resource af-only-gnomad.vcf.gz --panel-of-normals pon.vcf.gz -O single_sample.vcf.gz

Filtering of somatic variants

Filtering variants is an essential step before the VCF files are prepared, since it ensures that the retained variants are more likely to be true somatic mutations. In the present work, the mutations were filtered to further help in understanding the biology of colorectal cancer.
Command
Filtration of variants
$ gatk FilterMutectCalls -R ref.fa -V somatic.vcf.gz -O somatic_filtered.vcf.gz   # the reference used for calling is required
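As a quick check (an illustrative sketch), summary counts of the filtered calls can be obtained with bcftools:
Command
Summarizing filtered somatic calls (illustrative)
$ bcftools stats somatic_filtered.vcf.gz | grep "number of"   # counts of records, SNPs, indels, etc.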

The variant files obtained from both germline and somatic calling were then used for further analysis as described in the following steps.
VCF FILE PREPARATION, ANNOTATION, POST-PROCESSING
VCF file preparation, annotation and post-processing of variants

Once both germline and somatic mutations were identified and called, the VCF files for both sets of variants were prepared using bcftools for further analysis. Annotation and processing of these variants were then carried out using SnpEff (https://pcingola.github.io/SnpEff/), wANNOVAR (https://wannovar.wglab.org/) and VEP (Variant Effect Predictor) (https://asia.ensembl.org/Tools/VEP).
Command
VCF file prep- step 1
$ vcf-validator SRRID_vcf_file.vcf

Command
VCF file prep- step 2
$ bcftools view -i 'DP>10' SRRID_vcf_file.vcf > SRRID_filtered_vcf_file.vcf

Command
VCF file prep- step 3
$ bcftools norm -f ref.fasta SRRID_vcf_file.vcf -o SRRID_normalized_vcf_file.vcf
Snapshot of results obtained from normalizing VCF files using bcftools
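If downstream tools require a compressed, indexed VCF, the normalized file can be prepared with bgzip and tabix (a minimal sketch; both ship with HTSlib):
Command
Compressing and indexing the VCF (illustrative)
$ bgzip SRRID_normalized_vcf_file.vcf            # produces SRRID_normalized_vcf_file.vcf.gz
$ tabix -p vcf SRRID_normalized_vcf_file.vcf.gz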
Annotation of variants
Command
Variant annotation
$ java -jar snpEff.jar GRChXX.86 SRRID_vcf_file.vcf > SRRID_annotated_vcf_file.vcf
$ java -jar snpEff.jar -v <snpeff_db> SRRID_filtered_snps_final.vcf > SRRID_filtered_snps_final.ann.vcf
Results obtained from SnpEff for variant annotation, using filtered SNPs as the input

Post-processing of variants using wANNOVAR and VEP

Variant post-processing was performed using two tools: wANNOVAR and VEP.

Step 1- wANNOVAR:
This tool is accessible from: https://wannovar.wglab.org/.
Input information to be filled in wANNOVAR to obtain the post-processing results
The default settings for running wANNOVAR
Some expected results for wANNOVAR are provided below:
Snapshots of expected wANNOVAR results
Step 2- Variant Effect Predictor (VEP)
The VEP tool is accessible from: https://asia.ensembl.org/Tools/VEP

Input data settings for VEP
Additional configurations and the type of transcript database to be used for VEP analysis
Expected result for VEP

CONCLUSIONS AND FUTURE PERSPECTIVES
Colorectal cancer has several diagnostic and treatment options to combat it; however, a delay in disease detection is life-threatening. Therefore, this protocol emphasizes the identification and analysis of somatic and germline variants in colorectal cancer exome datasets retrieved from publicly available databases such as NCBI-SRA. The steps describe the various tools required to run this pipeline, with the codes required to run each tool. Moreover, although this protocol is described for colorectal cancer exomes, the same set of codes and steps may be followed for the overall analysis of any cancer exome when the desired outcome is to obtain and analyze somatic and germline variants separately. This computational pipeline requires further in-vitro and in-vivo validation; however, its outcomes offer prospects for similar studies that are crucial for designing a cure, prognosis or treatment strategy for colorectal cancers based on the scrutinized variants.

The outcomes of this protocol can be carried forward into building and designing decision support systems using artificial intelligence and machine learning (AI/ML), so that the identified somatic and germline mutants can be used for early colorectal cancer prediction through a prediction model. This will aid clinicians, researchers and diagnosticians in making better-informed decisions on treatment strategies.

Protocol references
  1. Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic acids research. 2010 Nov 8;39(suppl_1):D19-21.
  2. Andrews S. FastQC: a quality control tool for high throughput sequence data.
  3. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8.
  4. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011 May 2;17(1):10-2.
  5. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature methods. 2012 Apr;9(4):357-9.
  6. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.
  7. Cingolani P, Patel VM, Coon M, Nguyen T, Land SJ, Ruden DM, Lu X. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Frontiers in genetics. 2012 Mar 15;3:35.
  8. Yang H, Wang K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nature protocols. 2015 Oct;10(10):1556-66.
  9. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome biology. 2016 Dec;17(1):1-4.
  10. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research. 2010 Sep 1;20(9):1297-303.