Protocol Citation: Angelo Sante Varvara, Ilenia Urso, Sharon Natasha Cox, Graziano Pesole 2024. Bioinformatic pipeline for analysing variations in long-reads reconstructed human genomes. protocols.io https://dx.doi.org/10.17504/protocols.io.36wgqnoz3gk5/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Recent advances in sequencing technologies have paved the way for personalized medicine and clinical genomics, providing unprecedented insights into the genetic basis of human health and disease. High-throughput sequencing is now recognized as an effective diagnostic tool for patients affected by rare diseases but the use of either Whole Exome Sequencing (WES) or Whole Genome Sequencing (WGS) technologies has shown to be uninformative in many cases. On the contrary, the recent introduction of long-read technologies can resolve genomic regions inaccessible to short-read sequencing, but a reliable benchmark assessment is missing.
Following a long and intense benchmarking activity, here we present our bioinformatic pipeline based on Oxford Nanopore Technology (ONT) long reads aimed at shading lights to the stages of alignment, SNVs, indels, and SV calling, and modified bases analysis.
Please kindly note that whilst running Dorado without a GPU is technically possible, it is strongly inadvisable as basecalling will be much slower when running purely on CPU.
Dorado is heavily optimized for Nvidia A100 (ampere) and H100 (hopper) GPUs and will deliver maximum performance on nodes containing these GPUs.
Basecalling and alignment
Basecalling and alignment
11h
11h
Basecalling raw data in pod5 format with Dorado basecalling.
Dorado is a high-performance open-source basecaller for Oxford Nanopore reads.
First, let's download the latest models for our needs with Dorado download.
Then, we can run Dorado basecaller on our raw data in pod5.
If your raw data are in fast5 format, it is recommended to convert them in pod5 with pod5 convert fast5 tool to reach optimal performances.
For our test, we are going to use human Genome assembly GRCh38 as a reference, calling both 5mC and 5hmC modified bases, using the most updated and accurate models.
Following the basecalling step, create a sequencing summary with Dorado summary, it outputs a tab-separated file with read-level sequencing information from the BAM file generated during basecalling.
9h
Sort and index the resulting alignment with samtools.
It's time to sort the alignment in bam format with samtools sort.
Indexing the sorted bam is mandatory for further steps.
2h
Alignment statistics
Alignment statistics
1h
1h
Create alignment statistics with NanoStat, samtools, and pycoQC.
NanoStat highlights alignment information as, for instance, N50 or read length and read quality.
Samtools utilities give us insights on coverage and mapping values.
PycoQC produces an html format file with tons of charts regarding basecalling progress and outcomes.
1h
Variant calling
Variant calling
2h 35m
2h 35m
SNV and indels calling with Clair3.
Variant calling is performed with Clair3 and we strongly advise these parameters, with the same model path used for the basecalling stage.
2h
Polishing calls with bcftools.
This step depends on your alignment statistics. We usually advise removing all "lowQual" calls as well as variants with depth below 5 (or 10 if your coverage is above 30x).
Furthermore, bcftools norm command will split all multi-allelic variants into multiple rows and furtherly check calls with the reference throwing out redundancies or errors, producing a vcf file with improved accuracy.
Lastly, it can probably be helpful to add further info to the final vcf file, especially in pathogenic studies. Annovar and SnpEff are both powerful tools that can fill variant calls with annotations, as those from ClinVar, Gnomad and dbSNP.
10m
SV calling with sniffles2.
We strongly advise using the bed files found in sniffles GitHub repository to achieve better performances while calling structural variants in repetitive regions.
Calls can be subsequently filtered and polished depending on our needs with bcftools.
10m
Additionally, sniffles2 developers offer a python program that generates a wide range of plots for single and multi-sample VCF files, such as SV size & type distribution, and genotype frequency.
5m
SV annotation and prioritization with SvAnna.
Structural variants need to be annotated as well as, in case of pathogenic analysis, prioritized. We benchmarked several tools for this purpose, but we determined SvAnna to be the most suitable for our goals, taking as inputs either sniffles2 vcf and our HPOs.
The output in html is specifically useful in order to have a first look at the most valuable structural variants in terms of pathogenic relevance.
10m
Phasing
Phasing
3h 35m
3h 35m
Phasing heterozygous variants with whatshap phase.
This tool allows phasing variants (called via clair3), annotating the info directly in our vcf file.
Please note that also structural variants can be phased with whatshap phase in the same way.
30m
Phasing assessment with whatshap stats.
Assessing the amount of phased heterozygous variants is crucial, and we can achieve this with whatshap stats. It will output various statistics which can vary a bit depending on the alignment quality (coverage, N50, etc), but we can safely conclude that a percentage above 90% is expected in 30X covered genomes.
5m
Tagging reads by haplotype with whatshap haplotag.
Quite often it is interesting to show the reads as well as the variants. For that, we can run the whatshap haplotag subcommand on the phased VCF file. It tags each read in a BAM file with HP:i:1 or HP:i:2 depending on which haplotype it belongs to, and also adds a PS tag that describes in which haplotype block the read is.
The command below creates a BAM file 'haplotagged.bam' with the tagged reads, which you can open in IGV, in order to visualize either haplotype assignment and modified bases.
3h
Tips for trios
Tips for trios
In case we had a trio of samples (child, parent1, and parent2) the above stages of variant calling and phasing can be completed with a single command thanks to the Clair3-Nova tool. With few lines of code, we would be able to perform SNVs and indels calling, phasing heterozygous variants, and haplotagging input bam files with haplotype infos, the two latter steps ran via whatshap if we specify the tool path.
If needed, Clair3-Nova can output also gvcfs files.
12h
Modified bases
Modified bases
3h
3h
Modified bases calling and stats with modkit.
Modified bases can be called with modkit directly from the modBAM file, depending on the type of model chosen during the basecalling stage. Modkit can effectively count the following modified bases: 5mC, 5hmC, 6mA. It outputs a bedMethyl file, a table with the counts of base modifications from every sequencing read over each reference genomic position. No reference sequence is required (hence we strongly advise using it) and for user convenience, the counting process can be modulated using several additional transforms and filters, for instance the most basic of these is to report only counts from reference CpG dinucleotides.
It is also possible to produce a summary to have an overview of the mod tags that are present in a BAM and get basic statistics. The default output is a totals table (designated by '#' lines) and a modification calls table. We ran the command with --no-sampling flag to count modified bases throughout the genome but if you prefer, you can choose a specific region or a specific number of reads to be randomly picked.
Furthermore, modkit offers the possibility to perform differential methylation scoring. This can be done using two subcommands: modkit dmr pair to compare regions in a pair of samples (for example, tumor and normal, or control and experiment), or modkit dmr multi to compare regions between all pairs of samples (for example, a trio sample set ). Before performing this step, the input data (i.e., bedMethyl files created in the previous step) must be compressed with bgzip and indexed with tabix. Additionally, you must specify a BED file containing the specific regions to be compared, such as CpG Islands in hg38. This BED file should contain 4 columns: chrom, chromStart, chromEnd, and CpGname. Keeping the name column is optional and we can also specify the modified base in which we are interested in, for example, for CpG islands we can specify "--base C".
The modkit dmr multi command performs pairwise comparisons across all regions specified in the regions BED file for more than two samples. The data preparation is the same as described in the previous section, applied to each sample. An example command for a trio is the following:
The latest version of modkit introduces the option "--segments", which allows modkit dmr to dynamically identify and generate regions of differential methylation during the analysis using a hidden Markov model, instead of relying on pre-defined regions. This can provide more flexible and potentially more accurate identification of methylation patterns in your data
3h
Further analysis
Further analysis
Further analyses are expected to be performed according to the specific aim of your study.