Using Sanger sequencing, the Human Genome Project expended approximately USD $2.7 billion and took more than 10 years to produce the first human genome sequence. Today, a human genome can be sequenced in a matter of days for less than USD $1000 on a single next-generation sequencing (NGS) machine. This step change in throughput and per-base cost has transformed the use of DNA sequencing in biomedical research and is being translated in an expanding number of ways into medicine. NGS is increasingly being applied to understanding and managing infectious diseases. This includes the sequencing of microbial genomes for the purposes of laboratory identification of infectious agents [1], detection of antibiotic resistance markers [2], and the public health surveillance of epidemiological clusters and outbreaks [3]. Examples include its deployment in public health surveillance and control of community cases of Escherichia coli [4], Campylobacter jejuni [5], Legionella pneumophila [6] and Mycobacterium tuberculosis [7] disease, or global and regional epidemics caused by influenza [8], Ebola [9], and Zika [10] viruses. It has also been utilised to track the source and spread of healthcare-associated infections caused by Staphylococcus aureus [11], Pseudomonas aeruginosa [12], Acinetobacter baumannii [13], and Enterococcus faecium [14] in order to guide infection prevention and control in hospitals.
In addition to its whole genome (WGS), whole exome (WES), transcriptome (RNA-Seq), bisulphite methylome, and metagenomic sequencing capabilities, NGS can be directed to the detection of specific genes or mutations associated with human disease through targeted-panel amplicon screening. However, barriers remain with regard to establishing NGS in a laboratory for the first time and this hinders its uptake in clinical microbiology and other settings. One of these challenges is the lack of a simplified step-by-step protocol that can be picked up by laboratory personnel with no prior training or experience in NGS and used to generate reliable, high-quality sequence data. Illumina dye-sequencing is currently considered the gold standard internationally in terms of read depth and base-calling accuracy, genome coverage, scalability, and the range of sequencing applications it delivers.
In this work, we produced an easy-to-follow, step-by-step NGS protocol with consistent genome coverage and average read depth that was applicable to a range of bacterial pathogens, namely Gram-positive vancomycin-resistant Enterococcus faecium, Gram-negative non-typeable Haemophilus influenzae, and acid-fast, high-GC content Mycobacterium tuberculosis. This protocol can be used to generate Illumina-based WGS data for clinical isolates of bacterial pathogens of importance to human health.
Figure 1 is a graphical summary of the process of obtaining whole genome sequence data from bacterial culture. This wet laboratory procedure generated FastQ reads from the sequencer within three days of starting. We modified a number of the DNA extraction steps to obtain a sufficient quantity of contamination-free template. Similarly, we replaced library normalization plates and Nextera XT tagment amplicon (NTA) plates with conventional polymerase chain reaction (PCR) tubes, which may represent a cost-effective alternative. In addition, we have recommended the use of equal DNA concentrations of each library during library normalization to ensure better coverage and minimize bias. Simplification of bacterial NGS may assist in its uptake by beginner users.
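The recommendation to bring each library to an equal concentration can be illustrated with a simple dilution calculation (C1 × V1 = C2 × V2). The following is a minimal sketch only: the library names, stock concentrations, target concentration, and final volume are hypothetical values chosen for illustration and are not taken from the protocol.

```python
def dilution_volumes(stock_conc, target_conc, final_volume):
    """Return (library_volume, diluent_volume) needed to reach target_conc
    in final_volume, using C1 * V1 = C2 * V2. Units must be consistent."""
    if stock_conc < target_conc:
        raise ValueError("library is already below the target concentration")
    library_volume = target_conc * final_volume / stock_conc
    return library_volume, final_volume - library_volume


# Hypothetical library concentrations (nM); not measured values.
libraries = {"VRE1": 8.0, "NTHi1": 4.0, "MTBC1": 16.0}
TARGET_NM = 2.0   # assumed common target concentration
FINAL_UL = 20.0   # assumed final volume per diluted library

plan = {
    name: dilution_volumes(conc, TARGET_NM, FINAL_UL)
    for name, conc in libraries.items()
}

for name, (library_ul, diluent_ul) in plan.items():
    print(f"{name}: {library_ul:.2f} uL library + {diluent_ul:.2f} uL diluent")
```

Pooling equal volumes of the diluted libraries would then contribute each isolate at the same concentration, which is the intent of the normalization step described above.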
A consensus sequence was generated in Geneious for each of the isolates analysed. The Geneious report provided the percentage coverage of the test sequence relative to the reference genome and the mean read depth (Table 1). Each contiguous sequence is viewable in Geneious and can be analysed for coverage with respect to the reference genome. Quality control checks of raw sequence data were also performed using FastQC [22]. This freely available software provided information regarding per base sequence content and quality, per base and per sequence GC content, and highlighted sequence quality parameters.
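To make the quality metrics mentioned above concrete, the sketch below computes GC content and mean Phred quality per read from FASTQ records. It is an illustration only and is not how FastQC is implemented; it assumes four-line FASTQ records and Phred+33 quality encoding, and the two example reads are invented.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:          # end of file (or blank line) stops parsing
            return
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def gc_percent(seq):
    """Percentage of G and C bases in a read."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def mean_phred(qual, offset=33):
    """Mean Phred score of a read, assuming Phred+33 ASCII encoding."""
    return sum(ord(ch) - offset for ch in qual) / len(qual) if qual else 0.0


# Invented example reads for illustration only.
EXAMPLE = "@read1\nGGCCAATT\n+\nIIIIIIII\n@read2\nATATATGC\n+\n!!!!IIII\n"

stats = [
    (read_id, gc_percent(seq), mean_phred(qual))
    for read_id, seq, qual in parse_fastq(io.StringIO(EXAMPLE))
]

for read_id, gc, quality in stats:
    print(f"{read_id}: GC {gc:.1f}%, mean Phred {quality:.1f}")
```

For real data, the same functions could be applied to a file opened with `gzip.open(path, "rt")`, since sequencer output is typically compressed.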
Initial typing analysis
We used open-source databases to analyze the sequence data. For example, Geneious-mapped contiguous sequences were imported into PubMLST (https://pubmlst.org/) for sequence typing of Haemophilus influenzae and vancomycin-resistant Enterococcus faecium. This can also be achieved using raw fastq reads in the MLST profiling tool from the Center for Genomic Epidemiology (CGE) database (http://www.genomicepidemiology.org/). The ResFinder tool (https://cge.cbs.dtu.dk/services/ResFinder/) was used to identify acquired antimicrobial resistance genes from raw fastq files. For example, PubMLST typing classified NTHi 1 as sequence type 46 and ResFinder did not detect the presence of any antimicrobial resistance-determining mutations.
Mycobacterium tuberculosis complex (MTBC) raw fastq.gz files were uploaded to the TGS-TB database (https://gph.niid.go.jp/tgs-tb/) to predict drug susceptibility, in silico spoligotype, lineage type, and phylogenetic classification. This database also enabled detection of IS6110 insertion sites and 43 loci for variable number tandem repeat (VNTR) typing. The drug resistance profiles of the MTBC isolates were further confirmed using the PhyResSE database (http://phyresse.org/). For example, TGS-TB identified MTBC1 as a drug-susceptible Mycobacterium bovis isolate.
Coverage refers to the percentage of reference genome bases covered by mapped sequence reads. Mean read depth indicates the mean number of times each base is mapped by a sequence read. Reference genomes used were Enterococcus faecium ST18 DO (TX16) (accession number NC_017960), Haemophilus influenzae 86-028NP (nontypeable) (accession number NC_007146), and Mycobacterium tuberculosis H37Rv (accession number NC_000962). VRE, vancomycin-resistant Enterococcus faecium; NTHi, non-typeable Haemophilus influenzae; MTBC, Mycobacterium tuberculosis complex.
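The two definitions above can be expressed as a short calculation over per-base read depths. This is a minimal sketch under stated assumptions: the depth values are invented, a base is counted as covered when at least one read maps to it, and the mean is taken over all reference positions. Geneious may apply different conventions when producing its report.

```python
def coverage_and_mean_depth(depths, min_depth=1):
    """Return (coverage_percent, mean_depth) for a list of per-base read depths.

    coverage_percent: share of reference positions with depth >= min_depth.
    mean_depth: average depth across all reference positions.
    """
    if not depths:
        raise ValueError("no reference positions supplied")
    covered = sum(1 for depth in depths if depth >= min_depth)
    coverage_percent = 100.0 * covered / len(depths)
    mean_depth = sum(depths) / len(depths)
    return coverage_percent, mean_depth


# Invented per-base depths for a 10-base toy reference.
toy_depths = [0, 0, 5, 10, 10, 10, 5, 0, 20, 40]

coverage_percent, mean_depth = coverage_and_mean_depth(toy_depths)
print(f"coverage: {coverage_percent:.1f}%, mean read depth: {mean_depth:.1f}x")
```

In this toy case seven of ten positions are covered and 100 mapped bases are spread over ten positions. For a real alignment, per-base depths could be exported from the mapping software and passed to the same function.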
Possible problems and their troubleshooting solutions are listed in Table 2. A number of limitations associated with the protocol should be noted. First, effective results rely on the extraction procedure producing a sufficient quantity of genomic DNA. Second, analysis of sequences generated on an Illumina platform can be affected by the presence of highly repetitive regions. Third, depending on the output information sought, genome assembly can be influenced by the reference genome selected for the mapping of reads. Nevertheless, the protocol was effective in generating high-quality sequencing data for the range of bacterial species tested.
Acknowledgments
This research was supported by funding from the Royal Hobart Hospital Research Foundation (17-104) and the Tasmanian Community Fund (36Medium00014).