Plasmid Sequence Analysis from Long Reads

David A Eccles

Mar 13, 2024

Version 9

Plasmid Sequence Analysis from Long Reads V.9

DOI

dx.doi.org/10.17504/protocols.io.36wgq4n5yvk5/v9

David A Eccles¹

¹Malaghan Institute of Medical Research (NZ)

David A Eccles

Malaghan Institute of Medical Research (NZ)

DOI: dx.doi.org/10.17504/protocols.io.36wgq4n5yvk5/v9

Protocol Citation: David A Eccles 2024. Plasmid Sequence Analysis from Long Reads. protocols.io https://dx.doi.org/10.17504/protocols.io.36wgq4n5yvk5/v9Version created by David A Eccles

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: March 13, 2024

Last Modified: March 13, 2024

Protocol Integer ID: 96611

Keywords: plasmid, nanopore, sequencing, rapid barcoding, data analysis, assembly

Abstract

This protocol demonstrates how to assemble reads from plasmid DNA, and generate a circularised and non-repetitive consensus sequence.

For plasmid sample preparation for nanopore sequencing, see here. The most recent version of our rapid barcoding kit protocol at the time this protocol was written is attached below:

DAE_Rapid_Barcoding_Kit_Protocol_2024-Feb-11.pdf218KB  
 
At the moment, this protocol uses Canu to de-novo assemble high-quality single-cut reads.

Input(s): 
demultiplexed fastq files (see protocol Demultiplexing Nanopore reads with LAST). I've noticed that the default demultiplexing carried out by Guppy and Dorado has some issues with chimeric reads, which can affect assembly.

Output(s):
Consensus sequence per barcode as a fasta file

Before start

When preparing plasmid DNA for sequencing, I follow one piece of advice that Dr. Divya Mirrington (ONT) gave me about pooling: create a pooled sample with volumes that you're  confident about, then remove an aliquot from that for adding the  remaining reagents [paraphrased]. I don't do any additional cleanup for  purified plasmid DNA; they tend to sequence very well on flow cells  without that cleanup.

My key requirement for plasmid  sequencing is a concentration of >20 ng/μl (ideally by qubit or  quantus). Concentrations over 100 ng/μl should be diluted down. If the  plasmids can be diluted down to all exactly the same concentration (but  at least 20 ng/μl), or they're all similar lengths, then that makes  creating an equimolar pool much easier.

When creating the pools, I  add at least 1 μl of the the sample that I need to add the least for  (might be more if the total volume is less than 11 μl), then add the  corresponding amount of other samples to create an equimolar pool. I  then take 11 μl from the pool to be used for rapid adapter addition.

If samples are equal concentration:
Add  amounts according to the length of the plasmid divided by the length of  the shortest plasmid. For example, if there are two plasmids, one with  length 3kb and another with length 35 kb, then add 1 μl of the 3kb  plasmid, and 35/3 = 11.7 μl of the 35kb plasmid.

If plasmids are roughly equal length (i.e. less than ~10% length difference between plasmids):​
Add  amounts according to the concentration of the highest-concentration  sample divided by the concentration of the plasmid. For example, if  there are three plasmids, one with concentration 50 ng/μl, one with  concentration 35 ng/μl, and one with concentration 20 ng/μl, then add 1  μl of the 50 ng/μl plasmid, 50/35 = 1.4 μl of the 35 ng/μl plasmid, and  2.5 μl of the 20 ng/μl plasmid. The total volume of this pool will be  less than 11 μl (1 + 1.4 + 2.5 = 4.9 μl), so in this case I would triple  these volumes (3 μl; 4.2 μl; 7.5 μl) to create a pool of > 11 μl.

If samples are different concentrations and different lengths:​
Make  the sample prep easier. Use multiple flow cells for different plasmid  length ranges. Dilute higher-concentration samples down to the  lowest-concentration samples. I don't recommend trying to do both  calculations at the same time to determine added volumes because there's  a much higher chance of getting added amounts wrong, leading to wasted  samples or wasted flow cells.

If you have a sufficiently-accurate pipetting robot, a sample sheet, and someone who is comfortable with equations:​
Pre-calculate amount to add assuming 12 μl total pool volume:
ratio = length / max(length) * max(conc) / conc​
volume = ratio * 12 / sum(ratio)

[That's my guess at the right equations; please let me know if there's an error]

Read file preparation

Demultiplex reads as per protocol Demultiplexing Nanopore reads with LAST.

If this has been done, then the following command should produce output without errors:

for bc in $(awk '{print $2}' demultiplexed/barcode_counts.txt);
  do ls demultiplexed/reads_${bc}.fq.gz;
done

Example output:

demultiplexed/reads_BC02.fq.gz
demultiplexed/reads_BC03.fq.gz
demultiplexed/reads_BC04.fq.gz
demultiplexed/reads_BC05.fq.gz
demultiplexed/reads_BC07.fq.gz
demultiplexed/reads_BC09.fq.gz

If the barcode_counts.txt file is missing, the output will look like this:

awk: fatal: cannot open file `barcode_counts.txt' for reading (No such file or directory)

If one or more of the barcode-demultiplexed files are missing, the output will look something like this:

demultiplexed/reads_BC02.fq.gz
demultiplexed/reads_BC03.fq.gz
demultiplexed/reads_BC04.fq.gz
ls: cannot access 'demultiplexed/reads_BC05.fq.gz': No such file or directory
ls: cannot access 'demultiplexed/reads_BC07.fq.gz': No such file or directory
demultiplexed/reads_BC09.fq.gz

If reads have been demultiplexed by MinKNOW (bearing in mind that there may be issues with chimeric reads), then the following approach should work to create the right input format:


mkdir demultiplexed;
# readlocation is the directory that contains barcode subdirectories
readLocation = "../*/fastq_pass"; # or "../*/pass" for Guppy
for x in $(ls ${readLocation}); do
  echo ${x};
  cat ${readLocation}/${x}/*.fastq | gzip > demultiplexed/reads_${x}.fq.gz;
  echo "1 ${x}" >> barcode_counts.txt;
done

Create a directory to store results files

cd demultiplexed
mkdir results

Downloading Bioinformatics Scripts

This protocol uses scripts that I have developed over the course of my bioinformatics work to simplify fastq and fasta file processing. They can be downloaded from my gitlab repository:

git clone https://gitlab.com/gringer/bioinfscripts.git scripts

Read filitering

It is assumed that you have some prior knowledge of the plasmid length. If the length is unknown, it can be determined by looking at the L50 summary statistics produced from fastx-length.pl when run on a representative sample:

scripts/fastx-length.pl reads_BC05.fq.gz > /dev/null

Example output:

Total sequences: 8288
Total length: 29.360421 Mb
Longest sequence: 19.739 kb
Shortest sequence: 80 b
Mean Length: 3.542 kb
Median Length: 4.936 kb
N10: 410 sequences; L10: 5.051 kb
N50: 2747 sequences; L50: 5.007 kb    <==
N90: 5310 sequences; L90: 2.947 kb
A/T homopolymer lengths:
  Minimum: 0 bp
  Maximum: 1905 bp
  Mean: 13 bp
  Median: 6 bp
  Mode: 131 bp
  10th percentile: 0 bp
  90th percentile: 37 bp
Predicted / modeled quality across all reads:
  Minimum: q2
  Maximum: q20
  Mean: q12
  Median: q13
  Mode: q20
  10th percentile: q7
  90th percentile: q16

In this case, the plasmid length is approximately 5kb, so a system variable is created containing that information, as well as a value half that length:

plasmidLength=5000
plasmidHalfLength=2500

Filter out any reads that are less than half of the target read length, then from the remaining reads keep keep 100 of highest-quality reads. These reads are fragmented into 1.5kb pieces with 750bp overlap.
for bc in $(awk '{print $2}' barcode_counts.txt);
  do scripts/fastx-fetch.pl -lengthFilter ${plasmidHalfLength} -count 100 reads_${bc}.fq.gz | \
    ~/scripts/normalise_seqlengths.pl -fraglength 1500 -overlap 750 | \
    gzip > Filtered_reads_${bc}.fq.gz;
done

Choice 1: mapping to a reference sequence

This protocol assumes the reference sequence has a name 'refName', and can be found at 'reference/refName.fa'. Change 'refName' here in the likely event that the sequence / file has a different name, then export the variable REFNAME to make subsequent steps easier.

export REFNAME="refName"

Create a LAST index for the reference sequence:

lastdb -uNEAR -R01 reference/${REFNAME}.fa reference/${REFNAME}.fa

Prepare a substitution matrix for barcode mapping. I recommend doing this by training LAST, and incorporating quality scores into the trained matrix:

last-train --matsym -Q 1 -P 10 reference/${REFNAME}.fa \
  Filtered_reads_BC*.fastq.gz

Copy the last few lines from the output into a text file called `plasmid.mat`

Here is a matrix that I have created from plasmid reads called using the guppy super-accuracy basecaller, v5.0.13:

#last -Q 1
#last -t4.39812
#last -a 17
#last -A 18
#last -b 4
#last -B 5
#last -S 1
# score matrix (query letters = columns, reference letters = rows):
       A      C      G      T
A      6    -45    -34    -46
C    -45      6    -46    -45
G    -34    -46      6    -46
T    -46    -45    -46      6
plasmid.mat  

This matrix has a moderate penalty for opening gaps (i.e. insertions and deletions), and a lower penalty for inserting them. Insertions are slightly less likely than deletions. It has a higher penalty for A/G transition variants, and a similar penalty for C/T transition variants as other substitution penalties.

Map reads to the reference and convert to BAM format using maf-convert and samtools:

for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  lastal -j 7 -P 10 -p plasmid.mat reference/${REFNAME}.fa \
      Filtered_reads_{bc}.fastq.gz | last-split -n -m 0.99 | last-postmask | \
    maf-convert sam | samtools view -h --reference reference/${REFNAME}.fa | \
    samtools sort > results/${bc}_vs_${REFNAME}.bam;
  samtools index results/${bc}_vs_${REFNAME}.bam;
done

Split reads into forward and reverse-mapped sequences, and tally up base counts at each location:


cat results/read_L50.txt | while read bc len;
  do echo ${bc} ${len};
  samtools view -b -F 0x10 results/${bc}_vs_${REFNAME}.bam | \
    samtools mpileup --reference reference/${REFNAME}.fa -B -Q 0 - | \
    ~/scripts/readstomper.pl -c > results/fwd_stomped_${bc}_vs_${REFNAME}.csv
  samtools view -b -f 0x10 results/${bc}_vs_${REFNAME}.bam | \
    samtools mpileup --reference reference/${REFNAME}.fa -B -Q 0 - | \
    ~/scripts/readstomper.pl -c > results/rev_stomped_${bc}_vs_${REFNAME}.csv
done

Create a visualisation of the base-stomped output:

for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  ~/scripts/stomp_plotter.r -f results/fwd_stomped_${bc}_vs_${REFNAME}.csv \
    -r results/rev_stomped_${bc}_vs_${REFNAME}.csv -prefix results/${bc}
done

Example output (`results/BC02_stompPlot_reference_seq.png`):

 
BC02 reads mapped to a reference sequence called 'reference_seq'

Choice 2: De-novo Canu Assembly

Run a Canu assembly on each read set, using default options:

for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${x};
  canu -nanopore Filtered_reads_${bc}.fq.gz -p ${bc} \
    -d results/canu_${bc} genomeSize=${plasmidLength};
done

Trim the assembled contigs based on Canu's trim recommendations for the longest assembled contig:

for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  samtools faidx results/canu_${bc}/${bc}.contigs.fasta
     $(grep '^>' results/canu_${bc}/${bc}.contigs.fasta | \
       perl -pe 's/^>(.*?) .*trim=([0-9]+)-([0-9]+)/$1.":".($2+1)."-".($3)/e') | \
     scripts/fastx-fetch.pl  |  \
   scripts/fastx-sort.pl -l | scripts/fastx-fetch.pl -c 1 | \
   perl -pe "s/^>/>${bc}_/" > results/longest_circTrimmed_${bc}.fasta;
done

Final Cleanup - Orientation and Rotation

If plasmid sequences contain a long polyA region, the assembled sequences can be oriented assuming that the longest homopolymer is this polyA region:

for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  if scripts/fastx-hplength.pl results/longest_circTrimmed_${bc}.fasta | tail -n 1 | grep "T";
    then scripts/fastx-rc.pl results/longest_circTrimmed_${bc}.fasta | \
      scripts/fastx-fetch.pl -oneline > results/oriented_longest_circTrimmed_${bc}.fasta;
    else scripts/fastx-fetch.pl -oneline results/longest_circTrimmed_${bc}.fasta \
      > results/oriented_longest_circTrimmed_${bc}.fasta;
  fi;
done

The plasmid can then be oriented based on a known alignment sequence.

First the alignment sequence is stored in a file 'align_seq.fa', and indexed using lastdb:

echo -e ">align_seq\nGAGACGGTCGACTGCGGCCGCTCGAGTCT" > align_seq.fa
lastdb -uRY4 -R01 align_seq.fa align_seq.fa
Then the alignment sequence is used to determine the rotation of the plasmid:

for bc in $(awk '{print $2}' barcode_counts.txt);
  do echo ${bc};
  scripts/fastx-rotate.pl \
    -o $(lastal align_seq.fa results/oriented_longest_circTrimmed_${bc}.fasta | \
      grep BC | awk '{print $3}') \
    results/oriented_longest_circTrimmed_${bc}.fasta \
      > results/rotated_longest_circTrimmed_${bc}.fasta;
done

Public workspacePlasmid Sequence Analysis from Long Reads V.9

Plasmid Sequence Analysis from Long Reads V.9