This workflow details how COI and 18S rRNA gene raw amplicon sequencing data from the European ARMS programme (ARMS-MBON) can be processed bioinformatically to generate read count and taxonomy tables of molecular operational taxonomic units (mOTUs) for the identification of (marine) non-indigenous species (NIS). The end products may also be used for any other diversity analysis that suits the user. The pipeline can also be adapted to work with amplicon sequence variants (ASVs) instead of mOTUs by omitting and adjusting certain steps.
The data used here comprise all publicly available COI and 18S sequencing data from ARMS-MBON as of February 2024.
Processes described in this pipeline were executed on Unix and Windows OS. Certain steps (especially software installation) may differ between operating systems. Some computationally intensive steps were run on a high-performance computing cluster. This is noted in the respective sections of this workflow.
Note that in this workflow, separate directories were created for each marker gene. Make sure that the required input files (i.e., the files produced in the corresponding preceding step) are in the respective directory.
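The per-marker directory layout can be sketched in Python as follows. This is a minimal illustration, not part of the original workflow: the directory names `COI` and `18S`, the helper `missing_inputs` and the file names in the usage note are assumptions chosen for the example.

```python
from pathlib import Path

# Hypothetical directory names, one per marker gene (assumption for this sketch).
MARKERS = ("COI", "18S")


def missing_inputs(base_dir, required):
    """Create one directory per marker gene and report absent input files.

    `required` maps a marker name to the file names expected in its directory
    (i.e., the outputs of the preceding step). Returns a dict that maps each
    marker to the list of expected files that were not found.
    """
    missing = {}
    for marker in MARKERS:
        marker_dir = Path(base_dir) / marker
        # Create the marker directory if it does not exist yet.
        marker_dir.mkdir(parents=True, exist_ok=True)
        absent = [
            name
            for name in required.get(marker, ())
            if not (marker_dir / name).is_file()
        ]
        if absent:
            missing[marker] = absent
    return missing
```

For example, `missing_inputs(".", {"COI": ["seqtab.rds"]})` would create `COI/` and `18S/` in the current directory and report `seqtab.rds` as missing for COI until that (hypothetical) file has been placed there.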
References to all data, software, packages and databases used in this workflow (please cite any of the tools used in your analysis):
ARMS-MBON (Obst et al., 2020)
R v4.1.0 and v4.3.1 (R Core Team, 2021, 2023) (v4.3.1 was used for dada2 processing and COI numt-removal)
RStudio 2022.07.1 (RStudio Team, 2022)
cutadapt v4.5 (Martin, 2011)
Git v2.37.3 (Chacon & Straub, 2014)
Python v3.11.4 (Van Rossum & Drake, 2009)
MACSE v2.05 (Ranwez et al., 2018)
swarm v3.0.0 (Mahé et al., 2015)
NCBI BLAST (Johnson et al., 2008)
BLAST+ release 2.11.0 (Camacho et al., 2009)
BOLD (Ratnasingham & Hebert, 2007)
BOLDigger-commandline v2.2.1 (Buchner & Leese, 2020)
SeqKit v2.5.1 (Shen et al., 2016)
MIDORI2 (Leray et al., 2022)
MIDORI2 webserver (Leray et al., 2018)
GenBank release 257 (Benson et al., 2012)
RDP classifier (Wang et al., 2007)
Silva taxonomic training data formatted for DADA2 (Callahan, 2018) (Silva v132; Quast et al., 2013)
SILVA v128 and v132 dada2 formatted 18s 'train sets' (Morien & Parfrey, 2018) (Silva v128 and v132; Quast et al., 2013)
Protist Ribosomal Reference database (PR2) v5.0.0 (Guillou et al., 2013)
World Register of Marine Species (WoRMS) (Ahyong et al., 2023)
World Register of Introduced Marine Species (WRiMS) (Rius et al., 2023)
Microsoft Excel 2016 (Microsoft Corporation, 2016)
argparse v2.2.2 (Davis, 2023)
dada2 v1.28.0 (Callahan et al., 2016)
ShortRead v1.58.0 (Morgan et al., 2009)
Biostrings v2.68.1 (Pagès et al., 2020)
ggplot2 v3.4.2 and v3.4.3 (Wickham, 2016) (v3.4.3 was used in the dada2 workflow to plot read quality profiles)
ensembleTax v1.2.2 (Catlett et al., 2023)
tidyr v1.3.0 (Wickham et al., 2023)
dplyr v1.0.9 and v1.1.3 (Wickham et al., 2022, 2023) (v1.1.3 was used during COI numt-removal)
stringr v1.5.0 (Wickham, 2022)
devtools v2.4.3 (Wickham et al., 2021)
hiReadsProcessor v1.29.1 and v1.36.0 (Malani, 2021) (v1.36.0 was used during COI numt-removal)
seqinr v4.2.30 (Charif & Lobry, 2007)
remotes v2.4.2 (Csárdi et al., 2021)
LULU v0.1.0 (Frøslev et al., 2017)
readxl v1.4.0 (Wickham & Bryan, 2022)
phyloseq v1.36.0 (McMurdie & Holmes, 2013)
vegan v2.6.2 (Oksanen et al., 2023)
ggpubr v0.4.0 (Kassambara, 2020)
data.table v1.14.2 (Dowle & Srinivasan, 2021)
xlsx v0.6.5 (Dragulescu & Arendt, 2020)
plyr v1.8.7 (Wickham, 2011)
geosphere v1.5.18 (Hijmans, 2022)
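Because results can depend on tool versions, it may help to compare an installation against the versions listed above before running the workflow. The following Python sketch shows one way to do this for two of the listed tools; the helper names `version_mismatches` and `detect_versions` are hypothetical, and it assumes that `cutadapt --version` prints only the version number.

```python
import platform
import shutil
import subprocess

# Versions taken from the list above (only a subset is checked in this sketch).
PINNED = {"python": "3.11.4", "cutadapt": "4.5"}


def version_mismatches(pinned, found):
    """Return {tool: (expected, found)} for tools that differ or are missing."""
    return {
        tool: (expected, found.get(tool))
        for tool, expected in pinned.items()
        if found.get(tool) != expected
    }


def detect_versions():
    """Collect the versions of the running Python and of cutadapt, if on PATH."""
    found = {"python": platform.python_version()}
    if shutil.which("cutadapt"):
        # Assumption: `cutadapt --version` writes just the version to stdout.
        result = subprocess.run(
            ["cutadapt", "--version"], capture_output=True, text=True
        )
        found["cutadapt"] = result.stdout.strip()
    return found
```

Calling `version_mismatches(PINNED, detect_versions())` returns an empty dictionary when both tools match the listed versions. R packages would need a separate check within R.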
Ahyong, S., Boyko, C. B., Bailly, N., Bernot, J., Bieler, R., Brandão, S. N., Daly, M., De Grave, S., Gofas, S., Hernandez, F., Hughes, L., Neubauer, T. A., Paulay, G., Boydens, B., Decock, W., Dekeyzer, S., Vandepitte, L., Vanhoorne, B., Adlard, R., … Zullini, A. (2023). World Register of Marine Species (WoRMS). WoRMS Editorial Board. https://www.marinespecies.org
Benson, D. A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2012). GenBank. Nucleic Acids Research, 41(D1), D36–D42. https://doi.org/10.1093/nar/gks1195
Buchner, D., & Leese, F. (2020). BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems. Metabarcoding and Metagenomics, 4, e53535. https://doi.org/10.3897/MBMG.4.53535
Callahan, B. (2018). Silva taxonomic training data formatted for DADA2 (Silva version 132). Zenodo. https://doi.org/10.5281/zenodo.1172783
Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581–583. https://doi.org/10.1038/nmeth.3869
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics, 10(1), 1–9. https://doi.org/10.1186/1471-2105-10-421
Catlett, D., Son, K., & Liang, C. (2023). ensembleTax: Ensemble Taxonomic Assignments of Amplicon Sequencing Data.
Chacon, S., & Straub, B. (2014). Pro git. Apress.
Charif, D., & Lobry, J. R. (2007). SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In U. Bastolla, M. Porto, H. E. Roman, & M. Vendruscolo (Eds.), Structural approaches to sequence evolution: Molecules, networks, populations (pp. 207–232). Springer Verlag.
Csárdi, G., Hester, J., Wickham, H., Chang, W., Morgan, M., & Tenenbaum, D. (2021). remotes: R Package Installation from Remote Repositories, Including “GitHub.” https://cran.r-project.org/package=remotes
Davis, T. L. (2023). argparse: Command Line Optional and Positional Argument Parser. https://cran.r-project.org/package=argparse
Dowle, M., & Srinivasan, A. (2021). data.table: Extension of `data.frame`. https://cran.r-project.org/package=data.table
Dragulescu, A., & Arendt, C. (2020). xlsx: Read, Write, Format Excel 2007 and Excel 97/2000/XP/2003 Files. https://cran.r-project.org/package=xlsx
Frøslev, T. G., Kjøller, R., Bruun, H. H., Ejrnæs, R., Brunbjerg, A. K., Pietroni, C., & Hansen, A. J. (2017). Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates. Nature Communications, 8(1), 1–11. https://doi.org/10.1038/s41467-017-01312-x
Guillou, L., Bachar, D., Audic, S., Bass, D., Berney, C., Bittner, L., Boutte, C., Burgaud, G., De Vargas, C., Decelle, J., Del Campo, J., Dolan, J. R., Dunthorn, M., Edvardsen, B., Holzmann, M., Kooistra, W. H. C. F., Lara, E., Le Bescot, N., Logares, R., … Christen, R. (2013). The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy. Nucleic Acids Research, 41(D1), D597–D604. https://doi.org/10.1093/NAR/GKS1160
Hijmans, R. J. (2022). geosphere: Spherical Trigonometry. https://cran.r-project.org/package=geosphere
Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., & Madden, T. L. (2008). NCBI BLAST: a better web interface. Nucleic Acids Research, 36(Web Server), W5–W9. https://doi.org/10.1093/nar/gkn201
Kassambara, A. (2020). ggpubr: “ggplot2” Based Publication Ready Plots. R package version 0.4.0. https://cran.r-project.org/package=ggpubr
Leray, M., Ho, S. L., Lin, I. J., & Machida, R. J. (2018). MIDORI server: a webserver for taxonomic assignment of unknown metazoan mitochondrial-encoded sequences using a curated database. Bioinformatics, 34(21), 3753–3754. https://doi.org/10.1093/BIOINFORMATICS/BTY454
Leray, M., Knowlton, N., & Machida, R. J. (2022). MIDORI2: A collection of quality controlled, preformatted, and regularly updated reference databases for taxonomic assignment of eukaryotic mitochondrial sequences. Environmental DNA, 4(4), 894–907. https://doi.org/10.1002/EDN3.303
Mahé, F., Rognes, T., Quince, C., de Vargas, C., & Dunthorn, M. (2015). Swarm v2: Highly-scalable and high-resolution amplicon clustering. PeerJ, 3, e1420. https://doi.org/10.7717/PEERJ.1420
Malani, N. V. (2021). hiReadsProcessor: Functions to process LM-PCR reads from 454/Illumina data.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal, 17(1), 10–12. https://doi.org/10.14806/ej.17.1.200
McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS ONE, 8(4), e61217. https://doi.org/10.1371/JOURNAL.PONE.0061217
Morgan, M., Anders, S., Lawrence, M., Aboyoun, P., Pagès, H., & Gentleman, R. (2009). ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics, 25, 2607–2608. https://doi.org/10.1093/bioinformatics/btp450
Morien, E., & Parfrey, L. W. (2018). SILVA v128 and v132 dada2 formatted 18s “train sets.” Zenodo. https://doi.org/10.5281/zenodo.1447330
Obst, M., Exter, K., Allcock, A. L., Arvanitidis, C., Axberg, A., Bustamante, M., Cancio, I., Carreira-Flores, D., Chatzinikolaou, E., Chatzigeorgiou, G., Chrismas, N., Clark, M. S., Comtet, T., Dailianis, T., Davies, N., Deneudt, K., de Cerio, O. D., Fortič, A., Gerovasileiou, V., … Pavloudi, C. (2020). A Marine Biodiversity Observation Network for Genetic Monitoring of Hard-Bottom Communities (ARMS-MBON). Frontiers in Marine Science, 7, 1031. https://doi.org/10.3389/FMARS.2020.572680
Pagès, H., Aboyoun, P., Gentleman, R., & DebRoy, S. (2020). Biostrings: Efficient manipulation of biological strings. https://bioconductor.org/packages/Biostrings
Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., Peplies, J., & Glöckner, F. O. (2013). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research, 41(D1), D590–D596. https://doi.org/10.1093/NAR/GKS1219
R Core Team. (2021, 2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.r-project.org/
Ranwez, V., Douzery, E. J. P., Cambon, C., Chantret, N., & Delsuc, F. (2018). MACSE v2: Toolkit for the Alignment of Coding Sequences Accounting for Frameshifts and Stop Codons. Molecular Biology and Evolution, 35(10), 2582–2584. https://doi.org/10.1093/MOLBEV/MSY159
Ratnasingham, S., & Hebert, P. D. N. (2007). BOLD: The Barcode of Life Data System (http://www.barcodinglife.org). Molecular Ecology Notes, 7(3), 355–364. https://doi.org/10.1111/J.1471-8286.2007.01678.X
Rius, M., Ahyong, S., Bieler, R., Boudouresque, C., Costello, M. J., Downey, R., Galil, B. S., Gollasch, S., Hutchings, P., Kamburska, L., Katsanevakis, S., Kupriyanova, E., Lejeusne, C., Marchini, A., Occhipinti, A., Pagad, S., Panov, V. E., Poore, G. C. B., Robinson, T. B., … Zhan, A. (2023). World Register of Introduced Marine Species (WRiMS). WoRMS Editorial Board. https://www.marinespecies.org/introduced
RStudio Team. (2022). RStudio: Integrated Development Environment for R. RStudio, PBC. http://www.rstudio.com/
Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE, 11(10), e0163962. https://doi.org/10.1371/journal.pone.0163962
Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. CreateSpace.
Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267. https://doi.org/10.1128/AEM.00062-07
Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1–29. https://www.jstatsoft.org/v40/i01/
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org
Wickham, H. (2022). stringr: Simple, Consistent Wrappers for Common String Operations. https://cran.r-project.org/package=stringr
Wickham, H., & Bryan, J. (2022). readxl: Read Excel Files. https://cran.r-project.org/package=readxl
Wickham, H., François, R., Henry, L., & Müller, K. (2022, 2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.2. https://cran.r-project.org/package=dplyr
Wickham, H., Hester, J., Chang, W., & Bryan, J. (2021). devtools: Tools to Make Developing R Packages Easier. https://cran.r-project.org/package=devtools
Wickham, H., Vaughan, D., & Girlich, M. (2023). tidyr: Tidy Messy Data. https://cran.r-project.org/package=tidyr