License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: August 26, 2024
Last Modified: November 22, 2024
Protocol Integer ID: 112587
Keywords: docker, bioinformatics, dnalinux, fungi
Abstract
Protocol to annotate a fungal genome.
Setup
Install Docker
If you don't have Docker already, install it. There are two versions: Docker Engine (also known as Docker CE) and Docker Desktop. Docker Desktop is more user-friendly, but because it may require a commercial license for large enterprises, this tutorial is based on Docker Engine. Both versions work with this protocol. Linux users can install either Docker CE or Docker Desktop, while macOS and Windows users should install Docker Desktop.
You will need long reads in FASTQ format, paired short reads, and the assembly. In the following code, the assembly file is called assembly.fasta, the long-read file is called ID.fastq, and the short reads are two files (ID_R1.fastq.gz and ID_R2.fastq.gz).
If your short reads are split across more than two files, concatenate them so that you end up with two files. For example, if you have ID_L001_R1.fastq.gz, ID_L002_R1.fastq.gz, ID_L001_R2.fastq.gz, and ID_L002_R2.fastq.gz, you can concatenate them with these commands:
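A minimal sketch of that concatenation, assuming the lane files follow the ID_L00N_R1.fastq.gz naming shown above. The function name merge_lanes is hypothetical and not part of any tool; gzip-compressed files can be joined with cat without decompressing them first.

```shell
# merge_lanes ID: join all per-lane files for one sample into
# ID_R1.fastq.gz and ID_R2.fastq.gz.
# Hypothetical helper; file naming follows the example in the text.
merge_lanes() {
    id="$1"
    for mate in R1 R2; do
        # The glob expands in sorted order (L001, L002, ...), so lanes are
        # joined in the same order for R1 and R2 and read pairs stay in sync.
        cat "${id}"_L*_"${mate}".fastq.gz > "${id}_${mate}.fastq.gz"
    done
}
```

Called as merge_lanes ID from inside your_dir, this does the same as running cat ID_L001_R1.fastq.gz ID_L002_R1.fastq.gz > ID_R1.fastq.gz, followed by the equivalent command for the R2 files.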
All files should be inside a directory, for example: your_dir
Inside your_dir there should be three directories: funannotate_prep, funannotate and funannotate/ipsout.
You can create them with this command:
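A minimal sketch, assuming your_dir is given relative to the current working directory (substitute the real path to your base directory):

```shell
# -p creates your_dir and funannotate as needed and does nothing for
# directories that already exist.
mkdir -p your_dir/funannotate_prep your_dir/funannotate/ipsout
```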
Download FamDB HDF5 database, Interproscan database and GeneMark license
FamDB HDF5 database
The FamDB HDF5 database is needed for the RepeatMasker step. This database is partitioned by taxonomic group; the partition needed for Fungi is partition 0. For more information about partitions, read the file README.txt.
Bash commands to download, unzip, and move the database to /your_dir:
Interproscan database
This DB is needed for the Interproscan step.
Download the Interproscan DB from here (this file is larger than 5 GB).
Commands to download and untar:
If you don't have a GeneMark license, get it from this page. The license key file should be named gm_key and located in /your_dir. This license is needed to run the Funannotate Predict step.
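Before starting the pipeline, it can help to confirm that everything listed in this section is in place. The sketch below is a hypothetical helper (check_inputs is not part of any tool). It assumes that dfam38_full.0.h5 sits directly in the base directory after the FamDB step, and it does not check the Interproscan database because its location is not stated here.

```shell
# check_inputs DIR: report any expected input that is missing from DIR.
# Hypothetical helper; returns 0 when everything is present, 1 otherwise.
check_inputs() {
    dir="$1"
    missing=0
    # Input files named in this protocol. dfam38_full.0.h5 is assumed to
    # sit directly in DIR after the FamDB download step.
    for f in assembly.fasta ID.fastq ID_R1.fastq.gz ID_R2.fastq.gz \
             gm_key dfam38_full.0.h5; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f" >&2
            missing=1
        fi
    done
    # Working directories used by the Funannotate steps.
    for d in funannotate_prep funannotate funannotate/ipsout; do
        if [ ! -d "$dir/$d" ]; then
            echo "missing directory: $dir/$d" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_inputs /your_dir prints one line per missing item and returns a non-zero status if anything is absent.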
Run sspace_longread
Run the following command (replace /your_dir with the base directory where you have your data).
Run Gapcloser
Run the following command (replace /your_dir with the base directory where you have your data).
Run BWA Index
Run the following command (replace /your_dir with the base directory where you have your data).
Run fastp
Run the following command (replace /your_dir with the base directory where you have your data).
Run BWA mem
Run the following command (replace /your_dir with the base directory where you have your data). Replace CPU with your CPU count.
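If you are not sure of the CPU count, the sketch below is one way to look it up. It assumes either nproc (GNU coreutils, standard on Linux) or getconf (available on macOS) is installed; the variable name CPU simply mirrors the placeholder used in this protocol.

```shell
# Use nproc where available (Linux); fall back to getconf (e.g. macOS).
CPU="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
echo "CPU=${CPU}"
```

The same value can be reused wherever a later step asks for the CPU count.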
Run SAMTOOLS
SAMTOOLS View, Sort and Index
Run the following command (replace /your_dir with the base directory where you have your data).
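As a rough illustration of what the view, sort and index sub-steps do, the sketch below chains the three samtools subcommands. It assumes samtools is on the PATH (inside the container or on the host) and that the BWA mem step produced a SAM file. The function name and all file names are hypothetical placeholders, not the ones used by this protocol.

```shell
# sam_to_sorted_bam IN.sam OUT_PREFIX: convert, sort and index an alignment.
# Hypothetical helper; file names are placeholders, not the protocol's own.
sam_to_sorted_bam() {
    sam="$1"
    prefix="$2"
    # view: convert the SAM text file to compressed BAM
    samtools view -b -o "${prefix}.bam" "$sam" || return 1
    # sort: order alignments by reference coordinate
    samtools sort -o "${prefix}.sorted.bam" "${prefix}.bam" || return 1
    # index: write ${prefix}.sorted.bam.bai next to the sorted BAM
    samtools index "${prefix}.sorted.bam"
}
```

For example, sam_to_sorted_bam aln.sam aln would leave aln.sorted.bam and its index in the current directory.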
Pilon
Run the following command (replace /your_dir with the base directory where you have your data).
Funannotate
Funannotate Clean and Sort
Run the following command (replace /your_dir with the base directory where you have your data).
RepeatMasker
Run the following command (replace /your_dir with the base directory where you have your data). Remember that this step requires the dfam38_full.0.h5 database to be installed in a directory that must be called /ftmp inside the Docker container.
Funannotate Predict
Run the following command (replace /your_dir with the base directory where you have your data). Replace CPU with your CPU count.
Interproscan
Run the following command (replace /your_dir with the base directory where you have your data). Replace CPU with your CPU count.
Funannotate Annotate
Run the following command (replace /your_dir with the base directory where you have your data).