| Data stage | Features of the stage | Pros | Cons | File format | Note |
|---|---|---|---|---|---|
| Clean reads | The original data | 1. Information is intact: potential heterozygous sites, sequencing errors, and coverage can all be recovered | 1. Big files (10 Mb * N)<br>2. Parsing the reads files requires extensive computational analysis<br>3. An extra step is needed to determine whether any samples should be excluded | fastq | File sizes are relative figures.<br>N = number of samples |
| Reads clustered by alignment | Reads clustered and aligned to the reference sequence (bwa). Reads are filtered according to standards. | 1. Putative contamination is discarded<br>2. Potential heterozygous sites, sequencing errors, and coverage information are retained | 1. Big files (4 Mb * N)<br>2. Computationally expensive (can easily generate a vcf file)<br>3. An extra step is needed to determine whether any samples should be excluded | bam | I think this file, together with reference_gene.fasta, would be the best choice. |
| Sequence assembly | Each cluster assembled within each sample | Small files (0.003 Mb * N) | 1. Heterozygosity information is lost; no sequencing-error or coverage information; alignment is still needed<br>2. An extra step is needed to determine whether any samples should be excluded | fasta | |
| Multiple alignment of assembled sequences | All the assembled sequences aligned together | 1. Small files (0.003 Mb * N)<br>2. Easy to analyse | Heterozygosity information is lost; no sequencing-error or coverage information | fasta | I am currently using this for the Inga and Geonoma analyses. |
| Variant calling file | Called variants | 1. Small files (0.001 Mb * N)<br>2. Easy to analyse | Heterozygosity information is lost; no sequencing-error or coverage information; no information on non-variant sites | fasta | |
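
Because every file size in the table scales linearly with N, the storage cost of each stage can be compared for a given number of samples. The following is a minimal sketch, assuming the per-sample figures are the relative sizes quoted in the table; the function name `estimate_sizes` and the example value of 100 samples are hypothetical and only serve as illustration.

```python
# Relative per-sample file sizes in Mb, copied from the table above.
# These are relative figures, not measured sizes.
PER_SAMPLE_MB = {
    "Clean reads": 10.0,
    "Reads clustered by alignment": 4.0,
    "Sequence assembly": 0.003,
    "Multiple alignment of assembled sequences": 0.003,
    "Variant calling file": 0.001,
}


def estimate_sizes(n_samples):
    """Return the estimated total size (Mb) of each stage for n_samples samples."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    # Each stage is (size per sample) * N, as stated in the table.
    return {stage: mb * n_samples for stage, mb in PER_SAMPLE_MB.items()}


if __name__ == "__main__":
    # Hypothetical example: 100 samples.
    for stage, total_mb in estimate_sizes(100).items():
        print(f"{stage}: {total_mb:g} Mb")
```

For any N, the two read-level stages are three to four orders of magnitude larger than the assembled or variant-level files, which is the trade-off against the information they retain.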