The program to detect true gene dropout (not to be confused with “dropped_genes” from step 22 above) is called “zeroes.py”. It requires only one input file, which must be created manually. The required input file is called “zeros.csv” and must contain the following column headers, in order: “Symbol”, “Observed_zeros”, “Total_samples”, and “Probability”. Rows under “Symbol” should contain the gene symbols, and rows under “Total_samples” should contain the total number of cohort samples, excluding outliers, that were examined when counting how many samples had an observed zero for a gene. For each gene, manually count the cohort samples for which the gene expression value in the output file
“step4_fpkm_uq.csv” containing counts was exactly equal to 0, that is, before 0 is substituted for negative normalized log2 values in later steps (any file from step15a onward). It is therefore imperative to use files generated before step15a, the step at which negative log2 values are replaced. The number of samples with a “true 0” should be entered in the column labeled “Observed_zeros”. The values entered under the “Probability” column header must be calculated externally from a reference cohort, such as TCGA data. Here it is acceptable to use raw TCGA count data to identify true zeros or, if the TCGA data have been run through the pipeline, to use the “step4_fpkm_uq.csv” file for calculating the probability. The probability equals the number of reference samples with a true zero divided by the total number of samples considered in the reference cohort, and should be a fraction ≤ 1.
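
For users who prefer not to tally zeros by hand, the counting and probability calculation described above can be sketched in Python. This is only an illustration and is not part of the pipeline. It assumes a layout for “step4_fpkm_uq.csv” that the protocol does not specify: one row per gene, the gene symbol in the first column, one column per sample, and no missing values. The function names, the `exclude` argument for outlier samples, and the in-memory example values are all hypothetical.

```python
import csv
import io


def count_true_zeros(matrix_file, exclude=()):
    """Return {gene: (n_zero, n_samples)} from a gene-by-sample CSV.

    Assumed layout (not confirmed by the protocol): the first column holds the
    gene symbol and every other column holds one sample. Sample columns named
    in `exclude` (for example, outliers) are ignored.
    """
    reader = csv.reader(matrix_file)
    header = next(reader)
    keep = [i for i, name in enumerate(header) if i > 0 and name not in exclude]
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [float(row[i]) for i in keep]
        counts[row[0]] = (sum(v == 0.0 for v in values), len(values))
    return counts


def build_zeros_rows(cohort_counts, reference_counts):
    """Combine cohort zero counts with reference-cohort probabilities."""
    rows = []
    for gene, (observed, total) in cohort_counts.items():
        ref_zero, ref_total = reference_counts.get(gene, (0, 0))
        if ref_total == 0:
            continue  # no reference data for this gene, so no probability
        # Probability = reference samples with a true zero / reference samples
        rows.append([gene, observed, total, ref_zero / ref_total])
    return rows


def write_zeros_csv(rows, out_file):
    """Write rows under the four column headers required for zeros.csv."""
    writer = csv.writer(out_file)
    writer.writerow(["Symbol", "Observed_zeros", "Total_samples", "Probability"])
    writer.writerows(rows)


# Made-up values, used only to show the expected shapes.
cohort_csv = "Symbol,S1,S2,S3,S4\nGENE_A,0,5.2,0,1.1\nGENE_B,3.0,2.0,0,4.0\n"
reference_csv = "Symbol,R1,R2,R3,R4,R5\nGENE_A,0,0,1,2,3\nGENE_B,1,1,1,1,0\n"

# S4 is treated as an outlier here and left out of the cohort counts.
cohort_counts = count_true_zeros(io.StringIO(cohort_csv), exclude=("S4",))
reference_counts = count_true_zeros(io.StringIO(reference_csv))
rows = build_zeros_rows(cohort_counts, reference_counts)

out = io.StringIO()
write_zeros_csv(rows, out)
print(out.getvalue())
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`, reading the cohort and reference “step4_fpkm_uq.csv” files and writing “zeros.csv”. The resulting file should still be checked by hand against the source data before it is passed to “zeroes.py”.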