Alternative analysis method: Search for k-mers tolerating a single mismatch.
The code below works like the code above to search for k-mers of interest, but counts both exact matches and matches that tolerate a single nucleotide mismatch. Tolerating a mismatch can increase the read count, but it can also introduce noise into the data. In our hands, both methods of analysis have yielded similar results.
# Output CSV file with the following columns
echo "File,Count_Exact_String,Count_1_Mismatch_String" > Name_of_output.csv
# Directory containing the .fastq files to be searched
directory="/example/directory/fastqfiles"
# Nucleotide string to search for
string="NucleotideString"
# Loop through each .fastq file in the directory
for file in "$directory"/*.fastq; do
# Extract the filename from the path
filename=$(basename "$file")
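# Count reads with an exact match and reads with a match tolerating up to 1 mismatch
# (the second count includes the exact matches).
# Assumes standard 4-line FASTQ records: the sequence is line 2 of each record.
# Splitting records on "@" is unreliable because "@" can also occur in quality strings.
extracted_strings=$(awk -v pattern="$string" '
NR % 4 == 2 {
    # Check if the exact string is found in the read
    if (index($0, pattern) > 0) count_exact++
    # Check each window of the read for a match with up to 1 mismatch
    k = length(pattern)
    for (i = 1; i <= length($0) - k + 1; i++) {
        mismatches = 0
        for (j = 1; j <= k; j++) {
            if (substr($0, i + j - 1, 1) != substr(pattern, j, 1)) {
                if (++mismatches > 1) break
            }
        }
        # Count each read at most once, then move on to the next read
        if (mismatches <= 1) { count_mismatch++; break }
    }
}
END {
    # Print the counts as CSV (no trailing comma, to match the header)
    printf "%d,%d", count_exact, count_mismatch
}' "$file")
# Output counts to the CSV (same file as above)
echo "$filename,$extracted_strings" >> Name_of_output.csv
done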
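Before running the search on real data, the counting logic can be checked on a small toy FASTQ file with a known answer. The sketch below is a minimal example, not part of the original protocol. It assumes standard four-line FASTQ records and counts each read at most once. The toy reads, the k-mer "ACGTACGT", and the helper function count_kmer are hypothetical and serve only as an illustration.

```shell
#!/usr/bin/env bash
# Sanity check of mismatch-tolerant k-mer counting on a toy FASTQ file.
# The reads below are made up for illustration; they are not real data.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fastq" <<'EOF'
@read1
TTACGTACGTTT
+
IIIIIIIIIIII
@read2
TTACGAACGTTT
+
II@IIIIIIIII
@read3
TTAGGAACGTTT
+
IIIIIIIIIIII
EOF

# Hypothetical k-mer of interest
kmer="ACGTACGT"

# count_kmer KMER FASTQ_FILE
# Prints "exact,tolerant": reads containing the k-mer exactly, and reads
# containing it with at most 1 mismatch (exact matches included).
count_kmer() {
  awk -v pattern="$1" '
  NR % 4 == 2 {                       # sequence line of each 4-line record
      if (index($0, pattern) > 0) exact++
      k = length(pattern)
      for (i = 1; i <= length($0) - k + 1; i++) {
          mismatches = 0
          for (j = 1; j <= k; j++) {
              if (substr($0, i + j - 1, 1) != substr(pattern, j, 1)) {
                  if (++mismatches > 1) break
              }
          }
          if (mismatches <= 1) { tolerant++; break }
      }
  }
  END { printf "%d,%d\n", exact, tolerant }' "$2"
}

result=$(count_kmer "$kmer" "$tmpdir/toy.fastq")
echo "toy.fastq,$result"
rm -r "$tmpdir"
```

With these toy reads, read1 contains the k-mer exactly, read2 contains it with one mismatch, and read3 has at least two mismatches in every window, so the script prints toy.fastq,1,2. The "@" in the quality line of read2 does not affect the count because only sequence lines are inspected.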