Alon S*, Goodwin DR*, Sinha A*, Wassie AT*, Chen F*, Daugharthy ER**, Bando Y, Kajita A, Xue AG, Marrett K, Prior R, Cui Y, Payne AC, Yao CC, Suk HJ, Wang R, Yu CJ, Tillberg P, Reginato P, Pak N, Liu S, Punthambaker S, Iyer EPR, Kohman RE, Miller JA, Lein ES, Lako A, Cullen N, Rodig S, Helvie K, Abravanel DL, Wagle N, Johnson BE, Klughammer J, Slyper M, Waldman J, Jané-Valbuena J, Rozenblatt-Rosen O, Regev A; IMAXT Consortium, Church GM***+, Marblestone AH***, Boyden ES***+ (2021) Expansion Sequencing: Spatially Precise In Situ Transcriptomics in Intact Biological Systems, Science 371(6528):eaax2656. (* equal contribution, ** key contributions to early stages of project, *** equal contribution, +co-corresponding authors)
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
This protocol is shared as an Open Access protocol (under HTAN's definitions). This protocol is Subject to IP Restrictions.
Abstract
Figure 1. Summary of padlock probe structure, barcode design, and barcode readout processes. (A) Top: architecture of padlock probe. Bottom: amplicon sequence after RCA. Note that the amplicon sequence is the reverse complement of the probe sequence. (B) Barcodes are defined in a logical space, consisting of a sequence of the numbers 0-3, i.e. 2012310. These logical barcodes are implemented as specific nucleotide sequences for readout with different sequencing chemistries. The details of the mapping between logical barcodes and Illumina sequencing and SOLiD sequencing barcode nucleotides and fluorophores are shown here. (C) The specific implementation of the barcode structure that is discussed in this protocol. The barcode library consists of length 7 barcodes with Hamming distance 3. A subset of the total barcode library with length 4 and Hamming distance 2 is embedded within this library. (D) An illustration of the process of four rounds of Illumina sequencing by synthesis on an example sequence. (E) An illustration of two rounds of SOLiD sequencing by ligation on on an example sequence. (F) The specific implementation of the logical barcoding scheme in (C) for nucleotide sequences read out through the Illumina chemistry. (G) Like (F), but specifying the nucleotide sequences for the SOLiD chemistry implementation discussed here.
This protocol accompanies Expansion Sequencing (ExSeq), and describes the process of generating padlock probes for targeted ExSeq. The steps detailed here are a generalization of the protocol used to generate padlock probes for figures 4-6, and represent our recommendations for future users of the technology. This protocol has associated code (available at https://github.com/anusinha/TargetedExSeqProbeGen). We have added additional technical explanations here.
As a technical summary of the experimental steps of the method, targeted ExSeq begins with physical expansion of a sample of interest (using the chemistry of expansion microscopy), resulting in a tissue-hydrogel composite with the RNA molecules covalently anchored to the hydrogel. An in situ sequencing library is then prepared from a set of transcripts of interest. The library preparation begins with the hybridization of oligonucleotide padlock probes bearing barcodes to RNA transcripts of interest. Multiple padlock probes target one gene, and each of those padlocks have the same barcode identifying that gene. Each gene has a unique barcode.
The padlock probes are subsequently enzymatically ligated and amplified using rolling circle amplification (RCA), producing amplicons (often called RCA colonies, or "rolonies" colloquially). The amplified barcode sequences on the amplicon are read out through the process of in situ sequencing, in which sequencing reagents are deployed within the gel. Through iterative rounds of chemistry and imaging, the barcode sequence is read out, one base at a time. The imaging data is processed through the ExSeq image processing pipeline (https://github.com/dgoodwin208/ExSeqProcessing/wiki), resulting in a map of reads in 3D space.
This protocol describes the process by which padlock probes are generated. Because the experimental design governs the design of the padlock probes, this protocol is necessarily less prescriptive than the other protocols.
The overall architecture of the padlock probes is shown in Fig. 1A (top). The probes are doubly barcoded, enabling the user to read out the barcode sequence either via SOLiD or Illumina sequencing chemistry. The probes consist of the following sequence regions: (1) a split homology region (H1, H2), which hybridizes to the transcript of interest; (2) a barcode for SOLiD sequence readout that is 5' to the Seq Primer; (3) a barcode for Illumina sequencing that is 3' to the Seq primer; and (4) a universal RCA/Seq primer binding site. After RCA, the amplicon concatamer Fig. 1A (bottom) is produced. Note that the concatamer is the reverse complement of the probe sequence. In situ sequencing is then performed on the concatamer.
The padlock probe generation process consists of five stages: (1) Target selection; (2) Barcode generation; (3) Barcode assignment; (4) Homology region generation; (5) Probe assembly and export.
The first stage of the design is target selection of genes of interest. As this is dependent on the scientific question, generalizable advice is difficult to provide. This protocol assumes that the user has already selected a set of transcripts of interest. Relevant considerations are provided in Step 2.
The second and third stages of the design process are closely linked. In these steps, the barcode space is defined (Steps 3-5), and those sequences are assigned to transcripts (Steps 6-7).
In the fourth stage, the homology regions of the padlock probes that hybridize to the RNA are identified (Steps 8-10). In the fifth stage, the homology regions, barcode sequences, and backbone sequences are assembled to form the final probe (Steps 11-14).
The probes are then ordered and pooled together to form the pooled padlock probe libraries (Steps 15-16), used in the Targeted ExSeq Library Preparation protocol.
To enable multiple sequencing chemistries to be used, barcodes are designed in a logical space, which is agnostic to the sequencing chemistry. At the time of assembly, these logical barcodes are converted to nucleotide sequences, which are read out during in situ sequencing with the SOLiD or Illumina sequencing chemistries. The logical barcodes are a repeated sequence of the numbers {0, 1, 2, 3}. For example, 2012310 is a logical barcode of length 7, in which the first base of the barcode is 2, the second is 0, the third is 1, and so on.
The mapping between logical barcodes and nucleotide sequences is shown in Fig. 1B. The details of this mapping are discussed in further detail in Steps 11-14. Examples of these mappings being used in the in situ sequencing process are shown in Fig. 1D (Illumina sequencing) and Fig. 1E (SOLiD sequencing).
To constrain this design space for this protocol, we focus on a specific implementation of the padlock probe, shown in Fig. 1C.
First, the barcode design: we focus on barcodes of length 7, with Hamming distance 3, which enables 1-bit error correction, and 2-bit error detection. This barcode library enables the interrogation of ~330 targets with 7 rounds of sequencing. We also discuss a subset of this barcode library, which enables the interrogation of up to ~40-50 targets with four rounds of sequencing. This subset is of length 4, with Hamming distance 2, enabling 1-bit error detection. The embedded subset is useful for smaller experiments, with 10-50 targets, where the readout can be performed with four rounds of sequencing. Notably, this subset of barcodes is extensible, i.e. additional genes can be added to the library if 7 rounds of sequencing are performed.
Second: we design doubly barcoded probes. These padlock probes carry two barcode sequences for readout using either the Illumina or SOLiD sequencing chemistries. The implementation details on the mapping between logical barcodes and the specific nucleotide sequences are shown in Fig. 1F (Illumina) and Fig. 1G (SOLiD).
Third: the total probe length is 74 nucleotides, corresponding to the design criterion of a padlock probe being at least 2*n+10 nucleotides long, where n is the total length of the homology region (in this case, n=32 nt).
Fourth: we generate up to 16 probes per transcript, which we find strikes a good balance between yield and cost.
This protocol was used (with an earlier version of the codebase) to design probes to profile human metastatic breast cancer biopsies as a part of the Human Tumor Atlas Pilot Project (HTAPP). For the design of the SOLiD sequencing barcodes, the alternative sequential barcode design was utilized (see Step 12.2).
Worked Example
Step 17 is a worked example of generating padlock probes for AKT1, CDK7, and KRT17, using pre-prepared input files. The output results of the script can be compared to reference results.
Image Attribution
All images (c) Massachusetts Institute of Technology.
Guidelines
This protocol assumes familiarity with MATLAB.
We have encountered issues with the probe generation if the working directory is synced with Dropbox.
Materials
Software requirements/dependencies:
MATLAB with Bioinformatics Toolbox
R with DNABarcodes library
Legacy BLAST (2.2.26) and BLAST+ (2.2.26) from NCBI
This script has been tested primarily with MATLAB R2019b on Windows 10. Future versions of MATLAB may not support blastlocal (within the Bioinformatics Toolbox), but previous releases will retain this functionality. MATLAB does not support blastlocal in MacOS 10.15 (Catalina) and later (see note in Step 1.1).
Experimental:
Basic lab wetware: microcentrifuge tubes, 50 mL tubes, pipette tips
Safety warnings
Please carefully read all safety datasheets for all reagents used in the protocol, and perform all steps in accordance with relevant guidelines.
Preparation
Preparation
1h
1h
Setting up BLAST with MATLAB and Building BLAST Database
The selection of appropriate homology regions for probe generation screens candidate transcript regions against the transcriptome to eliminate sequences with off-target homology. In this section, BLAST is installed locally, the files needed to generate a species-specific BLAST database are downloaded, and the database is generated.
In addition, the scripts for the rest of the protocol are downloaded from GitHub.
45m
Legacy BLAST Installation
Legacy versions of BLAST are used to generate the BLAST database (using BLAST+ 2.2.26) and run the BLAST query with the blastlocal command in MATLAB (using BLAST 2.2.26). The reference page for blastlocal states that only version 2.2.17 is supported; in practice, BLAST 2.2.26 works.
Save the extracted files to an appropriate location, and add this location to the MATLAB path and the system path variables.
To add the location to the MATLAB path: navigate to the Home tab, then Set Path (in the Environment section of the menu). In the window that pops up, click on Add Folder and navigate to the directory where the executable files have been installed. Then click Save to add the location to the MATLAB path.
Modifying the system path varies by operating system. For Windows, this support page may be useful. For OS X, this webpage may be useful.
Warning: blastlocal is NOT compatible with Mac OS X 10.15 or higher. Additionally, blastlocal is slated to be removed in a future MATLAB release. When this occurs, an older version of MATLAB could be used to run the code. Alternatively, the dependence could be replaced with a system call, with appropriate parsing of the output. The underlying system call from blastlocal is "blastall -i "temp.fa" -p blastn -d <refseqdatabase>.fna -S 1 -e E_VALUE" (see Step 9), which is parsed by the blastreadlocal.m built in MATLAB function. If blastreadlocal.m is removed, the function could be copied from an older version of MATLAB's Bioinformatics toolkit to parse the system call output.
20m
Downloading the RefSeq mRNA Library for Species of Interest
Download the appropriate compressed sequence files for the organism of interest from the RefSeq FTP site ( https://ftp.ncbi.nlm.nih.gov/refseq/ ). Navigate to the correct organism (e.g. H_sapiens for human), then to the mRNA_Prot directory.
Download all of the *.rna.fna.gz files, extract them, then concatenate the *.fna files together together.
E.g. in bash/Mac OS Terminal:
Command
cat *.fna > humanrefseq.fna
In Windows Command Prompt:
Command
type *.rna.fna > humanrefseq.txt
ren humanrefseq.txt humanrefseq.fna
Note that the concatenated file must first be output to a file with a different file extension; if the output of the type command were output to a .fna file, Command Prompt would have included that output .fna file as an additional input file. The concatenated file is saved to a txt file, then renamed to the fna extension.
20m
Generating BLAST Database for Species of Interest
We now generate the BLAST database using the BLAST tools.
In the terminal, navigate to the directory where the concatenated sequences file is saved, and run the following command to generate the database (where the input file is named as determined by the previous step).
Command
makeblastdb -in humanrefseq.fna -dbtype nucl
Save the resulting .nhr, .nin, .nsq files in the working directory for probe generation.
This protocol assumes that the user has selected a set of genes of interest with their RefSeq Nucleotide accession numbers. Some relevant considerations for gene selection are described here.
Transcript length: it is easier to generate more probes for longer transcripts. A larger number of probes increases the detection efficiency. However, it is still possible to generate probes to detect shorter transcripts. We generally aim to generate 16 probes per transcript, but have had good results in tissue with 8 probes per transcript. As a rough number, there is a good chance of generating 16 probes for most transcripts that are >1.5 kb in length.
Splice variants: for transcripts with multiple splice variants, the RefSeq annotations can aid in selecting the specific splice variant to use. A common strategy is to look for the transcript that has the most number of shared exons across all transcript variants (roughly the intersection of exons). This is often the shortest variant, but not necessarily always.
For sequences that are not in RefSeq (i.e. custom sequences), they can be saved in GenBank format and added to the generated GenBank file, with a custom string identifier in the locus name and accession number fields (i.e. "Plasmid001"). This can be used in place of the accession number in subsequent steps.
5m
Downloading Transcript Sequences From RefSeq
Transcript sequences can be downloaded from RefSeq in the GenBank format.
Prepare a comma separated list of all accession numbers (e.g. for the genes AKT1, CDK7, and KRT17 in human: NM_001014432.1, NM_001799.4, NM_000422.3) and copy that list into the search field of the RefSeq Nucleotide search.
From search results, click on: Send to -> Complete Record -> File -> GenBank (full) -> Create File.
Download GenBank file and save to working directory.
5m
Barcoding Library Design
Barcoding Library Design
Overview of Barcoding Scheme
In a subsequent step, barcodes will be assigned to the genes of interest. Here, the space of potential barcodes is defined.
Barcodes are designed in a logical space. The logical barcode is a sequence of numbers from 0 to 3, where each number represents a color that will be read out during the in situ sequencing process, i.e. 2012310 is a potential barcode of length 7, where the order of colors detected is: 2, 0, 1, 2, 3, 1, 0. The specifics of which physical color (dyes, filters) depends on the sequencing chemistry.
This level of abstraction is useful because the mapping between a logical barcode and the physical nucleotide sequence can be non-obvious. Designing the barcode in the logical space helps to compartmentalize the implementation details to a subsequent step. It also enables extensibility -- if a new sequencing chemistry is used, the barcoding scheme doesn't need to be changed. Only the additional code for the mapping between logical barcodes and nucleotide sequences for the new chemistry needs to be added.
The barcode library consists of the set of individual barcodes that could potentially be assigned to a transcript. The library is an implementation of an overall barcoding scheme, a specification of properties that the library must satisfy. The typical properties considered are length and minimum Hamming distance.
The length determines the total number of rounds of sequencing needed to read out the barcode. The Hamming distance between two barcodes is the number of positions at which the two barcodes differ. As an example, the barcodes 0303101 and 1201301 are both length 7, and have a Hamming distance of 4, as they differ in position 1, 2, 4, and 5. The minimum Hamming distance for a set of barcodes is the minimum Hamming distance between any two different barcodes in the set, and determines the error-robustness and error-correction properties. For a barcode library with minimum Hamming distance of k, the number of bit errors that can be detected is k-1, and the number of bit errors that can be corrected is ⌊2k−1⌋.
Here, several different options for generating a barcode library from a barcode scheme are presented. Option 1 is to use a pre-designed library, which has several useful properties. Option 2 is to create a custom barcode library from a specification. The primary focus is on generating barcode libraries with length 7 and Hamming distance 3 as in the specification in Fig. 1C.
[Option 1] Using a Pre-Generated Barcode Library
A pre-generated barcode library is saved in the GitHub repository (prioritizedBarcodeLibrary.csv). The library was generated by following a variant of the process described in Option 2A.
The barcodes in the library are of length 7, and the library has a minimum Hamming distance of 3. There are 326 barcodes in the library. In addition, there is a subset of 58 barcodes, that, when considering only the first four positions of the barcode, have a minimum Hamming distance of 2. As described in the abstract, this enables smaller scale experiments to be performed that interrogate <= 58 targets, that can be read out in four rounds of sequencing. An overview of the barcode structure is shown in Fig. 1C.
The barcodes are prioritized as follows:
Priority 1: this set of 47 barcodes satisfies the following properties:
contain at least three unique values within the first four digits of the barcode
are approximately base-balanced (as a set)
these barcodes, along with those in priority 2 contain the subset library of length 4, Hamming distance 2
Priority 2: this set of 11 barcodes is the remaining set of barcodes in the subset library with the embedded length 4, Hamming distance 2 code.
Priority 3: this set of 258 barcodes contains at least three unique values within the entire length 7 barcode.
Priority 4: this set of 10 barcodes is the remaining set of barcodes from the original set. They have low complexity, containing only two unique values within the barcode.
For experiments with <= 47 targets, priority 1 can be used. For experiments with <= 58 targets, priority 1 and 2 can be used. For larger experiments, priorities 1-3 can be used, with the inclusion of priority 4 only if necessary.
The file (prioritizedBarcodeLibrary.csv) contains 327 rows, including a header row. The file contains three columns: barcode sequence (in the logical barcode-space), priority, and diversity (number of distinct values in the barcode sequence). Note that the barcodes are saved with a preceding single-quote (i.e. '0132200) to avoid Excel interpreting the barcode as a number and dropping leading zeros.
Technical Note on Library Generation
To generate this library, the script in option 2A was run many times, generating sets of length 7, minimum Hamming distance 3 barcode libraries. Each of these barcode libraries were split into a priority A subset, containing all barcodes with at least three unique barcode values (from 0 to 3), and a priority B subset containing the remaining barcodes.
The script was also used to generate a length 4, minimum Hamming distance 2 barcode library, and split it similarly into priority 1 and priority 2 subsets, based on the same diversity criterion.
The full-length barcode libraries were queried to find the library that contained the largest number of elements from the length 4, hamming distance 2 barcode library in the priority A subset, and the best library was selected.
The priority 1 barcodes were identified by finding priority 1 length-4 barcodes within the priority A length-7 barcodes. As each individual length-4 barcode could map to multiple full length-7 barcodes in the library, an assignment step was needed to expand length-4 barcodes to unique length-7 barcodes. A base-balanced barcode set was also sought. To generate the final barcode set, sets of length-7 barcodes were generated from the length-4 barcodes by randomly selecting one of the matching length-7 barcodes for each length-4 barcode. These barcode sets were tested for base balance, and the best one was selected, defining a library of 47 priority 1 barcodes.
The remaining length-4 barcodes (that did not map to priority A length-7 in the prior step) were mapped to full length-7 barcodes, forming priority 2.
Priority 3 was defined by the remaining priority A length-7 barcodes, and priority 4 by all remaining barcodes.
[Option 2] Generating a Custom Barcode Library
In Option 2A, the DNABarcodes R package is used to generate a barcode library. A user can also devise their own way to generate the barcode library, noted in Option 2B.
Then, run the following R script (in the GitHub as /utilities/generate_barcode_library.R), which writes the barcode library to the file barcode_set.csv. This example script generates a barcode library with ~330 barcodes of length 7 and minimum Hamming distance 3.
The bcLength parameter is the barcode length and minHammingDist is the minimum Hamming distance. The parameters popSize and numIter are parameters for the evolutionary algorithm used to generate the barcode library; for these values, the software will take ~40-80 minutes to run on a typical laptop. The number of cores can be set appropriately to your system. The population size and number of iterations can be decreased to speed up runtime, but the library size may be slightly smaller.
The output of the create.dnabarcodes function is in nucleotides. This is converted to numerical values by substituting 0 for A, 1 for C, etc. A single quote (') is prepended to every entry to avoid Excel interpreting the csv file as numerical data, which would strip leading zeros.
The function create.dnabarcodes is a stochastic evolutionary algorithm, meaning that the output will be different each time.
[Option 2B] User-Generated Approach
There are other approaches that a user could devise that may suit their particular application better.
Assignment of Barcodes to Transcripts
Assignment of Barcodes to Transcripts
Overview of Barcode Assignment to Transcripts
In the previous steps, genes of interest were selected, and a barcode library was generated. In this step, approaches for assigning barcodes to specific transcripts are suggested.
To minimize optical crowding and for better base-calling performance in the analysis of the in situ sequencing data, it is advantageous to distribute the amplicons across all color channels and rounds roughly evenly.
This can be done in absence of any expression data, but can be improved if single-cell (or bulk) expression data is available.
[Option 1] Random Assignment of Barcodes to Transcripts
If no expression data is available, transcripts can be randomly assigned to barcodes.
[Option 2] Barcode Assignment Using scRNA-Seq
If single-cell RNA sequencing data is available, clustering into cell types can be performed, and mean expression profiles of the genes of interest for each cell type can be computed. This results in a in a cell types by genes of interest matrix.
This matrix can potentially be used to guide the barcode assignment. One simple approach is to generate candidate barcode assignments, and score them for color balance across rounds using the cell type profiles matrix.
For a particular barcode assignment, a 3D cost matrix can be computed, where each entry in the matrix represents the total expression in a particular color channel, in a sequencing round, for a particular cell type. The size of the cost matrix is the number of rounds x number of colors x cell types. One possible metric to summarize the cost matrix is the L2 (Euclidian) norm of the cost matrix, though other summary indices are possible. To generate the barcode assignment, a large number of potential barcode assignments can be randomly generated, and the one with the lowest total cost selected.
This can also be done with bulk RNA-Seq if that data is available.
Compilation of Gene/Barcode List
Compilation of Gene/Barcode List
Compilation of Gene/Barcode List
Tabulate the results from the previous steps into the genelist Excel file.
The columns are gene symbol, RefSeq accession number, minimum Tm threshold (see note describing this below), and barcode.
It is critical that the gene symbol match the symbol in RefSeq (including case). For example, human gene symbols should be in all caps (i.e. GAPDH), while mouse gene symbols should have the first letter capitalized (i.e. Gapdh). Certain genes, such as KIT in humans, are commonly referred to by a modified name such as c-KIT. However, the symbol in the genelist must be KIT, since that is the entry in RefSeq.
An example spreadsheet for the worked example (see step 17, example_genelist.xlsx in GitHub) is shown in Table 1. No header line is included in this spreadsheet.
A
B
C
D
AKT1
NM_001014432.1
55
'3212010
CDK7
NM_001799.4
53
'2102033
KRT17
NM_000422.3
52
'0221001
Table 1. Example genelist spreadsheet (example_genelist.xlsx)
Note again the prepended quote in front of the barcode sequence; this prevents Excel from interpreting '0221001 as the number 221001.
The minimum Tm threshold is the minimum melting temperature allowed for each individual homology arm of the padlock probe. This is used to screen potential homology regions in a subsequent step.
Overview of Padlock Design
Overview of Padlock Design
Padlock Probe Architecture
The overall design of the padlock probe is shown in Fig. 1A.
The generation of the homology regions (corresponding to H1 and H2 in Fig. 1A) is described in steps 9-10.
The generation of the SOLiD and Illumina barcode sequences (in Fig. 1A) is described in steps 11-14.
Under the standard probe design process, the RCA/Seq primer site is the fixed sequence 'TCTCGGGAACGCTGAAGACGGC', and can be optionally modified in Step 13.
Generation of Homology Regions
Generation of Homology Regions
Overall Approach to Generating Homology Regions
In this step, the homology regions used in the padlock probes are generated. These regions are H1 and H2 in Fig. 1A.
This is implemented in the MATLAB script Targeted_ExSeq_ProbeGeneration_Homology.m.
For each transcript, a sliding window search for acceptable homology regions is performed. Candidate regions of the transcript are screened for acceptability. If a region is acceptable, the search window advances past the end of the accepted region; if the region is unacceptable, the window is shifted by one nucleotide and the process is repeated. Acceptable regions are collected and written out to a file in the output directory, with one file (named by the accession number) per gene of interest.
Criterion for exclusion of a homology region are any of the following:
Containing a single nucleotide repeat that is five bases or longer (i.e. 'AAAAA', 'CCCCC', 'GGGGG', 'TTTTT').
Containing three or fewer different nucleotides (i.e. A, C, G bases only).
GC content of either half of the homology region (corresponding to the two homology arms of the padlock probe at the 5' and 3' ends of the probe) outside of the specified range (see below for parameter specification).
Tm of either half of the homology region below the Tm threshold (defined on a per-gene basis in the genelist)
Tm difference between the two halves of the homology region outside of the maximum tolerated difference (see below for parameter specification).
Presence of secondary structure in the homology region: hairpins, defined by a minimum hairpin base length of 7 and loop length of 6; and dimers, defined with a minimum length of 10.
A BLAST hit (with E value below given threshold) to a different transcript that has more matching bases than a given threshold, and that is well-centered with greater than or equal to a given number of bases of the hit on both sides of the ligation junction (see below for parameter specification).
To run the script, specify the parameters that require user input (and optionally those that can be varied), and run the MATLAB script. Output is saved in a directory (specified below), with one file per gene, named by the accession number. The format of each output file is a FASTA file, with the headers containing the probe number, position in transcript, and minimum Tm threshold for the gene, with the sequences being the reverse complement of the segments of the transcript that pass all exclusion criteria. The MATLAB console displays search progress. The estimated time to generate homology regions for each gene is 1-10 minutes on a typical laptop computer, though some genes may require a significantly longer search time.
For further technical details, consult inline comments in script (and in the called functions screen_homology_complexity.m, screen_homology_physical.m, screen_homology_blast.m).
Note: when running the script, the file Temp.fa, which containing the sequence of the current homology region being BLASTed, is repeatedly saved and deleted from the working directory. The script deletes Temp.fa at the start of execution (in case a prior run was terminated early, without deleting Temp.fa). However, if the BLAST search did not stop when the script was terminated, the BLAST program needs to be terminated before manual deletion of Temp.fa. If BLAST errors are encountered, the first debugging step should be to verify that Temp.fa was deleted.
Parameters Requiring User Input
There are four parameters that require user input.
genelist_spreadsheet, the location/filename of the genelist.
genbank_sequences, the location/filename of the GenBank format file downloaded in Step 2.1.
refseq_database, the location/filename of the .fna file generated previously. This filename should include the .fna extension.
homology_regions_output, the name of the directory (to be created by the script) where homology regions will be saved.
Optional Parameters to Set
There are eight parameters that can optionally be changed. They are listed below, along with their default values. The parameters are all stored within the params struct.
Most users will not need to change these parameters, except as described in Step 9.4. These parameters are made available primarily for advanced users.
HOMOLOGY_LENGTH = 32, the total length of the homology region queried, corresponding to a padlock probe with two homology arms of length HOMOLOGY_LENGTH/2.
GC_RANGE = [40, 65], the minimum and maximum acceptable GC content for the individual homology arms.
MAX_TM_SEPARATION = 8, the maximum acceptable difference in the melting temperatures of the individual homology arms.
E_VALUE = 0.05, the maximum acceptable E value for BLAST queries to be returned for potential query sequence exclusion evaluation.
SPACER_LENGTH = 0, the minimum separation between adjacent homology regions.
BLAST_HOMOLOGY_MIN_LENGTH_CUTOFF = 13, the minimum number of matching bases of a BLAST hit for potential query sequence exclusion.
BLAST_HOMOLOGY_LIGATION_JUNCTION_MIN_OVERLAP_CUTOFF = 4, the minimum overlap of the BLAST hit on both sides of the ligation junction for potential query sequence exclusion.
REPORT_IF_BELOW_NUMBER_OF_PROBES = 9, the threshold for reporting additional information if too few homology regions are identified.
Example Output
The first several lines of output are shown for the first gene (AKT1) in the worked example (see Step 17).
NM_001014432.fa
>probe_1_index_264_MinTm_55
tttgacccaggctggctcggccttccctaagc
>probe_2_index_317_MinTm_55
cctctgatgcaccagctgacaggctgcctcct
>probe_3_index_425_MinTm_55
caaccctccttcacaatagccacgtcgctcat
...
Typical Use and Modifying Parameters to Increase Number of Homology Regions
In typical usage, the script will be run once, with the same high Tm threshold for all genes to generate draft candidate regions for all genes. For specific genes of interest where too few homology regions were generated, a new genelist will be constructed containing solely those genes requiring further optimization, and the script will be rerun with modified parameters. This process is iterated until enough candidate regions have been generated for all genes in the original genelist.
Several parameters can be modified to increase the number of binding regions. These include:
Adjusting the minimum Tm for individual genes: a decrease from 55 to 53 or 51 can potentially identify more homology regions.
Increasing the maximum acceptable Tm range (MAX_TM_SEPARATION) can also potentially identify more homology regions. Together with (1), these two parameters are the most commonly modified parameters.
Increasing the range of acceptable GC content (GC_RANGE).
Decreasing the E value threshold (E_VALUE), can allow for more non-specific (lower E value) hits to be tolerated.
On adjusting the E value threshold: the E value threshold is the maximum E value for hits returned by BLAST. Hits with higher E values are ignored for further consideration for elimination of the candidate binding region. Lowering the E value threshold allows for more non-specific (E values above the threshold) hits to be tolerated. For example, consider a candidate sequence with two hits, one self-hit in the original gene of interest with E value=1e-8, and an off-target partial homology match with E value=0.002. If the E value threshold is set to E_VALUE=0.05, the off-target hit will be considered for further exclusion criteria. However, if the E value is lowered to E_VALUE=0.001, the off-target hit will no longer be considered for further exclusion, and the sequence may be accepted as a homology sequence.
[Optional] Validating Homology Regions
This is an optional step to validate the homology regions identified in the previous steps. This validation step consists of re-BLASTing the homology regions primarily to confirm that there are no off-target hits resulting from the process by which self-hits and pseudogene hits are tolerated.
When searching for homology regions in a gene, hits in splice variants, pseudogenes, and non-coding variants of that transcript should be tolerated. This is implemented by searching the gene names of hits to ensure they start with the gene symbol of the gene of interest. For example, a hit in the KRT17 pseudogene KRT17P1 would be tolerated, as the gene symbol "KRT17P1" begins with "KRT17".
However, this may potentially create artifacts. As an example, when designing probes for CD4, there may potentially be a hit in CD44. Since "CD44" begins with "CD4", the hit would be tolerated, and the homology region could be accepted. The output of this script, with manual review, corrects for this potential issue.
To identify these events and exclude them if necessary, the previously found homology regions are BLASTed again under the same search criteria used previously, except without gene symbol/accession number filtering. Hits are collected, and the results are saved to a spreadsheet for manual verification. If problematic sequences are identified, they can be individually removed from the FASTA files containing the homology regions.
The script for this step is Targeted_ExSeq_ProbeGeneration_Validate_Homology_Regions.m.
Parameters Requiring User Input
There are four parameters that require user input.
genelist_spreadsheet, the location/filename of the genelist.
refseq_database, the location/filename of the .fna file generated previously. This filename should include the .fna extension.
homology_regions_input, the name of the directory (created by the previous script) where homology regions are saved.
output_filename, the filename of the spreadsheet where the summary of the homology region validation will be saved.
Optional Parameters to Set
There are four parameters that can optionally be changed. They are the same as the parameters defined previously in Step 9.2. They are listed below, along with their default values. The parameters are all contained within the params struct.
These values should be set to the same values used for the prior search.
The output spreadsheet is saved to the file indicated by the parameter output_filename. This spreadsheet contains one row for every homology region generated by the previous script. The first two columns of the spreadsheet list the gene symbol and the homology region sequence. The third column onwards lists the gene symbols for the BLAST hits that pass filtering. Self-hits are included here, as no gene symbol filtering is performed.
An example excerpt of the output for Human KRT17 from the worked example (see Step 17) is shown in Table 2. There are hits in KRT17 pseudogenes for all of the homology sequences identified, which are tolerated for this application.