License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Comparing DNA or protein sequences can provide insight into the structure or function of newly discovered or characterized open reading frames or proteins. Amino acid sequences are often conserved between species or between proteins with similar structural or functional properties. A conserved sequence is one that is identical across several species. A sequence can also be similar rather than identical: the amino acids differ, but they have similar properties. Examples of similar substitutions are a D --> E change (both acidic) or an L --> I change (both hydrophobic).
The video below provides a brief overview of the process:
Materials
Microsoft Word or a similar text editing software
Internet connection
A target sequence in FASTA format, a PDB structure, or a UniProt ID
Do you have a sequence already?
If you are getting the sequence from UniProt, go to Step 2.
If you are extracting the sequence from a PDB file, go to Step 3.
If you already have a sequence, skip to Step 4.
Obtain Sequence from UniProt
Obtain the sequence from UniProt OR extract the sequence from a structure in YASARA (skip to Step 3).
Click on the link above and locate your protein on UniProt by using the search bar. In this procedure, we will be using the UBA5 protein found in humans as an example protein.
Your search results will include proteins from multiple organisms. Be sure to choose the protein from the correct organism. In this case, we are interested in UBA5 in humans. If your protein is found in a plant such as Arabidopsis, be sure the organism column says so.
You should be able to locate the available sequences of your protein by looking at the options on the left and clicking the 'Sequences' tab, or by scrolling down until you find the sequences. Sometimes several isoforms of the protein are listed. If so, be sure to research which isoform you are interested in. Press the 'FASTA' download button shown above the sequence.
Once you press the FASTA download button, a page like the one shown below should come up. Copy the ENTIRE text, starting at the > character. This text is to be pasted into NCBI protein BLAST (blastp), described in the next steps.
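To make clear why the line starting with > has to be copied along with the sequence, here is a minimal Python sketch of how FASTA text is typically read: the > line is the header that names the record, and every following line up to the next > is sequence. The function name `parse_fasta` and the toy record are assumptions made for illustration; the record is not a real UniProt entry.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are only kept once a header has been seen.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record for illustration only (hypothetical, not a real UniProt entry).
example = """>toy|P00000|EXAMPLE_TOY hypothetical protein
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example)

# Text copied without the '>' line has no header, so no record is recognized.
headerless = parse_fasta("MKTAYIAKQR\nQISFVKSHFS\n")
```

In this sketch, sequence text that is missing its header line is silently dropped, which illustrates why the copy should start at the > character.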
Obtain Sequence from a PDB file
Extracting sequence from object on YASARA
Open your structure of choice in YASARA. Once the object is loaded, click the 'Analyze' tab at the top of the page, then 'Sequence of', then 'Object'.
When the following window pops up, be sure to select the correct object. Each object is listed with its PDB ID. The 6H77 selected here is the PDB ID for the model of UBA5 being analyzed in this example.
The sequences should appear in the command bar at the bottom. Highlight your sequence; it will turn red once highlighted. Press 'Ctrl+C' (PC) or 'Command+C' (Mac) to copy the sequence.
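As an alternative to the YASARA menus, the one-letter sequence can also be read from the SEQRES records of a PDB-format text file. The Python sketch below is not the YASARA method described above. It assumes every SEQRES line carries a non-blank chain ID, so that whitespace splitting yields the record name, serial number, chain ID, residue count, and then the residue names. The function name `seqres_to_sequences` and the toy records are hypothetical.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequences(pdb_text):
    """Map each chain ID to its one-letter sequence using SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES"):
            continue
        tokens = line.split()
        if len(tokens) < 5:
            continue  # malformed or empty record
        chain_id = tokens[2]
        residues = tokens[4:]
        # Residues outside the table (e.g. modified residues) become 'X'.
        letters = "".join(THREE_TO_ONE.get(name, "X") for name in residues)
        chains[chain_id] = chains.get(chain_id, "") + letters
    return chains


# Toy SEQRES records for illustration only (not taken from a real entry).
toy_pdb = """HEADER    TOY EXAMPLE
SEQRES   1 A    7  MET ASP GLU LEU ILE
SEQRES   2 A    7  GLY LYS
SEQRES   1 B    2  ALA MSE
"""
sequences = seqres_to_sequences(toy_pdb)
```

Note that SEQRES lists the full construct that was studied, which may include residues that are not resolved in the 3D model, so the result can differ from a sequence derived from the modeled atoms.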
Paste the sequence you've copied from either UniProt or YASARA into the box under the 'Enter Query Sequence' tab.
Sometimes it is useful to narrow your search by restricting the database or excluding organisms. THIS IS OPTIONAL!
The example below shows how to restrict the database to model organisms and exclude human sequences; however, this is not required for sequence alignment to work.
Select 'Model Organism (landmark)' in the dropdown menu under 'Choose Search Set'. Exclude Hominidae (taxid:9604) from the search. Be sure to check the exclude box!
Select 'blastp' (protein-protein BLAST) for the Program Selection.
Press 'BLAST'.
It may take a few minutes to a few hours before you see results.
Once the results have appeared, you can use the check boxes on the left to select sequences of your protein found in different species. Select hits that have a query cover of roughly 50% or above. Use proteins with less query coverage, or with less than 50% sequence identity, with caution.
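The selection rule above can be sketched in Python as a simple partition of hits into those that meet both 50% cutoffs and those to use with caution. The hit records, accession names, and field names below are hypothetical and are not a BLAST output format; the values would be read off the 'Query Cover' and 'Per. Ident' columns of the results table.

```python
def partition_hits(hits, min_cover=50.0, min_identity=50.0):
    """Split hits into (confident, cautious) lists using percentage cutoffs."""
    confident, cautious = [], []
    for hit in hits:
        meets_cover = hit["query_cover"] >= min_cover
        meets_identity = hit["percent_identity"] >= min_identity
        if meets_cover and meets_identity:
            confident.append(hit)
        else:
            cautious.append(hit)
    return confident, cautious


# Hypothetical hits for illustration only; all values are made up.
hits = [
    {"accession": "HYPO_0001", "query_cover": 98.0, "percent_identity": 87.5},
    {"accession": "HYPO_0002", "query_cover": 72.0, "percent_identity": 41.0},
    {"accession": "HYPO_0003", "query_cover": 35.0, "percent_identity": 90.0},
    {"accession": "HYPO_0004", "query_cover": 50.0, "percent_identity": 50.0},
]
confident, cautious = partition_hits(hits)
confident_ids = [hit["accession"] for hit in confident]
cautious_ids = [hit["accession"] for hit in cautious]
```

The cutoffs are keyword arguments because the protocol treats 50% as a rough guide rather than a strict limit.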
Once you have chosen your desired sequences, press 'Download' >> 'FASTA (Complete Sequence)'.
These sequences should appear in your downloads. They will open with Notepad for PC users. Copy all the text.
In BoxShade, select 'RTF_new' for 'Output format' and 'ALN' for 'Input sequence format'. Paste your Clustal Omega results into the box at the bottom.
Press 'Run BOXSHADE...'. It may take a couple of minutes.
When it is done, you will get an output window. Click the 'Output number 1' link provided. This will download the BoxShade file, which can be opened with Microsoft Word.
Analyze your results in Word.
Black highlight means perfect conservation
Grey highlight means an amino acid or a position with some conservation
No highlight means no conservation
A dash means a gap or missing amino acid in that sequence
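The shading key above can be illustrated with a small Python sketch that labels each column of an alignment. This is a simplified illustration, not BoxShade's actual algorithm. The similarity groups are an assumption chosen for the example, as is the choice to label any column containing a dash as a gap. The function names and the toy alignment are hypothetical, and sequences are assumed to be uppercase and of equal length.

```python
# Illustrative similarity groups (an assumption, not BoxShade's defaults).
SIMILARITY_GROUPS = [
    set("DE"),    # acidic
    set("KR"),    # basic
    set("ILVM"),  # aliphatic / hydrophobic
    set("ST"),    # small hydroxyl
    set("NQ"),    # amide
    set("FYW"),   # aromatic
]


def classify_column(column):
    """Label one alignment column: 'gap', 'identical', 'similar', or 'none'."""
    if "-" in column:
        return "gap"
    residues = set(column)
    if len(residues) == 1:
        return "identical"  # corresponds to black highlight
    if any(residues <= group for group in SIMILARITY_GROUPS):
        return "similar"    # corresponds to grey highlight
    return "none"           # corresponds to no highlight


def classify_alignment(aligned):
    """Classify every column of a list of equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return [classify_column(column) for column in zip(*aligned)]


# Toy alignment for illustration only.
aligned = ["MDLK-", "MEIRA", "MDLQA"]
labels = classify_alignment(aligned)
```

In the toy alignment, the second and third columns contain only D/E and L/I respectively, which are the similar substitutions mentioned in the introduction.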