Aug 31, 2022

Proteoform Identification and Quantitation with TopPIC and TDPortal for Human Tissues

  • 1Pacific Northwest National Laboratory;
  • 2Pacific Northwest National Laboratory
Protocol Citation: James M Fulcher, Yen-Chen Liao, Mowei Zhou, Ljiljana Paša-Tolić 2022. Proteoform Identification and Quantitation with TopPIC and TDPortal for Human Tissues. protocols.io https://dx.doi.org/10.17504/protocols.io.3byl4bpj2vo5/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: In development
We are still developing and optimizing this protocol
Created: August 02, 2022
Last Modified: August 31, 2022
Protocol Integer ID: 68089
Funders Acknowledgement:
National Institutes of Health (NIH) Common Fund, Human Biomolecular Atlas Program (HuBMAP)
Grant ID: UG3CA256959-01
Abstract
This protocol describes a workflow for top-down proteomics analysis. Top-down proteomics data are processed with two separate software packages, TopPIC and TDPortal. Proteoform identifications from the two packages are merged under a unified FDR to increase coverage. Separately, TopPICR is used to cluster TopPIC proteoforms and extract abundances for label-free quantitation.
TopPIC Processing
Convert instrument raw data to mzML using MSConvert.
Software: MSConvert
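The conversion step can be scripted. The following is a minimal sketch, not the authors' exact invocation: it only builds the msconvert argument list for one file. The vendor peak-picking filter is an assumption (chosen because TopFD is later run with centroid data), and the file and folder names are hypothetical.

```python
from pathlib import Path


def build_msconvert_command(raw_file, out_dir, centroid=True):
    """Return the msconvert argument list for one raw file (nothing is executed)."""
    cmd = ["msconvert", str(raw_file), "--mzML", "-o", str(out_dir)]
    if centroid:
        # Assumption: vendor peak picking on all MS levels to obtain centroid spectra.
        cmd += ["--filter", "peakPicking vendor msLevel=1-"]
    return cmd


# Hypothetical file and output folder names, for illustration only.
example = build_msconvert_command(Path("sample.raw"), Path("mzML"))
print(" ".join(example))
```

To actually run the conversion, the returned list can be passed to subprocess.run(cmd, check=True) on a machine where msconvert is installed and on the PATH.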

Analyze mzML files using the TopPIC Suite (version 1.4.13.1).
Software: TopPIC Suite (developer: Xiaowen Liu)

TopFD Parameters---------------------------------
Spectral data type: Centroid
Maximum charge: 30
Maximum monoisotopic mass: 50000 Dalton
Peak error tolerance: 0.02 m/z
MS1 signal/noise ratio: 3
MS/MS signal/noise ratio: 1
Thread number: 10
Precursor window size: 2 m/z
Use Env CNN model: No
Miss MS1 spectra: No
Generate Html files: Yes
Do final filtering: Yes
TopPIC 1.4.13 Parameters----------------------------------
********************** Parameters **********************
Protein database file: ID_008032_8627C6BD.fasta.zip
Spectrum file: xxxxxxxxxxxxxxxxx_ms2.msalign
Number of combined spectra: 1
Fragmentation method: FILE
Search type: TARGET
Fixed modifications: None
Use TopFD feature file: True
Maximum number of unexpected modifications: 1
Error tolerance for matching masses: 15 ppm
Error tolerance for identifying PrSM clusters: 0.8 Da
Spectrum-level cutoff type: EVALUE
Spectrum-level cutoff value: 0.05
Proteoform-level cutoff type: EVALUE
Proteoform-level cutoff value: 0.05
Allowed N-terminal forms: NONE,NME,NME_ACETYLATION,M_ACETYLATION
Maximum mass shift of modifications: 275 Da
Minimum mass shift of modifications: -150 Da
Thread number: 14
E-value computation: Generating function
Common modification file name: Dynamic_mods.txt
MIScore threshold: 0.15
Executable file directory:
Version: 1.4.13
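When many datasets are processed, it can help to confirm that each run used the settings listed above. The sketch below is a hypothetical helper, not part of TopPIC: it parses a "key: value" parameter report like the one shown and reports any values that differ from an expected set.

```python
def parse_parameter_report(text):
    """Parse 'key: value' lines of a parameter report into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, banner lines ("**** Parameters ****") and lines without a separator.
        if not line or line.startswith("*") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()
    return params


def find_mismatches(params, expected):
    """Return {key: (expected, found)} for every expected setting that differs."""
    return {k: (v, params.get(k)) for k, v in expected.items() if params.get(k) != v}


# A few of the settings listed in this protocol, used as the expected values.
EXPECTED = {
    "Maximum number of unexpected modifications": "1",
    "Error tolerance for matching masses": "15 ppm",
    "Spectrum-level cutoff value": "0.05",
}
```

An empty dict returned by find_mismatches(parse_parameter_report(report_text), EXPECTED) indicates that the checked settings match.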


Note
The protein FASTA file contains the human proteome from UniProt, including both Swiss-Prot and TrEMBL sequences, with decoy sequences added. Unzip the attachment before use.

TopPIC outputs proteoform spectrum matches (PrSMs) as tab-separated files (...toppic_prsm.tsv) and quantification data within MS1 feature files (..._ms1.feature). These are both imported into the R environment for post-processing with TopPICR.
TopPICR is used for post-processing to improve proteoform identification and quantification. All functions are documented within the TopPICR R package.
Software: TopPICR (developer: Evan Martin)

First, result files are read into R using the read_toppic(file_path = path, file_name = names) function in TopPICR, where "path" is the path to the directory containing the TopPIC PrSM files and "names" is a character vector specifying the PrSM files to import. This function can also be used to import the MS1 feature files into a separate object.
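For readers who want to inspect a PrSM file outside R, the sketch below reads one in plain Python. It is not a replacement for read_toppic(). It assumes the file begins with a parameter block delimited by lines of asterisks, followed by a tab-separated table with a header row; the column names in the example are illustrative only.

```python
import csv
import io


def read_prsm_table(handle):
    """Read a TopPIC-style PrSM TSV, skipping the leading parameter block."""
    lines = handle.read().splitlines()
    # Assumption: banner lines start with '*'; the table begins after the last one.
    banners = [i for i, line in enumerate(lines) if line.startswith("*")]
    start = banners[-1] + 1 if banners else 0
    body = [line for line in lines[start:] if line.strip()]
    return list(csv.DictReader(body, delimiter="\t"))


# Illustrative content only; real files contain many more columns.
demo = io.StringIO(
    "********************** Parameters **********************\n"
    "Search type: TARGET\n"
    "********************** Parameters **********************\n"
    "Data file name\tScan(s)\tE-value\n"
    "run1_ms2.msalign\t101\t1e-12\n"
)
rows = read_prsm_table(demo)
```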
Next, the data are further processed with the augment_annotation() and rm_false_gene() functions to account for ambiguity in proteoform identifications.
False discovery rate (FDR) filtering is accomplished by finding the appropriate E-value cutoff to filter the results to 1% FDR at the isoform and protein level. This is provided by the find_evalue_cutoff() and apply_evalue_cutoff() functions.
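The idea behind choosing an E-value cutoff for a target FDR can be illustrated with a generic target-decoy calculation. This is a sketch of the general technique, not the TopPICR implementation: the function names are hypothetical, and FDR is estimated simply as decoys divided by targets among matches at or below the cutoff.

```python
def estimate_evalue_cutoff(matches, max_fdr=0.01):
    """matches: iterable of (evalue, is_decoy) pairs.

    Returns the largest E-value at which decoys / targets <= max_fdr,
    or None if no cutoff satisfies the threshold.
    """
    targets = decoys = 0
    best = None
    for evalue, is_decoy in sorted(matches, key=lambda m: m[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = evalue
    return best


def apply_cutoff(matches, cutoff):
    """Keep only matches with an E-value at or below the cutoff."""
    return [m for m in matches if m[0] <= cutoff]
```

In this workflow the cutoff is determined separately at each level named above, so a calculation of this kind would be repeated per level.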
Proteoform inference is performed with the infer_pf() function, and the proteoform level is determined with the set_pf_level() function.
CITATION
Smith LM, Thomas PM, Shortreed MR, Schaffer LV, Fellers RT, LeDuc RD, Tucholski T, Ge Y, Agar JN, Anderson LC, Chamot-Rooke J, Gault J, Loo JA, Paša-Tolić L, Robinson CV, Schlüter H, Tsybin YO, Vilaseca M, Vizcaíno JA, Danis PO, Kelleher NL (2019). A five-level classification system for proteoform identifications. Nature Methods.

Retention time alignment is performed with the form_model() and align_rt() functions.
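As an illustration of what retention time alignment does, the sketch below fits a model that maps retention times in one run onto a reference run, then applies it. The protocol does not describe the model used by TopPICR; an ordinary least-squares line is used here purely as a simple stand-in, and the function names are hypothetical.

```python
def fit_linear_rt_model(run_rt, ref_rt):
    """Fit ref_rt ~ slope * run_rt + intercept by least squares.

    run_rt and ref_rt are retention times of the same proteoforms
    observed in the run to be aligned and in the reference run.
    """
    n = len(run_rt)
    mean_x = sum(run_rt) / n
    mean_y = sum(ref_rt) / n
    sxx = sum((x - mean_x) ** 2 for x in run_rt)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(run_rt, ref_rt))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def align_rt_linear(rt_values, model):
    """Map retention times onto the reference scale using a fitted model."""
    slope, intercept = model
    return [slope * rt + intercept for rt in rt_values]
```

Real chromatographic drift is often nonlinear, so a smoother model may be more appropriate in practice.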
Mass calibration is accomplished with the calc_error() and recalibrate_mass() functions.
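Mass recalibration can be pictured as estimating a systematic mass error and removing it. The sketch below assumes the error is a constant ppm offset estimated as the median over identified proteoforms; this is a simplification for illustration and not necessarily how TopPICR computes it. For scale, the 15 ppm matching tolerance used in the TopPIC search corresponds to 0.3 Da at 20,000 Da.

```python
from statistics import median


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def recalibrate(observed_masses, theoretical_masses):
    """Return (median ppm error, masses corrected by that error)."""
    shift = median(
        ppm_error(o, t) for o, t in zip(observed_masses, theoretical_masses)
    )
    # Divide out the systematic offset: observed = true * (1 + shift * 1e-6).
    corrected = [o / (1 + shift * 1e-6) for o in observed_masses]
    return shift, corrected
```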
Clustering and deisotoping error correction is performed with the cluster() and create_pcg() functions.
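Deisotoping errors typically show up as masses that are off by about one isotopic spacing (roughly 1.00335 Da). The sketch below shows one simple way to group such masses and correct the offset. It is a greedy, mass-only simplification with hypothetical tolerances. It ignores retention time and is not the algorithm used by cluster() or create_pcg().

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, in Da


def isotope_offset(mass, reference, tol_da=0.05, max_offset=1):
    """Return n if mass is within tol_da of reference + n * spacing, else None."""
    for n in range(-max_offset, max_offset + 1):
        if abs(mass - (reference + n * ISOTOPE_SPACING)) <= tol_da:
            return n
    return None


def cluster_by_mass(masses, tol_da=0.05):
    """Greedily group masses, correcting off-by-one isotope errors.

    Returns a list of (reference_mass, [corrected member masses]).
    """
    clusters = []
    for mass in sorted(masses):
        for reference, members in clusters:
            n = isotope_offset(mass, reference, tol_da)
            if n is not None:
                # Remove the isotopic offset so members share one mass scale.
                members.append(mass - n * ISOTOPE_SPACING)
                break
        else:
            clusters.append((mass, [mass]))
    return clusters
```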
Metadata for each proteoform cluster is generated with the create_mdata() function.
Steps 4.5 and 4.6 (retention time alignment and mass calibration) are also applied to the MS1 feature files. Features are then matched and combined (for match-between-runs, MBR) with the match_features() and combine_features() functions.
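Matching MS1 features to identified proteoform clusters can be sketched as a search within mass and retention time tolerances. The code below is an illustration only: the tolerances, dictionary keys, and function names are hypothetical, and the matching rules in TopPICR may differ.

```python
def match_features_sketch(features, clusters, ppm_tol=10.0, rt_tol=120.0):
    """Assign each feature to the closest-mass cluster within tolerance.

    features: dicts with 'mass', 'rt', 'intensity'.
    clusters: dicts with 'id', 'mass', 'rt'.
    Unmatched features are dropped.
    """
    matched = []
    for feature in features:
        best = None
        for cluster in clusters:
            ppm = abs(feature["mass"] - cluster["mass"]) / cluster["mass"] * 1e6
            drt = abs(feature["rt"] - cluster["rt"])
            if ppm <= ppm_tol and drt <= rt_tol:
                if best is None or ppm < best[0]:
                    best = (ppm, cluster["id"])
        if best is not None:
            matched.append({**feature, "cluster": best[1]})
    return matched


def sum_intensity_by_cluster(matched):
    """Total feature intensity per cluster id."""
    totals = {}
    for feature in matched:
        totals[feature["cluster"]] = totals.get(feature["cluster"], 0.0) + feature["intensity"]
    return totals
```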
The final tables of proteoform identification and quantitation results from TopPIC Suite and TopPICR are exported as comma-separated value (.csv) files.
TDPortal Processing
Request TDPortal access and follow their instructions to set up an account.
Software: TDPortal (developer: Northwestern University)
TDPortal search process
6.1 Upload data
6.2 Search on TDPortal
Note
TDPortal has an option for label-free quantitation, but it is not used in this workflow.

Upload data
  1. Connect to Northwestern through VPN. (https://kb.northwestern.edu/page.php?id=94726)
  2. Copy the files to your user folder (e.g., \\resfiles.northwestern.edu\NU-PCEDATA\external_users\XXXXX).
  3. The system will ask you to log in. Use "ads\your id" with your password to log into your folder.
  4. Create a sub-folder under your user folder for each search.
  5. Place the raw files in the corresponding sub-folder, and do not create further folders inside it. (https://kb.northwestern.edu/page.php?id=70525).
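Before copying a search sub-folder, its layout can be checked locally. The sketch below is a hypothetical helper: it flags nested folders and any file that does not have the expected raw-file extension. The ".raw" extension is an assumption and should be adjusted to the instrument's actual format.

```python
from pathlib import Path


def check_search_folder(folder, raw_suffix=".raw"):
    """Return a sorted list of layout problems found in a search sub-folder."""
    problems = []
    for entry in Path(folder).iterdir():
        if entry.is_dir():
            # Nested folders are not allowed inside a search sub-folder.
            problems.append(f"nested folder: {entry.name}")
        elif entry.suffix.lower() != raw_suffix:
            problems.append(f"unexpected file: {entry.name}")
    return sorted(problems)
```

An empty list means the folder contains only raw files and no nested folders.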
Search on TDPortal (https://portal.nrtdp.northwestern.edu/static/TDPortalSOP_043_20180301.pdf)
  1. Log in to TDPortal with your email address and password.
  2. Each sub-folder name appears as a dataset.
  3. Under the selected dataset, add the files to "Input files".
  4. Select organism "human".
  5. Set parameters as follows:
User empirical P-score: False
Filter by FDR: True
Create SAS input sheet for quant: False (select True only when quantitation output is needed).
Precursor resolution: High resolution
Fragmentation Type: Auto (or the fragmentation type used during MS acquisition).
Code set: Standard 4.0.0
Include ProSight Error Tolerance Search: False (select True to allow one unknown mass shift in the proteoform).
Exporting TDPortal results
Software: TDViewer (developer: Northwestern University)
  1. Download the *.tdReport file. Note: two separate processes may be created in the queue, one producing the ID results in the TDReport and another producing the CSV file for quantitation (if enabled).
  2. Click the download icon to download these files.
  3. Open the TDReport in TDViewer 2.0 (http://tdviewer2.northwestern.edu/).
  4. Review and export the proteoform ID results from TDViewer with a 1% FDR cutoff.
Combining Results
Results (proteoform spectral matches) from TopPIC and TDPortal are then merged using a function written in R that is openly available on GitHub. The input proteoform tables from each software package were pre-filtered with an FDR cutoff of 1% (adjusted FDR in TopPICR for TopPIC, and the default FDR in TDPortal).
Software: TDPortal_TopPIC_Join (developer: James M Fulcher)
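The general shape of such a merge is a full outer join of the two result tables on a shared spectrum key. The sketch below illustrates that idea only and does not reproduce the logic of the GitHub function, which is not described in this protocol. The key columns ("dataset", "scan") and the function name are hypothetical.

```python
def merge_prsms(toppic_rows, tdportal_rows, key=("dataset", "scan")):
    """Full outer join of two lists of dict rows on the given key columns.

    Each merged entry records which tool(s) reported the spectrum and
    keeps the original row from each tool under the tool's name.
    """
    merged = {}
    for source, rows in (("TopPIC", toppic_rows), ("TDPortal", tdportal_rows)):
        for row in rows:
            k = tuple(row[name] for name in key)
            entry = merged.setdefault(k, {"key": k, "sources": []})
            entry["sources"].append(source)
            entry[source] = row
    return list(merged.values())
```

Entries whose "sources" list contains both tools are spectra identified by both searches; entries with a single source are the identifications that increase coverage.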

Final output
Results for proteoform spectral matches (merged from TopPIC and TDPortal) and proteoform quantitation (TopPICR) are uploaded to HIVE.
Citations
Step 4.4
Smith LM, Thomas PM, Shortreed MR, Schaffer LV, Fellers RT, LeDuc RD, Tucholski T, Ge Y, Agar JN, Anderson LC, Chamot-Rooke J, Gault J, Loo JA, Paša-Tolić L, Robinson CV, Schlüter H, Tsybin YO, Vilaseca M, Vizcaíno JA, Danis PO, Kelleher NL. A five-level classification system for proteoform identifications.
https://doi.org/10.1038/s41592-019-0573-x