Mar 29, 2022

idpr Workflow

Peer-reviewed method
  • William McFadden (1, 2*)
  • Judith Yanowitz (1, 3)
  • 1: Magee-Womens Research Institute
  • 2: Emory University (*current address of William McFadden)
  • 3: Dept of OBGYN and Reproductive Sciences, University of Pittsburgh School of Medicine
Protocol Citation: William McFadden, Judith Yanowitz 2022. idpr Workflow. protocols.io https://dx.doi.org/10.17504/protocols.io.b58gq9tw
Manuscript citation:
McFadden WM, Yanowitz JL (2022) idpr: A package for profiling and analyzing Intrinsically Disordered Proteins in R. PLOS ONE 17(4): e0266929. https://doi.org/10.1371/journal.pone.0266929
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it’s working
Created: March 11, 2022
Last Modified: March 29, 2022
Protocol Integer ID: 59368
Keywords: idpr Workflow, alpha-Synuclein Figures, p53 Figures
Funders Acknowledgement:
Judith Yanowitz
Grant ID: R01GM127569
Abstract
This protocol details the idpr workflow used to generate the alpha-Synuclein and p53 figures.
Guidelines
References
Paper Citations
Erdős, G., & Dosztányi, Z. (2020). Analyzing protein disorder with IUPred2A. Current Protocols in Bioinformatics.

Mészáros, B., Erdős, G., & Dosztányi, Z. (2018). IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Research.

Soudy M, Anwar AM, Ahmed EA, Osama A, Ezzeldin S, Mahgoub S, Magdeldin S (2020). UniprotR: Retrieving and visualizing protein sequence and functional information from Universal Protein Resource (UniProt knowledgebase). Journal of Proteomics.

UniProt Consortium (2014). UniProt: a hub for protein information. Nucleic Acids Research.

R / Package Citations

citation()

##
## To cite R in publications use:
##
## R Core Team (2021). R: A language and environment for statistical
## computing. R Foundation for Statistical Computing, Vienna, Austria.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2021},
## }
##
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also ’citation("pkgname")’ for
## citing R packages.

citation("idpr")

##
## To cite package ’idpr’ in publications use:
##
## William M. McFadden and Judith L. Yanowitz (2021). idpr: Profiling
## and Analyzing Intrinsically Disordered Proteins in R. R package
## version 1.2.0.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {idpr: Profiling and Analyzing Intrinsically Disordered Proteins in R},
## author = {William M. McFadden and Judith L. Yanowitz},
## year = {2021},
## note = {R package version 1.2.0},
## }

citation("Biostrings")

##
## To cite package ’Biostrings’ in publications use:
##
## H. Pagès, P. Aboyoun, R. Gentleman and S. DebRoy (2021). Biostrings:
## Efficient manipulation of biological strings. R package version
## 2.60.1.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {Biostrings: Efficient manipulation of biological strings},
## author = {H. Pagès and P. Aboyoun and R. Gentleman and S. DebRoy},
## year = {2021},
## note = {R package version 2.60.1},
## }
##
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## ’help("citation")’.

citation("ggplot2")
##
## To cite ggplot2 in publications, please use:
##
## H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
## Springer-Verlag New York, 2016.
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## author = {Hadley Wickham},
## title = {ggplot2: Elegant Graphics for Data Analysis},
## publisher = {Springer-Verlag New York},
## year = {2016},
## isbn = {978-3-319-24277-4},
## }

citation("UniprotR")

##
## To cite UniprotR in publications use:
##
## Soudy, M., Anwar, A.M., Ahmed, E.A., Osama, A., Ezzeldin, S.,
## Mahgoub, S. and Magdeldin, S., 2020. UniprotR: Retrieving and
## visualizing protein sequence and functional information from
## Universal Protein Resource (UniProt knowledgebase). Journal of
## Proteomics, 213, p.103613.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {UniprotR: Retrieving and visualizing protein sequence and functional information from Universal
## author = {Mohamed Soudy and Ali Mostafa Anwar and Eman Ali Ahmed and Aya Osama and Shahd Ezzeldin
## journal = {Journal of Proteomics},
## volume = {213},
## pages = {103613},
## year = {2020},
## issn = {1874-3919},
## doi = {10.1016/j.jprot.2019.103613},
## }


Additional Information:
Session Info
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Biostrings_2.60.1 GenomeInfoDb_1.28.1 XVector_0.32.0
## [4] IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
## [7] ggplot2_3.3.5 idpr_1.2.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 seqinr_4.2-8
## [3] ggsignif_0.6.2 ellipsis_0.3.2
## [5] rio_0.5.27 qdapRegex_0.7.2
## [7] GenomicRanges_1.44.0 fs_1.5.0
## [9] farver_2.1.0 ggpubr_0.4.0
## [11] alakazam_1.1.0 fansi_0.5.0
## [13] lubridate_1.7.10 xml2_1.3.2
## [15] knitr_1.33 ade4_1.7-17
## [17] jsonlite_1.7.2 Rsamtools_2.8.0
## [19] broom_0.7.9 dbplyr_2.1.1
## [21] data.tree_1.0.0 readr_2.0.1
## [23] compiler_4.1.0 httr_1.4.2
## [25] backports_1.2.1 assertthat_0.2.1
## [27] Matrix_1.3-4 lazyeval_0.2.2
## [29] htmltools_0.5.1.1 prettyunits_1.1.1
## [31] tools_4.1.0 igraph_1.2.6
## [33] gtable_0.3.0 glue_1.4.2
## [35] GenomeInfoDbData_1.2.6 dplyr_1.0.7
## [37] Rcpp_1.0.7 carData_3.0-4
## [39] Biobase_2.52.0 cellranger_1.1.0
## [41] vctrs_0.3.8 ape_5.5
## [43] nlme_3.1-152 xfun_0.25
## [45] stringr_1.4.0 networkD3_0.4
## [47] openxlsx_4.2.4 rvest_1.0.1
## [49] lifecycle_1.0.0 rstatix_0.7.0
## [51] zlibbioc_1.38.0 MASS_7.3-54
## [53] scales_1.1.1 airr_1.3.0
## [55] hms_1.1.0 MatrixGenerics_1.4.1
## [57] SummarizedExperiment_1.22.0 tidyverse_1.3.1
## [59] gprofiler2_0.2.0 yaml_2.2.1
## [61] curl_4.3.2 gridExtra_2.3
## [63] stringi_1.7.3 highr_0.9
## [65] zip_2.2.0 BiocParallel_1.26.1
## [67] rlang_0.4.11 pkgconfig_2.0.3
## [69] matrixStats_0.60.0 bitops_1.0-7
## [71] evaluate_0.14 lattice_0.20-44
## [73] purrr_0.3.4 UniprotR_2.0.8
## [75] labeling_0.4.2 GenomicAlignments_1.28.0
## [77] htmlwidgets_1.5.3 tidyselect_1.1.1
## [79] plyr_1.8.6 magrittr_2.0.1
## [81] R6_2.5.0 magick_2.7.2
## [83] generics_0.1.0 DelayedArray_0.18.0
## [85] DBI_1.1.1 pillar_1.6.2
## [87] haven_2.4.3 foreign_0.8-81
## [89] withr_2.4.2 abind_1.4-5
## [91] RCurl_1.98-1.4 tibble_3.1.3
## [93] modelr_0.1.8 crayon_1.4.1
## [95] car_3.0-11 utf8_1.2.2
## [97] plotly_4.9.4.1 tzdb_0.1.2
## [99] rmarkdown_2.10 progress_1.2.2
## [101] grid_4.1.0 readxl_1.3.1
## [103] data.table_1.14.0 forcats_0.5.1
## [105] reprex_2.0.1 digest_0.6.27
## [107] tidyr_1.1.3 munsell_0.5.0
## [109] viridisLite_0.4.0

Runtime
#--- End Runtime
end_time <- Sys.time()
time_diff <- end_time - start_time
time_diff

## Time difference of 41.00444 secs
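
The runtime snippet above relies on a start_time value recorded before the workflow begins; that line is not shown in this protocol. A minimal sketch of the full timing pattern follows, assuming start_time was captured with Sys.time() at the top of the script. The Sys.sleep() call is only a stand-in for the workflow itself.

```r
# --- Start Runtime (assumed to be the first line of the script)
start_time <- Sys.time()

# Placeholder for the workflow; replace with the real analysis
Sys.sleep(0.1)

# --- End Runtime
end_time <- Sys.time()
time_diff <- end_time - start_time  # a 'difftime' object
print(time_diff)
```

Printing time_diff reports the elapsed time in automatically chosen units, as in the output shown above.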

Installing idpr: Downloading the Current Release

idpr is published in Bioconductor where the stable, released version of the package can be downloaded. The development version, which may be unstable, is published on GitHub.
The package can be installed from Bioconductor with the following line of code. This requires the BiocManager package to be installed.

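# Install BiocManager if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Install the released version of idpr from Bioconductor
if (!requireNamespace("idpr", quietly = TRUE)) {
  BiocManager::install("idpr")
}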

The UniprotR package is used within this workflow to fetch the amino acid sequences for the proteins analyzed. idpr contains multiple ways to read in sequences, including from .fasta files via Biostrings. To avoid distributing additional files, we use UniprotR to fetch sequences from the UniProt database. To run this workflow, please install UniprotR. UniprotR is not a dependency of idpr, though this workflow shows how the packages can be used together.

# Install UniprotR from CRAN if it is not already available
if (!requireNamespace("UniprotR", quietly = TRUE)) {
  install.packages("UniprotR")
}


Installing idpr: Downloading the Development Version
The most recent version of the package can be installed with the following line of code. This requires the devtools package to be installed.

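# Install devtools if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install the development version from GitHub.
# Note: this step is skipped when any version of idpr is already installed.
if (!requireNamespace("idpr", quietly = TRUE)) {
  devtools::install_github("wmm27/idpr")
}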

Installing idpr: Loading idpr
Once installed, idpr can be loaded with the ‘library’ function.

library(idpr)

To test that the package is loaded, the idpr function ‘netCharge’ is used to determine the charge of glutamic acid (E) at pH 8. Since pH ≫ pKa, the charge of E should be near -1.

netCharge("E",
pH = 8,
includeTermini = FALSE)
## [1] -0.9997418
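
The value returned by ‘netCharge’ can be sanity-checked by hand with the Henderson-Hasselbalch relationship for an acidic side chain. The sketch below uses base R only. It assumes a glutamic acid side-chain pKa of 4.412 (taken here to be the value in the IPC_protein pKa set, which is an assumption), and acidic_charge is a hypothetical helper, not an idpr function.

```r
# Fractional charge of an acidic group: -1 / (1 + 10^(pKa - pH))
acidic_charge <- function(pKa, pH) {
  -1 / (1 + 10^(pKa - pH))
}

pKa_E <- 4.412  # assumed side-chain pKa for glutamic acid
charge_E <- acidic_charge(pKa = pKa_E, pH = 8)
print(round(charge_E, 4))
```

Under these assumptions the result is approximately -0.9997, consistent with the ‘netCharge’ output above.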

alpha-Synuclein Figures: Fetching the amino acid sequence
First, I will use the UniprotR package to get the alpha-synuclein amino acid sequence from the UniProtID.
For alpha-Synuclein, the ID is P37840.
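
The fetching code itself is not shown in this protocol. One way it could look is sketched below. fetch_uniprot_sequence is a hypothetical helper name, and the sketch assumes that UniprotR::GetSequences() returns a data frame with a Sequence column; an internet connection is required.

```r
# Hypothetical helper: fetch one sequence from UniProt by accession.
# Requires the UniprotR package and an internet connection.
fetch_uniprot_sequence <- function(accession) {
  if (!requireNamespace("UniprotR", quietly = TRUE)) {
    stop("Please install the UniprotR package first.")
  }
  # GetSequences() is assumed to return a data frame whose 'Sequence'
  # column holds the amino acid sequence as text.
  seq_df <- UniprotR::GetSequences(accession)
  as.character(seq_df$Sequence)
}
```

Calling a_syn_seq <- fetch_uniprot_sequence("P37840") would then store the sequence under the name used in the rest of this section.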

## Please wait we are processing your accessions ...
Retrieved Sequence:

print(a_syn_seq)
## [1] "MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEM

alpha-Synuclein Figures: Generating the idprofile for alpha-Synuclein
To generate the ‘idprofile’, call the idprofile function with the sequence and UniProt ID specified. This generates all plots in Figure 1.

idprofile(sequence = a_syn_seq,
uniprotAccession = "P37840",
proteinName = "alpha-Synuclein")


## [[1]]



##
## [[2]]

Calculation of Local Charge in alpha−Synuclein.
Window Size = 9 ; Net Charge = −9.048


##
## [[3]]

Measurement of Scaled Hydropathy in alpha−Synuclein.
Window Size = 9 ; Average Scaled Hydropathy = 0.455


##
## [[4]]


FoldIndex Prediction of Intrinsic Disorder in alpha−Synuclein


##
## [[5]]

Prediction of Intrinsic Disorder in alpha−Synuclein.
By IUPred2A long


##
## [[6]]


alpha-Synuclein Figures: Generating Supplemental Figures for alpha-Synuclein
The following code generates plots for supplemental figure 1.
Generating Supplemental Figures for alpha-Synuclein: Charge-Hydropathy plot of protein domains
To add multiple points to the charge hydropathy plot, first the sequence will be split into the N-term (residues
1-103) and C-term (residues 104-140). To do this, I will use the ‘AAString’ function from Biostrings. For
the split sequences and the full length sequence, the average net charge and the mean scaled hydropathy are
calculated and put into a data frame. These coordinates will be used for adding ggplot2 annotations. Since
idpr depends on both of these packages, they should already be installed.

# --- Load packages
library(ggplot2)
library(Biostrings)
| R | H | Name | Name_Expression |
|---|---|---|---|
| 0.047803 | 0.495049 | 1-103 | aSyn [1-103] |
| -0.37783 | 0.34473 | 104-140 | aSyn [104-140] |
| -0.06463 | 0.455321 | 1-140 | aSyn [1-140] |
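
The code that builds a_syn_seq_split and RH_DF is not shown in this protocol. One way it could look is sketched below. split_domains and build_RH_DF are hypothetical helper names, the split uses base R substr() rather than ‘AAString’, and the idpr calls netCharge(averaged = TRUE) and meanScaledHydropathy() are assumed to return the average net charge and the mean scaled hydropathy of each piece. The Name_Expression column is not reproduced.

```r
# Split a sequence into N-terminal, C-terminal, and full-length pieces.
split_domains <- function(sequence, n_term_end) {
  len <- nchar(sequence)
  pieces <- c(substr(sequence, 1, n_term_end),
              substr(sequence, n_term_end + 1, len),
              sequence)
  names(pieces) <- c(paste0("1-", n_term_end),
                     paste0(n_term_end + 1, "-", len),
                     paste0("1-", len))
  pieces
}

# Build the coordinate table: R = average net charge, H = mean scaled hydropathy.
# Requires idpr; the argument names used here are assumptions.
build_RH_DF <- function(pieces) {
  data.frame(
    R = vapply(pieces, function(s) idpr::netCharge(s, averaged = TRUE), numeric(1)),
    H = vapply(pieces, function(s) idpr::meanScaledHydropathy(s), numeric(1)),
    Name = names(pieces),
    row.names = NULL
  )
}

# Toy example of the split step (not the real alpha-Synuclein sequence)
toy_split <- split_domains("MDVFMKGLSK", n_term_end = 6)
print(toy_split)
```

For alpha-Synuclein, a_syn_seq_split <- split_domains(a_syn_seq, 103) followed by RH_DF <- build_RH_DF(a_syn_seq_split) would follow the same pattern. As a consistency check, the full-length value of R in the table agrees with the net charge reported by the idprofile divided by the 140 residues: -9.048 / 140 ≈ -0.0646.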
Then, the ggplot is made and annotations are added. See ggplot2 for annotation options.

# --- Make the base plot
a_syn_split_plot <- chargeHydropathyPlot(a_syn_seq_split)

# --- Add arrows to plot using ggplot2 geom_segment() function.
# Arrows start at aSyn [1-140] and point to domains.
a_syn_split_plot <- a_syn_split_plot +
  geom_segment(aes(x = 0.4553214, y = -0.06462863, xend = 0.488, yend = 0.03),
               arrow = arrow(length = unit(0.2, "cm"),
                             type = "closed")) +
  geom_segment(aes(x = 0.4553214, y = -0.06462863, xend = 0.355, yend = -0.358),
               arrow = arrow(length = unit(0.2, "cm"),
                             type = "closed"))
# --- Add labels to points with ggplot2 functions
a_syn_split_plot <- a_syn_split_plot +
  geom_text(data = RH_DF,
            aes(x = H,
                y = R,
                label = Name),
            nudge_x = c(0.05, -0.05, 0.05),
            nudge_y = c(0.07, 0.070, -0.055)
  )
# --- Adds colored points to plot. Adds on top of geom_segment.
a_syn_split_plot <- a_syn_split_plot +
  geom_point(data = RH_DF,
             aes(x = H,
                 y = R),
             color = c("#348AA7", "#92140C", "chocolate1"))
plot(a_syn_split_plot)


Generating Supplemental Figures for alpha-Synuclein: Sequence Map showing Familial Mutations
Several point mutations in the alpha-Synuclein NTD have been identified that are linked to familial Parkinson's disease. These are annotated here in the context of intrinsic disorder predictions from IUPred2. Functions from ggplot2 are needed for annotations, so this package must be attached if it is not already.
First, the data is retrieved from IUPred2. Setting plotResults = FALSE returns a data frame for custom
plotting.
| Position | AA | IUPred2 |
|---|---|---|
| 1 | M | 0.206376 |
| 2 | D | 0.239899 |
| 3 | V | 0.30533 |
| 4 | F | 0.328603 |
| 5 | M | 0.281712 |
| 6 | K | 0.239899 |
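
The retrieval call for this table is not shown in this protocol. A minimal sketch follows, assuming the idpr function iupred() accepts uniprotAccession and plotResults arguments as the text describes; fetch_iupred_df is a hypothetical wrapper name, and an internet connection is required.

```r
# Hypothetical wrapper around idpr::iupred(); requires idpr and internet access.
fetch_iupred_df <- function(accession) {
  if (!requireNamespace("idpr", quietly = TRUE)) {
    stop("Please install the idpr package first.")
  }
  # plotResults = FALSE returns a data frame instead of a plot
  idpr::iupred(uniprotAccession = accession, plotResults = FALSE)
}
```

Calling a_syn_iupred_df <- fetch_iupred_df("P37840") and then head(a_syn_iupred_df) would be expected to give the first rows shown in the table.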
Then, a sequence map is created with the IUPred results. Column 2 (a_syn_iupred_df$AA) is a character
vector of individual, single-letter amino acids. Column 3 (a_syn_iupred_df$IUPred2) is a numeric vector
of IUPred2 scores.

iupred_map <-
sequenceMap(sequence = a_syn_iupred_df$AA,
property = a_syn_iupred_df$IUPred2,
nbResidues = 35,
customColors = c("darkolivegreen3", "grey65", "darkorchid1"))
# --- Plot the unedited, unannotated sequenceMap
plot(iupred_map)


For adding annotations to a sequence map, you can get the position within the plot using the idpr function
‘sequenceMapCoordinates’. This helps guide or identify the coordinates for ggplot2 annotations.
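| Position | AA | row | col |
|---|---|---|---|
| 1 | M | 4 | 1 |
| 2 | D | 4 | 2 |
| 3 | V | 4 | 3 |
| 4 | F | 4 | 4 |
| 5 | M | 4 | 5 |
| 6 | K | 4 | 6 |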
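
The row and col values returned by ‘sequenceMapCoordinates’ appear to follow a simple grid layout: residues fill each row from left to right, and the first residues sit in the top (highest-numbered) row. The base R sketch below reproduces that layout. map_coordinates is a hypothetical helper whose logic is inferred from the tables in this protocol, not taken from the idpr source code.

```r
# Hypothetical re-implementation of the row/col layout (inferred, not idpr code).
map_coordinates <- function(position, seq_length, nbResidues) {
  n_rows <- ceiling(seq_length / nbResidues)
  data.frame(
    Position = position,
    row = n_rows - (position - 1) %/% nbResidues,  # first residues sit in the top row
    col = (position - 1) %% nbResidues + 1         # 1-based column within the row
  )
}

# Mutated positions in alpha-Synuclein (140 residues, 35 residues per row)
aSyn_coords <- map_coordinates(c(30, 46, 50, 51, 53),
                               seq_length = 140,
                               nbResidues = 35)
print(aSyn_coords)
```

Under these assumptions, residue 30 lands in row 4, column 30, and residues 46, 50, 51, and 53 land in row 3 at columns 11, 15, 16, and 18, which is where small x and y offsets can then be applied for annotations.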
Additionally, several annotations are added and the plot themes are edited. See code for all annotations.
The code below adds labels for the familial mutations and a ‘*’ above each mutated residue. These positions are determined by ‘sequenceMapCoordinates’, and values are added to the row (y) value to move annotations above the letters. To center the annotations over the residues, the column (x) values were adjusted by 0.5 or 0.35.

iupred_map <- iupred_map +
annotate("text",
x = c(15.5, 18.5, 30.5, 11.5, 16.5),
y = c(3.15, 3.15, 4.15, 3.15, 3.3),
label = c("H50Q", "A53T", "A30P", "E46K", "G51D")) +
annotate("text",
x = c(15.35, 18.35, 30.25, 11.35, 16.35),
y = c(2.825, 2.825, 3.825, 2.825, 2.825),
label = rep("*", 5),
size = 7)

Finally, the annotated sequence map is plotted.

plot(iupred_map)

Sequence Map of IUPred2 Predictions for aSyn


p53 Figures: Fetching the amino acid sequence
First, I will use the UniprotR package to get the p53 amino acid sequence from the UniProtID. For p53,
the ID is P04637.

## Please wait we are processing your accessions ...

Retrieved Sequence:

print(p53_seq)

## [1] "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYS

p53 Figures: Fetching IUPred2A
To retrieve the IUPred2 long and ANCHOR2 scores, the p53 UniProt accession is used.

iupredAnchor(uniprotAccession = "P04637",
proteinName = "p53")

Prediction of Intrinsic Disorder in p53.
By IUPred2A long and ANCHOR2


p53 Figures: Fetching IUPred2 Redox
To retrieve the IUPred2 redox-state predictions, the p53 UniProt accession is used.

iupredRedox(uniprotAccession = "P04637",
proteinName = "p53")


Prediction of Intrinsic Disorder in p53
By IUPred2 long|Based on Environmental Redox State


p53 Figures: Making the sequenceMap of p53’s sequence structural tendency
First, the characteristic to map in the plot must be calculated. Here, the tendency for each residue to favor ordered or disordered structures is determined.

tendency_DF <- structuralTendency(p53_seq)
knitr::kable(head(tendency_DF))
| Position | AA | Tendency |
|---|---|---|
| 1 | M | Order Promoting |
| 2 | E | Disorder Promoting |
| 3 | E | Disorder Promoting |
| 4 | P | Disorder Promoting |
| 5 | Q | Disorder Promoting |
| 6 | S | Disorder Promoting |
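
The classification in the table above can be pictured as a lookup of each residue against fixed residue groups. The base R sketch below illustrates the idea. classify_tendency is a hypothetical helper, and both the residue groupings and the "Disorder Neutral" label are assumptions for illustration; check the idpr documentation for the defaults that structuralTendency actually uses.

```r
# Residue groupings assumed for illustration only.
disorder_promoting <- c("P", "E", "S", "Q", "K", "A", "G")
order_promoting <- c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C")

# Label each residue by the group it belongs to; anything else is "neutral".
classify_tendency <- function(sequence) {
  residues <- strsplit(sequence, "")[[1]]
  tendency <- ifelse(residues %in% disorder_promoting, "Disorder Promoting",
              ifelse(residues %in% order_promoting, "Order Promoting",
                     "Disorder Neutral"))
  data.frame(Position = seq_along(residues),
             AA = residues,
             Tendency = tendency)
}

# First six residues of p53, as listed in the table
toy_tendency <- classify_tendency("MEEPQS")
print(toy_tendency)
```

With these assumed groupings, the six residues are labeled the same way as in the table: M as order promoting and the remaining five as disorder promoting.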
Then, the sequenceMap is made. Since p53 is a long sequence, the nbResidues are increased in the sequenceMap
for easier viewing.

tendency_map <-
sequenceMap(sequence = tendency_DF$AA,
property = tendency_DF$Tendency,
nbResidues = 79,
customColors = c("#F0B5B3", "darkolivegreen3", "darkorchid1")
)
plot(tendency_map) #Return the unedited map


To get the coordinates for ggplot annotations, the ‘sequenceMapCoordinates’ function can assist. Since the
default has been changed for nbResidues, from 30 to 79, this must change in the coordinates function to
properly calculate the position of each residue within the sequenceMap.

p53_coords <- sequenceMapCoordinates(p53_seq,
nbResidues = 79)
knitr::kable(head(p53_coords)) #Top of results to show example
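| Position | AA | row | col |
|---|---|---|---|
| 1 | M | 5 | 1 |
| 2 | E | 5 | 2 |
| 3 | E | 5 | 3 |
| 4 | P | 5 | 4 |
| 5 | Q | 5 | 5 |
| 6 | S | 5 | 6 |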
Additional annotations are then made; see the code and the example from Fig. S1B for working with sequenceMap and annotations.
After annotations and titles have been added, the plot can be generated.

plot(tendency_map)

Sequence Map of Residue Tendency for p53.


p53 Figures: Generating the idprofile for p53
To generate the ‘idprofile’, call the idprofile function with the sequence and UniProt ID specified. This generates all plots in Supplementary Figure 2.


idprofile(sequence = p53_seq,
uniprotAccession = "P04637",
proteinName = "p53") #Specifying proteinName automatically names plot



## [[1]]


Compositional Profile of p53


##
## [[2]]

Calculation of Local Charge in p53
Window Size = 9 ; Net Charge = −5.774


##
## [[3]]

Measurement of Scaled Hydropathy in p53
Window Size = 9 ; Average Scaled Hydropathy = 0.416


##
## [[4]]

FoldIndex Prediction of Intrinsic Disorder in p53


##
## [[5]]

Prediction of Intrinsic Disorder in p53
By IUPred2A long


##
## [[6]]


Citations
Erdős, G., & Dosztányi, Z.. Analyzing protein disorder with IUPred2A
https://doi.org/10.1002/cpbi.99
Mészáros, B., Erdős, G., & Dosztányi, Z.. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
https://doi.org/10.1093/nar/gky384
Soudy M, Anwar AM, Ahmed EA, Osama A, Ezzeldin S, Mahgoub S, Magdeldin S. UniprotR: Retrieving and visualizing protein sequence and functional information from Universal Protein Resource (UniProt knowledgebase).
https://doi.org/10.1016/j.jprot.2019.103613
UniProt Consortium. UniProt: a hub for protein information
https://doi.org/10.1093/nar/gku989