Title: | Profiling and Analyzing Intrinsically Disordered Proteins in R |
---|---|
Description: | ‘idpr’ aims to integrate tools for the computational analysis of intrinsically disordered proteins (IDPs) within R. This package is used to identify known characteristics of IDPs for a sequence of interest with easily reported and dynamic results. Additionally, this package includes tools for IDP-based sequence analysis to be used in conjunction with other R packages. Described in McFadden WM & Yanowitz JL (2022). "idpr: A package for profiling and analyzing Intrinsically Disordered Proteins in R." PloS one, 17(4), e0266929. <https://doi.org/10.1371/journal.pone.0266929>. |
Authors: | William M. McFadden [cre, aut], Judith L. Yanowitz [aut, fnd], Michael Buszczak [ctb, fnd] |
Maintainer: | William M. McFadden <[email protected]> |
License: | LGPL (>= 3) |
Version: | 1.17.0 |
Built: | 2024-12-02 06:30:46 UTC |
Source: | https://github.com/bioc/idpr |
This function will determine the charge of a peptide using the Henderson-Hasselbalch Equation. The output is a data frame (default) or a plot of charge calculations along the peptide sequence. Charges are determined globally, or along the entire chain.
chargeCalculationGlobal( sequence, pKaSet = "IPC_protein", pH = 7, plotResults = FALSE, includeTermini = TRUE, sumTermini = TRUE, proteinName = NA, printCitation = FALSE, ... )
sequence |
amino acid sequence as a character string or vector of individual residues. Alternatively, a character string giving the path to a .fasta / .fa file |
pKaSet |
A character string or data frame. "IPC_protein" by default. Character string to load specific, preloaded pKa sets. c("EMBOSS", "DTASelect", "Solomons", "Sillero", "Rodwell", "Lehninger", "Toseland", "Thurlkill", "Nozaki", "Dawson", "Bjellqvist", "ProMoST", "Vollhardt", "IPC_protein", "IPC_peptide") Alternatively, the user may supply a custom pKa dataset. The format must be a data frame where: Column 1 must be a character vector of residues named "AA" AND Column 2 must be a numeric vector of pKa values. |
pH |
numeric value, 7.0 by default. The environmental pH used to calculate residue charge. |
plotResults |
logical value, FALSE by default.
This determines what is returned. If |
includeTermini , sumTermini
|
Logical values, both TRUE by default. This
determines how the calculation handles the N- and C- terminus.
includeTermini determines if the calculation will use the charge of the
amine and carboxyl groups at the ends of the peptide (When TRUE). These
charges are ignored when |
proteinName |
character string with length = 1. optional setting to include the name in the plot title. |
printCitation |
Logical value. FALSE by default.
When |
... |
any additional parameters, especially those for plotting. |
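The custom pKaSet format described in the arguments above can be sketched in base R as follows. This is only an illustration: the object name customPKa, the helper isValidPKaSet, and the second column name are hypothetical, the pKa values are round textbook-style placeholders rather than values from any preloaded set, and rows for the termini are omitted because their required labels are not stated here.

```r
# Hypothetical custom pKa set; values are illustrative placeholders only.
customPKa <- data.frame(
  AA  = c("C", "D", "E", "H", "K", "R", "Y"),
  pKa = c(8.3, 3.9, 4.3, 6.0, 10.5, 12.5, 10.1),
  stringsAsFactors = FALSE
)

# Check the two stated requirements for a custom set:
# column 1 is a character vector named "AA", column 2 is numeric.
isValidPKaSet <- function(df) {
  is.data.frame(df) &&
    ncol(df) >= 2 &&
    names(df)[1] == "AA" &&
    is.character(df[[1]]) &&
    is.numeric(df[[2]])
}

isValidPKaSet(customPKa)
```

A data frame of this shape is what the pKaSet argument accepts in place of the name of a preloaded set.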
If plotResults = FALSE, a data frame is returned with the position,
residue, and charge (-1 to +1). If plotResults = TRUE, a graphical output
is returned (ggplot) showing the charge distribution.
For users who wish to keep a common aesthetic, the following colors are
used when plotResults = TRUE.
Dynamic line colors:
Close to -1 = "#92140C"
Close to +1 = "#348AA7"
Close to 0 (midpoint) = "grey65" or "#A6A6A6"
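For users who want to reuse these colors outside of idpr, the three anchor colors can be turned into a continuous scale. The sketch below uses only base R (grDevices) and assumes a simple linear blend between the anchors; the names chargeRamp and chargeToColor are hypothetical, and the blend used inside the package may differ.

```r
# Linear ramp through the three documented anchor colors.
chargeRamp <- grDevices::colorRamp(c("#92140C", "grey65", "#348AA7"))

# Map charges in [-1, 1] to hex colors.
chargeToColor <- function(charge) {
  stopifnot(all(charge >= -1 & charge <= 1))
  # Rescale [-1, 1] to [0, 1], then round to whole RGB values (0-255).
  rgbValues <- round(chargeRamp((charge + 1) / 2))
  grDevices::rgb(rgbValues[, 1], rgbValues[, 2], rgbValues[, 3],
                 maxColorValue = 255)
}

chargeToColor(c(-1, 0, 1))
```

The returned hex strings can be passed to any plotting function that accepts color vectors.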
See pKaData for residue pKa values and hendersonHasselbalch for charge
calculations.
Other charge functions: chargeCalculationLocal(), hendersonHasselbalch(),
netCharge()
library(idpr)

#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
#a character string of the path to the file.
exampleDF <- chargeCalculationGlobal(aaString)
head(exampleDF)
exampleDF <- chargeCalculationGlobal(aaVector)
head(exampleDF)

#Changing pKa set or pH used for calculations
exampleDF_pH5 <- chargeCalculationGlobal(aaString, pH = 5)
head(exampleDF_pH5)
exampleDF_pH7 <- chargeCalculationGlobal(aaString, pH = 7)
head(exampleDF_pH7)
exampleDF_EMBOSS <- chargeCalculationGlobal(aaString,
                                            pH = 7,
                                            pKaSet = "EMBOSS")
head(exampleDF_EMBOSS)

#Termini charges can be excluded with includeTermini = FALSE
exampleDF_NoTermini <- chargeCalculationGlobal(aaString,
                                               includeTermini = FALSE)
head(exampleDF_NoTermini)

#sumTermini controls how the termini charges are handled
exampleDF_SumTermini <- chargeCalculationGlobal(aaString,
                                                sumTermini = TRUE)
head(exampleDF_SumTermini)
exampleDF_SepTermini <- chargeCalculationGlobal(aaString,
                                                sumTermini = FALSE)
head(exampleDF_SepTermini)

#plotResults = TRUE will output a ggplot as a line plot
chargeCalculationGlobal(aaString, plotResults = TRUE)

#since it is a ggplot, you can change or annotate the plot
gg <- chargeCalculationGlobal(aaVector, plotResults = TRUE)
gg <- gg + ggplot2::ylab("Residue Charge")
gg <- gg + ggplot2::geom_text(data = exampleDF,
                              ggplot2::aes(label = AA,
                                           y = Charge + 0.1))
plot(gg)

#alternatively, you can pass the data frame to sequenceMap()
sequenceMap(sequence = exampleDF$AA,
            property = exampleDF$Charge)
This calculates the charge, as determined by the Henderson-Hasselbalch equation, for each window along the sequence. This function uses a sliding window. The output is either a graph or a data frame of calculated charges.
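The sliding-window idea can be sketched in base R. The helper below, slidingWindowMean, is hypothetical and is not the package's implementation: it assumes each window is summarized by the mean of its values, is centered on a residue, and that positions without a full window are simply dropped.

```r
# Centered sliding-window mean over a numeric vector (e.g. residue charges).
slidingWindowMean <- function(values, window = 9) {
  stopifnot(window >= 1, window %% 2 == 1, length(values) >= window)
  half <- (window - 1) / 2
  # Only positions with a complete window on both sides are reported.
  centers <- (half + 1):(length(values) - half)
  windowMeans <- vapply(
    centers,
    function(i) mean(values[(i - half):(i + half)]),
    numeric(1)
  )
  data.frame(Position = centers, windowMean = windowMeans)
}

slidingWindowMean(c(1, 2, 3, 4, 5), window = 3)
```

With a window of 9 and a 20-residue sequence, this sketch reports 12 positions, which is why larger windows give smoother but shorter profiles.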
chargeCalculationLocal( sequence, window = 9, proteinName = NA, pH = 7, pKaSet = "IPC_protein", printCitation = FALSE, plotResults = FALSE, ... )
sequence |
amino acid sequence as a single character string or vector of single characters. It also supports a single character string that specifies the location of a .fasta or .fa file. |
window |
a positive, odd integer. 9 by default. Sets the size of sliding window, must be an odd number. The window determines the number of residues to be analyzed and averaged for each position along the sequence. |
proteinName |
character string, optional. Used to add protein name
to the title in ggplot. Ignored if |
pH |
numeric value, 7.0 by default. The environmental pH used to calculate residue charge. |
pKaSet |
A character string or data frame. "IPC_protein" by default. Character string to load specific, preloaded pKa sets. c("EMBOSS", "DTASelect", "Solomons", "Sillero", "Rodwell", "Lehninger", "Toseland", "Thurlkill", "Nozaki", "Dawson", "Bjellqvist", "ProMoST", "Vollhardt", "IPC_protein", "IPC_peptide") Alternatively, the user may supply a custom pKa dataset. The format must be a data frame where: Column 1 must be a character vector of residues named "AA" AND Column 2 must be a numeric vector of pKa values. |
printCitation |
Logical value. FALSE by default.
When |
plotResults |
logical value. FALSE by default.
If |
... |
any additional parameters, especially those for plotting. |
See the plotResults argument.
For users who wish to keep a common aesthetic, the following colors are
used when plotResults = TRUE.
Dynamic line colors:
Close to -1 = "#92140C"
Close to +1 = "#348AA7"
Close to 0 (midpoint) = "grey65" or "#A6A6A6"
See pKaData for residue pKa values and citations. See hendersonHasselbalch
for charge calculations.
Other charge functions: chargeCalculationGlobal(), hendersonHasselbalch(),
netCharge()
library(idpr)

#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
#a character string of the path to the file.
exampleDF <- chargeCalculationLocal(aaString)
head(exampleDF)
exampleDF <- chargeCalculationLocal(aaVector)
head(exampleDF)

#Changing window will alter the number of residues analyzed
exampleDF_window3 <- chargeCalculationLocal(aaString, window = 3)
head(exampleDF_window3)
exampleDF_window15 <- chargeCalculationLocal(aaString, window = 15)
head(exampleDF_window15)

#Changing pKa set or pH used for calculations
exampleDF_pH5 <- chargeCalculationLocal(aaString, pH = 5)
head(exampleDF_pH5)
exampleDF_pH7 <- chargeCalculationLocal(aaString, pH = 7)
head(exampleDF_pH7)
exampleDF_EMBOSS <- chargeCalculationLocal(aaString,
                                           pH = 7,
                                           pKaSet = "EMBOSS")
head(exampleDF_EMBOSS)

#plotResults = TRUE will output a ggplot
chargeCalculationLocal(aaString, plotResults = TRUE)

#since it is a ggplot, you can change or annotate the plot
gg <- chargeCalculationLocal(aaVector, window = 3, plotResults = TRUE)
gg <- gg + ggplot2::ylab("Local Charge")
gg <- gg + ggplot2::geom_text(data = exampleDF_window3,
                              ggplot2::aes(label = CenterResidue,
                                           y = windowCharge + 0.1))
plot(gg)
This function calculates the average net charge <R> and the average
scaled hydropathy <H> and visualizes the data. There are known boundaries
on the C-H plot that separate extended and collapsed proteins.
This was originally described in Uversky et al. (2000)
https://doi.org/10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7.
The plot returned is based on the charge-hydropathy plot from
Uversky (2016) https://doi.org/10.1080/21690707.2015.1135015.
See Uversky (2019) https://doi.org/10.3389/fphy.2019.00010 for
additional information and a recent review on the topic.
This plot has also been referred to as a "Uversky Plot".
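To make the two axes concrete, the sketch below computes a mean scaled hydropathy in base R and compares a protein against the boundary line. It rests on assumptions: the Kyte-Doolittle values are taken from Kyte & Doolittle (1982) and rescaled to 0-1 with (x + 4.5) / 9, the boundary <R> = 2.785 <H> - 1.151 is the one reported in Uversky et al. (2000), and the helper names meanScaledKD, boundaryCharge, and isExtended are hypothetical stand-ins for what the package computes internally.

```r
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
        L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Rescale to 0..1 so that Arg = 0 and Ile = 1 (assumed scaling).
kdScaled <- (kd + 4.5) / 9

# Mean scaled hydropathy <H> of a sequence given as one string.
meanScaledKD <- function(sequence) {
  residues <- strsplit(sequence, "")[[1]]
  stopifnot(all(residues %in% names(kdScaled)))
  mean(kdScaled[residues])
}

# Boundary from Uversky et al. (2000): <R> = 2.785 * <H> - 1.151.
boundaryCharge <- function(meanH) 2.785 * meanH - 1.151

# A protein whose absolute mean net charge lies above the boundary
# falls on the "extended" side of the plot.
isExtended <- function(meanAbsCharge, meanH) {
  meanAbsCharge > boundaryCharge(meanH)
}

meanScaledKD("ACDEFGHIKLMNPQRSTVWY")
isExtended(meanAbsCharge = 0.3, meanH = 0.4)
```

Low hydropathy combined with high net charge places a sequence on the extended side, which is the region associated with IDPs.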
chargeHydropathyPlot( sequence, displayInsolubility = TRUE, insolubleValue = 0.7, proteinName = NA, customPlotTitle = NA, pH = 7, pKaSet = "IPC_protein", plotResults = TRUE, ... )
sequence |
amino acid sequence (or path to a .fasta file) as a character string. Supports multiple sequences / files, as a character vector of strings. Additionally, this supports a single protein as a character vector of single residues. Multiple proteins are not supported as a character vector of single characters. |
displayInsolubility |
logical value, TRUE by default. This adds (or removes when FALSE) the vertical line separating collapsed proteins and insoluble proteins |
insolubleValue |
numerical value. 0.7 by default.
Ignored when |
proteinName , customPlotTitle
|
optional character string. NA by default. Used to either add the name of the protein to the plot title when there is only one protein, or to create a custom plot title for the output. |
pH |
numeric value, 7.0 by default. The environmental pH is used to calculate residue charge. |
pKaSet |
pKa set used for charge calculations. See
|
plotResults |
logical value, TRUE by default.
This determines what is returned. If |
... |
additional arguments to be passed to
idpr::netCharge(),
idpr::meanScaledHydropathy() or
|
Graphical values of Charge-Hydropathy Plot
For users who wish to keep a common aesthetic, the following colors are
used when plotResults = TRUE.
Point(s) = "chocolate1" or "#ff7f24"
Lines = "black"
Kozlowski, L. P. (2016). IPC – Isoelectric Point Calculator. Biology
Direct, 11(1), 55. https://doi.org/10.1186/s13062-016-0159-9
Kyte, J., & Doolittle, R. F. (1982). A simple method for
displaying the hydropathic character of a protein.
Journal of molecular biology, 157(1), 105-132.
Uversky, V. N. (2019). Intrinsically Disordered Proteins and Their
“Mysterious” (Meta)Physics. Frontiers in Physics, 7(10).
https://doi.org/10.3389/fphy.2019.00010
Uversky, V. N. (2016). Paradoxes and wonders of intrinsic disorder:
Complexity of simplicity. Intrinsically Disordered Proteins, 4(1),
e1135015. https://doi.org/10.1080/21690707.2015.1135015
Uversky, V. N., Gillespie, J. R., & Fink, A. L. (2000).
Why are “natively unfolded” proteins unstructured under physiologic
conditions?. Proteins: structure, function, and bioinformatics, 41(3),
415-427.
https://doi.org/10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7
See netCharge and meanScaledHydropathy for functions used to calculate
values.
library(idpr)

#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
#the path to the file as a character string
chargeHydropathyPlot(sequence = aaString)
chargeHydropathyPlot(sequence = aaVector)

#This function also supports multiple sequences
#only as character strings or .fasta files
multipleSeq <- c("ACDEFGHIKLMNPQRSTVWY",
                 "ACDEFGHIK",
                 "LMNPQRSTVW")
chargeHydropathyPlot(sequence = multipleSeq)

#since it is a ggplot, we can add additional annotations or themes
chargeHydropathyPlot(sequence = multipleSeq) +
  ggplot2::theme_void()
chargeHydropathyPlot(sequence = multipleSeq) +
  ggplot2::geom_hline(yintercept = 0, color = "red")

#choosing the pKa set used for calculations
chargeHydropathyPlot(sequence = multipleSeq,
                     pKaSet = "EMBOSS")
The Disorder40, Disorder60, and Disorder85 Matrices were
developed and described in Brown et al. (2009).
In short: These are substitution scoring matrices used to align proteins
or regions which experience intrinsic disorder. The matrices were
calculated using pairwise sequence alignments of protein families
which were identified from 287 experimentally confirmed Intrinsically
Disordered Proteins (IDPs). The IDPs contained at least 30 sequential
residues of intrinsic disorder and protein families were found using
BLAST.
There was not a comprehensive comparison to other frequently
used substitution matrices (like BLOSUM and PAM) in terms of improving IDP
sequence alignments. The authors note that these matrices were made to
compare evolutionary characteristics of disordered
and ordered proteins. Please see the source material for additional
information.
Trivedi and Nagarajaram (2019) compared EDSSMat62 against all three
Disorder matrices. Disorder40 and Disorder85 attain lower E-values for
highly disordered proteins, on average, when compared to EDSSMat62.
EDSSMat62 attained lower E-values when compared to Disorder60 for aligning
highly disordered proteins.
EDSSMat62 performs better than all three Disorder matrices for IDPs
enriched in ordered regions. Please
see the referenced paper, specifically Supplementary Figures S18-20, for
additional information and original comparison.
Additionally, please cite the source article when using Disorder40,
Disorder60, or Disorder85.
Disorder40 Disorder60 Disorder85
All matrices are symmetric. 24 residues are represented:
Each of the 20 standard amino acids
Four ambiguous residues:
B: Asparagine or Aspartic Acid (Asx)
Z: Glutamine or Glutamic Acid (Glx)
X: Unspecified or unknown amino acid
*: Stop
Each of the three matrices is an object of class matrix (inherits from
array) with 24 rows and 24 columns.
As mentioned in the Description, the intended use of these matrices was not to improve sequence alignments. Therefore, no gap penalty values are provided.
It should also be noted that a more recent work, Trivedi and Nagarajaram (2019), determined optimal parameters based on the disordered content of query sequences, as reported in the paper's Supplementary Table S5.
Matrix Name | Gap Open (LD) | Gap Extension (LD) | Gap Open (MD) | Gap Extension (MD) | Gap Open (HD) | Gap Extension (HD) |
Disorder40 | -20 | -1 | -7 | -1 | -7 | -1 |
Disorder60 | -20 | -1 | -16 | -1 | -11 | -2 |
Disorder85 | -20 | -1 | -16 | -1 | -7 | -2 |
Please see the referenced paper for additional information and original
reporting. Additionally, please see EDSSMat.
Trivedi, R., Nagarajaram, H.A. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep 9, 16380 (2019). https://doi.org/10.1038/s41598-019-52532-8
Brown, C. J., Johnson, A. K., & Daughdrill, G. W. (2009). Comparing Models of Evolution for Ordered and Disordered Proteins. Molecular Biology and Evolution, 27(3), 609-621. doi:10.1093/molbev/msp277
See the Disordered Matrices Vignette within the idpr package and EDSSMat62.
Other IDP-based Substitution Matrices: DUNMat, EDSSMat
This matrix was developed and described in Radivojac et al. (2002).
The name "DUNMat" is taken from Trivedi and Nagarajaram (2019). This is to
keep naming consistent and distinct from other matrices named "Disorder".
In short: This is a substitution scoring matrix used to align proteins or
regions which experience intrinsic disorder. The scores for this matrix
are derived from proteins that have long regions of disorder (LDR),
defined in this paper as an intrinsically disordered region (IDR) of at
least 40 sequential residues. 55 protein families with LDRs were used to
generate the data. Direct comparisons were not made against BLOSUM or PAM
matrices within the source paper due to differences in scaling; however,
when ranking its performance, it performed the best in aligning proteins
with less than 50% sequence identity. Please see the source material,
specifically Table 2, for additional information.
Trivedi and Nagarajaram (2019) compared EDSSMat62 and DUNMat and show that
DUNMat, on average, attained smaller E-values in the dataset of IDPs
enriched in ordered regions, while EDSSMat62 attained smaller E-values in
sets of highly disordered IDPs. Please
see the referenced paper, specifically Supplementary Figure S21, for
additional information and original comparison.
Additionally, please cite the source article when using the "DUNMat"
Matrix.
DUNMat
A symmetrical matrix. 20x20 representing the 20 standard amino acids
These values were described in the source article and reported in Table 2. After the optimal parameters were determined, the authors further refined the gap costs. Therefore, it is recommended to use these parameters for any alignment utilizing this matrix. These were:
DUNMat | Gap Open | Gap Extension |
Original Optimization | -3 | -0.5 |
Further Refinement | -3.2 | -0.1 |
It should also be noted that a more recent work, Trivedi and Nagarajaram (2019), determined optimal parameters based on the disordered content of query sequences, as reported in the paper's Supplementary Table S5.
Matrix Name | Gap Open (LD) | Gap Extension (LD) | Gap Open (MD) | Gap Extension (MD) | Gap Open (HD) | Gap Extension (HD) |
DUNMat | -6 | -1 | -6 | -1 | -16 | -2 |
Please see the referenced paper for additional information and original
reporting. Additionally, please see EDSSMat.
Trivedi, R., Nagarajaram, H.A. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep 9, 16380 (2019). https://doi.org/10.1038/s41598-019-52532-8
Radivojac, P., Obradovic, Z., Brown, C. J., & Dunker, A. K. (2001). Improving sequence alignments for intrinsically disordered proteins. In Biocomputing 2002 (pp. 589-600): World Scientific. https://doi.org/10.1142/9789812799623_0055
See EDSSMat62 and the Disordered Matrices Vignette within idpr.
Other IDP-based Substitution Matrices: DisorderMat, EDSSMat
The EDSSMat series of matrices was developed and described in
Trivedi and Nagarajaram (2019).
In short: These are substitution scoring matrices
used to align proteins or regions which experience intrinsic disorder.
Alignment blocks, used to compute the matrix values, were composed of
predicted intrinsically disordered regions. When compared to other, more
frequently used substitution matrices (like BLOSUM and PAM), EDSSMat
had significantly smaller E-values when aligning regions of disorder.
Additionally, EDSSMat62 was shown to identify both close and distant
homologs of a specific IDP while other matrices could only identify some
close homologs. See the source article for additional information
and for comparisons to other matrices.
Additionally, please cite the source article when using any
EDSSMat matrix.
EDSSMat50 EDSSMat60 EDSSMat62 EDSSMat70 EDSSMat75 EDSSMat80 EDSSMat90
All matrices are symmetric. 24 residues are represented:
Each of the 20 standard amino acids
Four ambiguous residues:
B: Asparagine or Aspartic Acid (Asx)
Z: Glutamine or Glutamic Acid (Glx)
X: Unspecified or unknown amino acid
*: Stop
Each of the seven matrices is an object of class matrix (inherits from
array) with 24 rows and 24 columns.
There are 7 reported EDSSMat matrices. Each varies depending on the percent
identity threshold used to cluster protein sequences.
EDSSMat50 clustered proteins with 50% identity or higher,
EDSSMat62 clustered proteins with 62% identity or higher, etc.
See Usage Section for available matrices
These values were described in the source article and reported in
Supplemental Table S5. Therefore, it is recommended to use these parameters
for any alignment utilizing the respective EDSS matrix. These were
determined for 3 categories: Proteins containing Less Disorder (LD),
defined as [0-20%] disorder, Moderate Disorder (MD), defined as (20-40%]
disorder, and High Disorder (HD), defined as (40-100%] disorder.
Please see the source article for additional information.
Matrix Name | Gap Open (LD) | Gap Extension (LD) | Gap Open (MD) | Gap Extension (MD) | Gap Open (HD) | Gap Extension (HD) |
EDSSMat60 | -7 | -1 | -6 | -2 | -14 | -3 |
EDSSMat62 | -8 | -1 | -5 | -2 | -19 | -2 |
EDSSMat70 | -7 | -1 | -5 | -2 | -19 | -2 |
EDSSMat75 | -8 | -1 | -5 | -2 | -19 | -2 |
EDSSMat80 | -7 | -1 | -5 | -2 | -15 | -3 |
EDSSMat90 | -7 | -1 | -5 | -2 | -19 | -2 |
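The table above can be turned into a small lookup that picks gap penalties from the disorder content of a query. This is a base R sketch: the values are copied from the table, the category limits follow the LD/MD/HD definitions given above, and the names edssGaps and pickGapPenalties are hypothetical. Alignment tools differ in the sign convention they expect for gap costs, so the values may need to be negated before use.

```r
# Gap penalties as listed in the table above (Supplemental Table S5).
edssGaps <- data.frame(
  matrix = c("EDSSMat60", "EDSSMat62", "EDSSMat70",
             "EDSSMat75", "EDSSMat80", "EDSSMat90"),
  openLD = c(-7, -8, -7, -8, -7, -7),
  extLD  = c(-1, -1, -1, -1, -1, -1),
  openMD = c(-6, -5, -5, -5, -5, -5),
  extMD  = c(-2, -2, -2, -2, -2, -2),
  openHD = c(-14, -19, -19, -19, -15, -19),
  extHD  = c(-3, -2, -2, -2, -3, -2),
  stringsAsFactors = FALSE
)

# Select penalties for one matrix from the percent disorder of the query.
pickGapPenalties <- function(matrixName, percentDisorder) {
  stopifnot(matrixName %in% edssGaps$matrix,
            percentDisorder >= 0, percentDisorder <= 100)
  # [0-20] is LD, (20-40] is MD, (40-100] is HD.
  category <- as.character(cut(percentDisorder,
                               breaks = c(0, 20, 40, 100),
                               labels = c("LD", "MD", "HD"),
                               include.lowest = TRUE))
  row <- edssGaps[edssGaps$matrix == matrixName, ]
  c(gapOpen = row[[paste0("open", category)]],
    gapExtension = row[[paste0("ext", category)]])
}

pickGapPenalties("EDSSMat62", percentDisorder = 75)
```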
Trivedi, R., Nagarajaram, H.A. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep 9, 16380 (2019). https://doi.org/10.1038/s41598-019-52532-8
See the Disordered Matrices Vignette within the idpr package.
Other IDP-based Substitution Matrices: DUNMat, DisorderMat
This function predicts intrinsic disorder from the scaled hydropathy and absolute net charge of an amino acid sequence using a sliding window. Prilusky, Felder, et al. described this relationship and implemented it graphically as FoldIndex in 2005, and the method has since been implemented in multiple disorder prediction programs. When windows have a negative score (<0) sequences are predicted as disordered. When windows have a positive score (>0) sequences are predicted as ordered. Graphically, this cutoff is displayed by the dashed line at y = 0. Calculations are at pH 7.0 based on the described method and the default is a sliding window of size 51.
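The scoring rule can be written out for a single window. The sketch below uses the published FoldIndex form of the Uversky equation, score = 2.785 <H> - |<R>| - 1.151 (Prilusky et al., 2005), where <H> is the mean scaled hydropathy and <R> the mean net charge of the window. The function names foldIndexScore and classifyWindow are hypothetical, and the package's own calculation may differ in detail.

```r
# FoldIndex-style score for one window.
# meanH: mean scaled hydropathy (0-1); meanNetCharge: mean net charge.
foldIndexScore <- function(meanH, meanNetCharge) {
  2.785 * meanH - abs(meanNetCharge) - 1.151
}

# Negative scores are read as disordered, positive scores as ordered.
classifyWindow <- function(score) {
  ifelse(score < 0, "disordered",
         ifelse(score > 0, "ordered", "boundary"))
}

scores <- foldIndexScore(meanH = c(0.5, 0.4), meanNetCharge = c(0.1, -0.3))
classifyWindow(scores)
```

Because the net charge enters as an absolute value, strongly positive and strongly negative windows both push the score toward the disordered side.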
foldIndexR( sequence, window = 51, proteinName = NA, pKaSet = "IPC_protein", plotResults = TRUE, ... )
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
window |
a positive, odd integer. 51 by default. Sets the size of sliding window, must be an odd number. The window determines the number of residues to be analyzed and averaged for each position along the sequence. |
proteinName |
character string with length = 1. optional setting to replace the name of the plot if plotResults = TRUE. |
pKaSet |
A character string or data frame. "IPC_protein" by default. Character string to load specific, preloaded pKa sets. c("EMBOSS", "DTASelect", "Solomons", "Sillero", "Rodwell", "Lehninger", "Toseland", "Thurlkill", "Nozaki", "Dawson", "Bjellqvist", "ProMoST", "Vollhardt", "IPC_protein", "IPC_peptide") Alternatively, the user may supply a custom pKa dataset. The format must be a data frame where: Column 1 must be a character vector of residues named "AA" AND Column 2 must be a numeric vector of pKa values. |
plotResults |
logical value, TRUE by default.
If |
... |
any additional parameters, especially those for plotting. |
The output is either a data frame or graph
showing the calculated scores for each window along the sequence.
The equation used was originally described in Uversky et al. (2000)
https://doi.org/10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7.
The FoldIndex method of using a sliding window and utilizing the Uversky
equation is described in Prilusky, J., Felder, C. E., et al. (2005).
FoldIndex: a simple tool to predict whether a given protein sequence
is intrinsically unfolded. Bioinformatics, 21(16), 3435-3438.
see plotResults argument
For users who wish to keep a common aesthetic, the following colors are
used when plotResults = TRUE.
Dynamic line colors:
Close to -1 = "#9672E6"
Close to 1 = "#D1A63F"
Close to midpoint = "grey65" or "#A6A6A6"
Kyte, J., & Doolittle, R. F. (1982). A simple method for
displaying the hydropathic character of a protein.
Journal of molecular biology, 157(1), 105-132.
Kozlowski, L. P. (2016). IPC – Isoelectric Point Calculator. Biology
Direct, 11(1), 55. https://doi.org/10.1186/s13062-016-0159-9
Prilusky, J., Felder, C. E., et al. (2005).
FoldIndex: a simple tool to predict whether a given protein sequence
is intrinsically unfolded. Bioinformatics, 21(16), 3435-3438.
Uversky, V. N., Gillespie, J. R., & Fink, A. L. (2000).
Why are “natively unfolded” proteins unstructured under physiologic
conditions?. Proteins: structure, function, and bioinformatics, 41(3),
415-427.
https://doi.org/10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7
See KDNorm for residue hydropathy values.
See pKaData for residue pKa values and citations. See hendersonHasselbalch
for charge calculations.
Other scaled hydropathy functions: KDNorm, meanScaledHydropathy(),
scaledHydropathyGlobal(), scaledHydropathyLocal()
This function calculates the ionic charge of a residue at a specific pH
when given the pKa.
Known, charged residues are accepted as well as
the protein termini and general property to allow customized calculations.
The output is a ratio comparing acid to conjugate base for acidic residues
or a ratio comparing conjugate base to acid for basic residues.
hendersonHasselbalch(pKa, pH = 7, residue)
pKa |
numeric value. The point where A- = HA. |
pH |
numeric value. The pH of the environment. 7.0 by default |
residue |
individual character or character string. accepted values are the exact aa c("C", "D", "E", "H", "K", "R", "Y"), termini c("COOH","COO","NH2","NH3"), or a general property c("acid", "base", "negative", "positive"). |
a numeric value giving the ratio of charged to uncharged residues.
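For reference, the Henderson-Hasselbalch relationship behind this calculation can be written in a few lines of base R. This is a sketch under stated assumptions, not the package's code: the helper name residueCharge and its type argument are hypothetical, and the returned value is interpreted as the fractional charge of the group (between -1 and 0 for acids, between 0 and +1 for bases).

```r
# Fractional charge of an ionizable group from its pKa and the pH.
# Acids (HA <-> A- + H+) carry charge when deprotonated;
# bases (BH+ <-> B + H+) carry charge when protonated.
residueCharge <- function(pKa, pH = 7, type = c("acid", "base")) {
  type <- match.arg(type)
  if (type == "acid") {
    -1 / (1 + 10^(pKa - pH))
  } else {
    1 / (1 + 10^(pH - pKa))
  }
}

residueCharge(pKa = 10, pH = 7, type = "base")
residueCharge(pKa = 4, pH = 7, type = "acid")
```

At pH equal to the pKa the group is half ionized, so the sketch returns +0.5 for a base and -0.5 for an acid, matching the definition of the pKa as the point where A- = HA.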
See pKaData for residue pKa values and citations. See
other charge functions for use.
Other charge functions: chargeCalculationGlobal(), chargeCalculationLocal(),
netCharge()
#Calculating Lysine charge using the EMBOSS pKa data EMBOSS_pKa <- pKaData[, 1:2] EMBOSS_pKa Lys_pKa <- EMBOSS_pKa[EMBOSS_pKa$AA == "K", ] Lys_pKa$EMBOSS #This is Lysines pKa hendersonHasselbalch( pKa = as.numeric(Lys_pKa$EMBOSS), pH = 7.0, residue = "K") #residue = supports general properties as well hendersonHasselbalch( pKa = as.numeric(Lys_pKa$EMBOSS), pH = 7.0, residue = "base") hendersonHasselbalch( pKa = as.numeric(Lys_pKa$EMBOSS), pH = 7.0, residue = "positive") #CALCULATIONS ARE DEPENDENT ON RESIDUE PROPERTY! hendersonHasselbalch( pKa = as.numeric(Lys_pKa$EMBOSS), pH = 7.0, residue = "acid") #Inaccurate Description #You can also calculate charge at different pHs hendersonHasselbalch( pKa = as.numeric(Lys_pKa$EMBOSS), pH = 5.5, residue = "K") hendersonHasselbalch( pKa = as.numeric(Lys_pKa$EMBOSS), pH = 8, residue = "K")
idpr aims to integrate tools for the computational analysis of
intrinsically disordered proteins (IDPs) within R. This package is used to
identify known characteristics of IDPs for a sequence of interest with
easily reported and dynamic results. Additionally, this package includes
tools for IDP-based sequence analysis to be used in conjunction with other R
packages.
Please see the idpr vignettes for details on idpr functions and theory.
browseVignettes("idpr")
McFadden, W. M., & Yanowitz, J. L. (2022). idpr: A package for profiling and analyzing Intrinsically Disordered Proteins in R. PloS one, 17(4), e0266929
The IDPRofile is a summation of many features of the idpr package,
conveniently grouped into one function for quick analysis. This combines
many plotting functions in this package. These include:
chargeHydropathyPlot
chargeCalculationLocal
scaledHydropathyLocal
structuralTendencyPlot
foldIndexR
All of the above linked functions only require the sequence argument
to output plots of characteristics associated with IDPs. The function also
includes options for IUPred functions. The function does one of the
following based on user-specified parameters:
iupred
iupredAnchor
iupredRedox
The IUPred function used depends on the argument of iupredType. All
require the UniProt Accession to make a proper connection to the IUPred2A
REST API. If the UniProt Accession is not specified, the IUPred plot is
skipped.
idprofile( sequence, uniprotAccession = NA, proteinName = NA, iupredType = "long", window = 9, foldIndexWindowSize = 51, pH = 7, foldIndex_pH = 7, pKaSet = "IPC_protein", structuralTendencyType = "bar", structuralTendencySummarize = FALSE, disorderPromoting = c("P", "E", "S", "Q", "K", "A", "G"), disorderNeutral = c("D", "T", "R"), orderPromoting = c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C") )
sequence |
amino acid sequence as a single character string or vector of single characters. It also supports a single character string that specifies the location of a .fasta or .fa file. |
uniprotAccession |
character string specifying the UniProt Accession of the protein of interest. Used to fetch predictions from the IUPred2A REST API. Default is NA. Keep as NA if you do not have a UniProt Accession. |
proteinName |
character string, optional. Used to add protein name to the title in ggplot. |
iupredType |
character string specifying the type of IUPred2 prediction
to retrieve. Can be c("long", "short", "glob", "anchor", "redox"). "long"
by default. "long", "short", and "glob" use the |
window |
a positive, odd integer. 9 by default. Sets the size of sliding window, must be an odd number. The window determines the number of residues to be analyzed and averaged for each position along the sequence. For chargeCalculationLocal and scaledHydropathyLocal. |
foldIndexWindowSize |
a positive, odd integer. 51 by default. Sets the size of sliding window, must be an odd number. The window determines the number of residues to be scored and averaged for each position along the sequence. |
pH |
numeric value, 7.0 by default. The environmental pH used to calculate residue charge. |
foldIndex_pH |
numeric value, 7.0 by default. The environmental pH used to calculate residue charge. FoldIndex specifically uses pH = 7.0 in settings and thus, is distinct from changing pH in the other calculations. |
pKaSet |
A character string or data frame. "IPC_protein" by default. Character string to load specific, preloaded pKa sets. c("EMBOSS", "DTASelect", "Solomons", "Sillero", "Rodwell", "Lehninger", "Toseland", "Thurlkill", "Nozaki", "Dawson", "Bjellqvist", "ProMoST", "Vollhardt", "IPC_protein", "IPC_peptide") Alternatively, the user may supply a custom pKa dataset. The format must be a data frame where: Column 1 must be a character vector of residues named "AA" AND Column 2 must be a numeric vector of pKa values. |
structuralTendencyType |
a character string specifying the type of plot
the |
structuralTendencySummarize |
a logical value specifying the
|
disorderPromoting , disorderNeutral , orderPromoting
|
character vectors of individual residues to be matched with the input sequence. Defaults:
It is not recommended to change these. Arguments passed to
|
Described in McFadden, W. M., & Yanowitz, J. L. (2022). idpr: A package for profiling and analyzing Intrinsically Disordered Proteins in R. PloS one, 17(4), e0266929
4 or 5 plots, depending on whether a UniProt Accession is provided.
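To make the three residue groups concrete, the following base R sketch tallies the fraction of a sequence that falls into each group, using the default vectors from the usage above. The helper tendencySketch is hypothetical and only illustrates the matching step; it is not how the package draws its plot.

```r
# Default residue groups, copied from the idprofile() usage
disorderPromoting <- c("P", "E", "S", "Q", "K", "A", "G")
disorderNeutral <- c("D", "T", "R")
orderPromoting <- c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C")

# Hypothetical helper: fraction of residues matched by each group
tendencySketch <- function(sequence) {
  residues <- strsplit(sequence, "")[[1]]
  groups <- list(disorderPromoting = disorderPromoting,
                 disorderNeutral = disorderNeutral,
                 orderPromoting = orderPromoting)
  vapply(groups, function(g) mean(residues %in% g), numeric(1))
}

# One of each standard residue: 7, 3, and 10 of 20 fall into the groups
tendencySketch("ACDEFGHIKLMNPQRSTVWY")
```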
Kozlowski, L. P. (2016). IPC – Isoelectric Point Calculator. Biology Direct, 11(1), 55. https://doi.org/10.1186/s13062-016-0159-9
Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 157(1), 105-132.
Uversky, V. N. (2016). Paradoxes and wonders of intrinsic disorder: Complexity of simplicity. Intrinsically Disordered Proteins, 4(1), e1135015. https://doi.org/10.1080/21690707.2015.1135015
Uversky, V. N., Gillespie, J. R., & Fink, A. L. (2000). Why are “natively unfolded” proteins unstructured under physiologic conditions?. Proteins: structure, function, and bioinformatics, 41(3), 415-427. https://doi.org/10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7
If a pKa set is specified, see pKaData
Kozlowski, L. P. (2016). IPC – Isoelectric Point Calculator. Biology Direct, 11(1), 55. https://doi.org/10.1186/s13062-016-0159-9
If a pKa set is specified, see pKaData
Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 157(1), 105-132.
Uversky, V. N. (2013). A decade and a half of protein intrinsic disorder: Biology still waits for physics. Protein Science, 22(6), 693-724. doi:10.1002/pro.2261
Prilusky, J., Felder, C. E., et al. (2005). FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 21(16), 3435-3438.
Uversky, V. N., Gillespie, J. R., & Fink, A. L. (2000). Why are “natively unfolded” proteins unstructured under physiologic conditions?. Proteins: structure, function, and bioinformatics, 41(3), 415-427. https://doi.org/10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7
Also see citations for hydropathy and charge plots above
iupred, iupredAnchor, iupredRedox
Bálint Mészáros, Gábor Erdős, Zsuzsanna Dosztányi, IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W329–W337, https://doi.org/10.1093/nar/gky384
Erdős, G., & Dosztányi, Z. (2020). Analyzing protein disorder with IUPred2A. Current Protocols in Bioinformatics, 70, e99. https://doi.org/10.1002/cpbi.99
chargeHydropathyPlot
chargeCalculationLocal
scaledHydropathyLocal
structuralTendencyPlot
foldIndexR
iupred
iupredAnchor
iupredRedox
#For most functions, a protein sequence is all that is needed.
#The UniProt ID is optional but recommended for IUPred results.
proteinID <- "P04637"
p53Seq <- idpr:::TP53Sequences[2]
## Not run:
idprofile(
  sequence = p53Seq,
  uniprotAccession = proteinID)
#changing the iupred to redox
## and getting a pie chart for structuralTendency.
idprofile(
  sequence = p53Seq,
  uniprotAccession = proteinID,
  pKaSet = "EMBOSS",
  iupredType = "redox",
  structuralTendencyType = "pie")
## End(Not run)
This function makes a connection to the IUPred2A REST API based on the type
of analysis and UniProt accession number. This requires the user to know
the accession number of their protein and a connection to the internet.
The results are then formatted to match output in the idpr package.
Predictions are made on a scale of 0-1, where any residue with a score
over 0.5 is predicted to be disordered, and any residue scoring below 0.5
is predicted to be ordered (when using "long" and "short" predictions).
The output is either a graph (ggplot) or data frame of predictions.
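The 0.5 cutoff can be applied to a vector of scores in one line. The sketch below uses made-up scores rather than real IUPred2 output, and the column names are hypothetical; how a score of exactly 0.5 is treated is not stated above, so calling it ordered is an assumption of this example.

```r
# Made-up scores standing in for per-residue IUPred2 predictions
scores <- c(0.12, 0.48, 0.51, 0.93, 0.50)

# Scores over 0.5 are called disordered; exactly 0.5 is treated as ordered here (assumption)
calls <- ifelse(scores > 0.5, "disordered", "ordered")

data.frame(Position = seq_along(scores), Score = scores, Call = calls)
```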
iupred() is used for standard predictions of intrinsic disorder
of an amino acid sequence. This is the core of predictions.
Predictions vary by iupredType (details below)
The results are either a ggplot or a data frame of the fetched IUPred2
predictions.
iupredAnchor() is used to combine the output of IUPred2 long with
ANCHOR2 predictions. ANCHOR2 is a context-dependent predictor of binding
regions for protein-protein interactions. The results are either a ggplot
with 2 lines, one for IUPred2 long and another for ANCHOR predictions, or
a data frame with both IUPred2 long and ANCHOR Predictions. Values are
fetched by the IUPred2A REST API.
iupredRedox() is used to predict redox-sensitive regions that may
experience induced folding upon changing environments.
This is a context-dependent predictor of disordered regions depending on
a reducing (plus) or oxidizing (minus) environment. The results can be
a ggplot with two IUPred2 long predictions, one for plus and another for
minus environments, with redox sensitive regions shaded (if predicted).
Alternatively, the results can be a data frame with both IUPred2 long plus
and minus predictions as well as a column of logical values where a residue
that is TRUE is predicted to be in a redox sensitive region. Values are
fetched by the IUPred2A REST API.
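A logical per-residue column like the one described can be collapsed into start and end positions with rle(). This is a sketch on made-up flags; the vector name sensitive and the output column names are hypothetical and do not reflect the data frame that iupredRedox() returns.

```r
# Made-up logical flags standing in for the redox-sensitivity column
sensitive <- c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

# Run-length encode the flags, then convert run lengths to positions
runs <- rle(sensitive)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the runs that are TRUE
regions <- data.frame(start = starts[runs$values], end = ends[runs$values])
regions
```

For the flags above this reports two regions, residues 3 to 5 and 7 to 8.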
The IUPred2A website is located at https://iupred2a.elte.hu/.
For detailed information on using IUPred2A, please refer to
Erdős & Dosztányi (2020)
Analyzing protein disorder with IUPred2A.
Current Protocols in Bioinformatics, 70, e99.
Additionally, please see
Mészáros et al. (2018)
for further information, theory, and applications of IUPred2A.
Please cite these articles if you use any iupred function.
iupred(
  uniprotAccession,
  iupredType = "long",
  plotResults = TRUE,
  proteinName = NA
)
iupredAnchor(uniprotAccession, plotResults = TRUE, proteinName = NA)
iupredRedox(uniprotAccession, plotResults = TRUE, proteinName = NA)
uniprotAccession |
character string specifying the UniProt Accession of the protein of interest. Used to fetch predictions from the IUPred2A REST API |
iupredType |
character string. "long" by default. accepted types are c("long", "short", "glob"). See "Prediction Type" information below. |
plotResults |
logical value. TRUE by default.
If |
proteinName |
character string, optional. Used to add protein name
to the title in ggplot. Ignored if |
see plotResults argument.
Information from https://iupred2a.elte.hu/help_new on 5.22.20 Additionally, see the sources for further details and source information. This is only relevant for iupred(). iupredAnchor() and iupredRedox() always utilize "long" for data in the REST API.
Long predictions of disorder (Default)
when iupredType = "long"
Optimized for global predictions of disorder, specifically disordered regions over 30 amino acids in length.
"long" is always used for iupredAnchor() and iupredRedox().
Short predictions of disorder
when iupredType = "short"
Best for predicting small regions of disorder, especially in mostly structured proteins.
Has adjustments for termini, since sequence ends are often disordered.
Structured predictions
when iupredType = "glob"
Used to predict regions of globular folding.
Please see Erdős & Dosztányi (2020) for further information on interpreting these results.
For users who wish to keep a common aesthetic, the following colors are
used when plotResults = TRUE.
iupred() iupredType = 'long', 'short', or 'glob'. Additionally, the 'long' prediction with iupredAnchor().
Dynamic iupred line colors:
Close to 0 = "darkolivegreen3" or "#A2CD5A"
Close to 1 = "darkorchid1" or "#BF3EFF"
Close to 0.5 (midpoint) = "grey65" or "#A6A6A6"
iupredAnchor :
Solid Line (ANCHOR2 Score) = "#92140C"
iupredRedox:
iupredPlus line = "darkorchid1" or "#BF3EFF"
iupredMin line = "#348AA7"
redox sensitive regions = "#5DD39E"
Bálint Mészáros, Gábor Erdős, Zsuzsanna Dosztányi,
IUPred2A: context-dependent prediction of protein disorder as a function of
redox state and protein binding, Nucleic Acids Research, Volume 46, Issue
W1, 2 July 2018, Pages W329–W337, https://doi.org/10.1093/nar/gky384
Erdős, G., & Dosztányi, Z. (2020). Analyzing protein disorder with
IUPred2A. Current Protocols in Bioinformatics, 70, e99.
https://doi.org/10.1002/cpbi.99
#A UniProt Accession must be specified.
##this example uses human P53.
TP53_UniProt <- "P04637"
## Not run:
#Getting data as a data frame
exampleDF_long <- iupred(uniprotAccession = TP53_UniProt,
                         iupredType = "long",
                         plotResults = FALSE)
head(exampleDF_long)
exampleDF_short <- iupred(uniprotAccession = TP53_UniProt,
                          iupredType = "short",
                          plotResults = FALSE)
head(exampleDF_short)
exampleDF_anchor <- iupredAnchor(uniprotAccession = TP53_UniProt,
                                 plotResults = FALSE)
head(exampleDF_anchor)
exampleDF_redox <- iupredRedox(uniprotAccession = TP53_UniProt,
                               plotResults = FALSE)
head(exampleDF_redox)
#Plotting
iupred(uniprotAccession = TP53_UniProt,
       iupredType = "long",
       plotResults = TRUE)
iupred(uniprotAccession = TP53_UniProt,
       iupredType = "short",
       plotResults = TRUE)
iupredAnchor(uniprotAccession = TP53_UniProt,
             plotResults = TRUE)
iupredRedox(uniprotAccession = TP53_UniProt,
            plotResults = TRUE)
## End(Not run)
#A valid internet connection is needed to make
##A connection with the IUPred REST API
A dataset containing a measure of hydropathy for each amino acid residue
as reported by Kyte J. and Doolittle R.F. (1982). Values normalized from
0 to 1.
If you use these values, please cite the source article.
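As a rough illustration of what "normalized from 0 to 1" can mean, the sketch below min-max rescales the published Kyte-Doolittle values. This is an assumption about how such a scale could be derived, not a statement of how KDNorm itself was built; the names kd and kdScaled are hypothetical.

```r
# Kyte & Doolittle (1982) hydropathy values, keyed by one-letter code
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
        L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Min-max rescaling: the most hydrophilic residue maps to 0, the most hydrophobic to 1
kdScaled <- (kd - min(kd)) / (max(kd) - min(kd))

round(kdScaled[c("R", "A", "I")], 3)
```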
KDNorm
a data frame with 20 rows, one for each standard amino acid, and 2 variables
Amino acid residues as a single letter
Scaled Hydropathy of a residue, measured 0-1
Kyte, Jack, and Russell F. Doolittle. "A simple method for displaying the hydropathic character of a protein." Journal of molecular biology 157.1 (1982): 105-132. https://doi.org/10.1016/0022-2836(82)90515-0
Other scaled hydropathy functions:
foldIndexR(), meanScaledHydropathy(), scaledHydropathyGlobal(), scaledHydropathyLocal()
This function utilizes the scaledHydropathyGlobal() function and easily returns the averaged hydropathy as a numeric value.
meanScaledHydropathy(sequence, roundScore = NA)
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
roundScore |
Number of decimals the score will be rounded to. NA by default. |
A numeric value equal to the Mean Scaled Hydropathy.
Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 157(1), 105-132.
KDNorm
for residue values.
Other scaled hydropathy functions:
KDNorm, foldIndexR(), scaledHydropathyGlobal(), scaledHydropathyLocal()
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F",
              "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R",
              "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
##The path to the file as a character string
#Calculate the mean scaled hydropathy
meanScaledHydropathy(aaString)
meanScaledHydropathy(aaVector)
This function will determine the net charge of a peptide using the Henderson-Hasselbalch Equation. The output is a numeric value describing the total net charge or the average net charge.
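The idea of a net charge can be sketched in base R by summing the Henderson-Hasselbalch fractional charge of every ionizable group. The pKa values below are rounded placeholders chosen for illustration, not one of the sets in pKaData, and netChargeSketch is a hypothetical helper. Dividing by sequence length for the averaged value is likewise an assumption of this sketch.

```r
# Placeholder pKa values for illustration only (NOT a pKa set shipped with idpr)
acidic <- c(D = 4, E = 4, C = 8.5, Y = 10, COOH = 3)
basic <- c(H = 6.5, K = 10.5, R = 12.5, NH2 = 9)

# Hypothetical helper: sum of fractional charges over residues and, optionally, termini
netChargeSketch <- function(sequence, pH = 7,
                            includeTermini = TRUE, averaged = FALSE) {
  residues <- strsplit(sequence, "")[[1]]
  groups <- if (includeTermini) c("NH2", residues, "COOH") else residues
  neg <- acidic[groups[groups %in% names(acidic)]]
  pos <- basic[groups[groups %in% names(basic)]]
  # Protonated bases contribute +1, deprotonated acids contribute -1
  total <- sum(1 / (1 + 10^(pH - pos))) - sum(1 / (1 + 10^(neg - pH)))
  # Assumption: the average is taken over the number of residues
  if (averaged) total / length(residues) else total
}

netChargeSketch("ACDEFGHIKLMNPQRSTVWY")
netChargeSketch("ACDEFGHIKLMNPQRSTVWY", pH = 5.5)
netChargeSketch("ACDEFGHIKLMNPQRSTVWY", includeTermini = FALSE)
```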
netCharge( sequence, pKaSet = "IPC_protein", pH = 7, includeTermini = TRUE, averaged = FALSE )
sequence |
amino acid sequence as a character string or vector of individual residues. alternatively, a character string of the path to a .fasta / .fa file |
pKaSet |
A character string or data frame. "IPC_protein" by default. Character string to load specific, preloaded pKa sets. c("EMBOSS", "DTASelect", "Solomons", "Sillero", "Rodwell", "Lehninger", "Toseland", "Thurlkill", "Nozaki", "Dawson", "Bjellqvist", "ProMoST", "Vollhardt", "IPC_protein", "IPC_peptide") Alternatively, the user may supply a custom pKa dataset. The format must be a data frame where: Column 1 must be a character vector of residues named "AA" AND Column 2 must be a numeric vector of pKa values. |
pH |
numeric value, 7.0 by default. The environmental pH used to calculate residue charge. |
includeTermini |
Logical value, TRUE by default. This
determines how the calculation handles the N- and C- terminus.
includeTermini determines if the calculation will use the charge of the
amine and carboxyl groups at the ends of the peptide (When TRUE). These
charges are ignored when |
averaged |
logical value. FALSE by default.
When |
numeric value. Either the net charge or average net charge, depending on the value of the averaged argument
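For the custom pKaSet option described in the arguments, a data frame in the stated two-column layout might be built as follows. The values are placeholders for illustration rather than a published pKa set, the second column name "pKa" is hypothetical (only the first column name "AA" is specified), and including the termini rows as "NH2" and "COOH" is an assumption based on the residue labels used in pKaData.

```r
# Column 1: residues, named "AA"; Column 2: numeric pKa values
# Placeholder values for illustration only
customPKa <- data.frame(
  AA = c("C", "D", "E", "H", "K", "R", "Y", "NH2", "COOH"),
  pKa = c(8.5, 4.0, 4.0, 6.5, 10.5, 12.5, 10.0, 9.0, 3.0),
  stringsAsFactors = FALSE
)

str(customPKa)
```

A data frame like this would then be supplied as pKaSet = customPKa in place of a preloaded set name.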
pKaData
for residue pKa values and citations. See
hendersonHasselbalch
for charge calculations.
Other charge functions:
chargeCalculationGlobal(), chargeCalculationLocal(), hendersonHasselbalch()
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F",
              "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R",
              "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing a character string
# of the path to the file.
#Calculate the Net Charge
netCharge(aaString,
          averaged = FALSE)
netCharge(aaVector,
          averaged = FALSE)
#Calculate the Average Net Charge
netCharge(aaString,
          averaged = TRUE)
netCharge(aaVector,
          averaged = TRUE)
#Change the pH
netCharge(aaString,
          pH = 8)
netCharge(aaString,
          pH = 7)
netCharge(aaString,
          pH = 5.5)
#Specify which pKa set to use
netCharge(aaString,
          pKaSet = "IPC_protein") #Default
netCharge(aaString,
          pKaSet = "IPC_peptide")
netCharge(aaString,
          pKaSet = "Dawson")
netCharge(aaString,
          pKaSet = "EMBOSS")
#Should the termini be included in charge calculations?
netCharge(aaString,
          includeTermini = TRUE) #Default
netCharge(aaString,
          includeTermini = FALSE)
A dataset containing the various pKa accepted values for each charged amino
acid residue. N- and C-terminus values are also included. See "IPC -
Isoelectric Point Calculator" Kozlowski (2016) for information on
variability in pKa Data sets.
https://doi.org/10.1186/s13062-016-0159-9
Citations are also contained in the data frame for convenience.
Please cite the specific pKa set source and/or Kozlowski (2016).
pKaData
a data frame with 10 rows and 16 variables.
Amino acid residues as a single letter.
Residues are:
Cys (C), Asp (D), Glu (E), His (H), Lys (K), Arg (R), and Tyr (Y).
N- and C-termini as NH2 and COOH, respectively. "citation"
in the final row
pKa Dataset from https://doi.org/10.1016/S0168-9525(00)02024-2
pKa Dataset from https://doi.org/10.1021/pr015504q
pKa Dataset from:
https://doi.org/10.1186/s13062-016-0159-9
pKa Dataset original source: ISBN: 978-1-118-87576-6
pKa Dataset from https://doi.org/10.1016/0003-2697(89)90136-X
pKa Dataset from https://doi.org/10.1016/0003-2697(82)90611-X
pKa Dataset from ISBN-13: 978-1-4641-2611-6
pKa Dataset from https://doi.org/10.1093/nar/gkj035
pKa Dataset from https://doi.org/10.1110/ps.051840806
pKa Dataset from:
https://doi.org/10.1110/ps.051840806
pKa Dataset original source:
https://doi.org/10.1016/S0076-6879(67)11088-4
pKa Dataset from:
https://doi.org/10.1186/s13062-016-0159-9
pKa Dataset original source: ISBN: 9780198552994
pKa Dataset from:
https://doi.org/10.1186/s13062-016-0159-9
pKa Dataset original source:
https://doi.org/10.1002/elps.1150150171
pKa Dataset from https://doi.org/10.1093/nar/gkh356
pKa Dataset from https://doi.org/10.1186/s13062-016-0159-9
pKa Dataset from https://doi.org/10.1186/s13062-016-0159-9
pKa Dataset from ISBN-13: 978-1-4641-2027-5
Values for NH2 and COOH are averages of values provided within the
Lehninger, ProMoST, and Vollhardt datasets.
Lehninger and Vollhardt are both the Seventh edition. Lehninger varies
from data presented in the IPC paper.
When values could not be sourced to the original
source, values were taken from Kozlowski (2016),
https://doi.org/10.1186/s13062-016-0159-9. Both Kozlowski (2016) and
the original source DOI (where available) or ISBN are provided within the
Format section of this documentation.
(Dawson, Elliott, Elliott, & Jones, 2002; Halligan et al., 2004; Kozlowski,
2016; Nelson & Cox, 2017; Nozaki & Tanford, 1967; Rice, Longden, & Bleasby,
2000; Rodwell, 1982; Sillero & Ribeiro, 1989; Solomons, 1992; Tabb, McDonald,
& Yates, 2002; Thurlkill, Grimsley, Scholtz, & Pace, 2006; Toseland,
McSparron, Davies, & Flower, 2006; Vollhardt & Schore, 2014)
Dawson, R. M. C., Elliott, D. C., Elliott, W. H., & Jones, K. M. (2002).
Data for biochemical research (Vol. 3): Clarendon Press.
Halligan, B. D., Ruotti, V., Jin, W., Laffoon, S., Twigger, S. N., & Dratz,
E. A. (2004). ProMoST (Protein Modification Screening Tool): a web-based
tool for mapping protein modifications on two-dimensional gels. Nucleic
Acids Research, 32(Web Server issue), W638-W644. doi:10.1093/nar/gkh356
Kozlowski, L. P. (2016). IPC – Isoelectric Point Calculator. Biology Direct,
11(1), 55. doi:10.1186/s13062-016-0159-9
Nelson, D. L., & Cox, M. M. (2017). Lehninger Principles of Biochemistry
(Seventh ed.). New York, NY: W. H. Freeman and Company.
Nozaki, Y., & Tanford, C. (1967). [84] Examination of titration behavior.
In Methods in Enzymology (Vol. 11, pp. 715-734): Academic Press.
Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular
Biology Open Software Suite. Trends in Genetics, 16(6), 276-277.
doi:10.1016/S0168-9525(00)02024-2
Rodwell, J. D. (1982). Heterogeneity of component bands in isoelectric
focusing patterns. Analytical Biochemistry, 119(2), 440-449.
doi:10.1016/0003-2697(82)90611-X
Sillero, A., & Ribeiro, J. M. (1989). Isoelectric points of proteins:
Theoretical determination. Analytical Biochemistry, 179(2), 319-325.
doi:10.1016/0003-2697(89)90136-X
Solomons, T. W. G. (1992). Organic chemistry. USA: John Wiley & Sons.
Tabb, D. L., McDonald, W. H., & Yates, J. R. (2002). DTASelect and Contrast:
Tools for Assembling and Comparing Protein Identifications from Shotgun
Proteomics. Journal of Proteome Research, 1(1), 21-26. doi:10.1021/pr015504q
Thurlkill, R. L., Grimsley, G. R., Scholtz, J. M., & Pace, C. N. (2006).
pK values of the ionizable groups of proteins. Protein science :
a publication of the Protein Society, 15(5), 1214-1218.
doi:10.1110/ps.051840806
Toseland, C. P., McSparron, H., Davies, M. N., & Flower, D. R. (2006). PPD
v1.0—an integrated, web-accessible database of experimentally determined
protein pKa values. Nucleic Acids Research, 34(suppl_1), D199-D203.
doi:10.1093/nar/gkj035
Vollhardt, P., & Schore, N. (2014). Organic Chemistry: Structure and
Function (Seventh ed.). New York, NY: W. H. Freeman and Company.
Kozlowski, L. P. (2016). IPC – Isoelectric Point Calculator. Biology Direct, 11(1), 55. doi:10.1186/s13062-016-0159-9
This is used to calculate the scaled hydropathy of an amino acid sequence for each residue in the sequence. The output is either a data frame or graph showing the matched scores for each residue along the sequence.
scaledHydropathyGlobal(sequence, plotResults = FALSE, proteinName = NA, ...)
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
plotResults |
logical value, FALSE by default.
If |
proteinName |
character string with length = 1. optional setting to include the name in the plot title. |
... |
any additional parameters, especially those for plotting. |
If plotResults = TRUE, a graphical representation of the data.
The average is shown by the horizontal line.
If plotResults = FALSE, a data frame is reported
with each amino acid and each residue value shown.
The score for each residue is shown in the column "Hydropathy".
For users who wish to keep a common aesthetic, the following colors are
used when plotResults = TRUE.
Dynamic line colors:
Close to 0 = "skyblue3" or "#6CA6CD"
Close to 1 = "chocolate1" or "#FF7F24"
Close to midpoint = "grey65" or "#A6A6A6"
Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 157(1), 105-132.
KDNorm
for residue values.
Other scaled hydropathy functions:
KDNorm, foldIndexR(), meanScaledHydropathy(), scaledHydropathyLocal()
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F",
              "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R",
              "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
##The path to the file as a character string
exampleDF <- scaledHydropathyGlobal(aaString,
                                    plotResults = FALSE)
head(exampleDF)
exampleDF <- scaledHydropathyGlobal(aaVector,
                                    plotResults = FALSE)
head(exampleDF)
#plotResults = TRUE will output a ggplot
scaledHydropathyGlobal(aaString,
                       plotResults = TRUE)
#since it is a ggplot, you can change or annotate the plot
gg <- scaledHydropathyGlobal(aaVector,
                             plotResults = TRUE)
gg <- gg + ggplot2::ylab("Local Hydropathy")
gg <- gg + ggplot2::geom_text(data = exampleDF,
                              ggplot2::aes(label = AA,
                                           y = Hydropathy + 0.1))
plot(gg)
This is used to calculate the scaled hydropathy of an amino acid sequence using a sliding window. The output is either a data frame or graph showing the calculated scores for each window along the sequence.
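A sliding-window average of this kind can be sketched in base R with embed(). The helper windowMeans and its output column names are hypothetical, and the sketch assumes that only positions with a complete window are scored, which may differ from how the package treats the sequence ends.

```r
# Hypothetical helper (not part of idpr): mean of each full, centered window
windowMeans <- function(values, window = 9) {
  stopifnot(window %% 2 == 1, window <= length(values))
  # embed() lays out every run of `window` consecutive values as one row
  means <- rowMeans(embed(values, window))
  half <- (window - 1) / 2
  data.frame(Position = seq(half + 1, length(values) - half),
             WindowMean = means)
}

# Made-up per-residue scores on a 0-1 scale
windowMeans(c(0.2, 0.4, 0.9, 0.1, 0.6), window = 3)
```

An odd window keeps the same number of residues on each side of the center position, which is why even values are rejected here.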
scaledHydropathyLocal( sequence, window = 9, plotResults = TRUE, proteinName = NA, ... )
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
window |
a positive, odd integer. 9 by default. Sets the size of sliding window, must be an odd number. The window determines the number of residues to be analyzed and averaged for each position along the sequence. |
plotResults |
logical value, TRUE by default.
If |
proteinName |
character string with length = 1. Optional setting to replace the name of the plot if plotResults = TRUE. |
... |
any additional parameters, especially those for plotting. |
see plotResults argument
For users who wish to keep a common aesthetic, the following colors are
used when plotResults = TRUE.
Dynamic line colors:
Close to 0 = "skyblue3" or "#6CA6CD"
Close to 1 = "chocolate1" or "#FF7F24"
Close to midpoint = "grey65" or "#A6A6A6"
Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 157(1), 105-132.
KDNorm for residue values.
Other scaled hydropathy functions: KDNorm, foldIndexR(), meanScaledHydropathy(), scaledHydropathyGlobal()
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
##The path to the file as a character string.
exampleDF <- scaledHydropathyLocal(aaString, plotResults = FALSE)
head(exampleDF)
exampleDF <- scaledHydropathyLocal(aaVector, plotResults = FALSE)
head(exampleDF)
#Changing window will alter the number of residues analyzed
exampleDF_window3 <- scaledHydropathyLocal(aaString,
                                           window = 3,
                                           plotResults = FALSE)
head(exampleDF_window3)
exampleDF_window15 <- scaledHydropathyLocal(aaString,
                                            window = 15,
                                            plotResults = FALSE)
head(exampleDF_window15)
#plotResults = TRUE will output a ggplot
scaledHydropathyLocal(aaString, plotResults = TRUE)
#since it is a ggplot, you can change or annotate the plot
gg <- scaledHydropathyLocal(aaVector, window = 3, plotResults = TRUE)
gg <- gg + ggplot2::ylab("Local Hydropathy")
gg <- gg + ggplot2::geom_text(data = exampleDF_window3,
                              ggplot2::aes(label = CenterResidue,
                                           y = WindowHydropathy + 0.1))
plot(gg)
This is used to validate a sequence of amino acids. It can also be used to load an amino acid sequence or to coerce a sequence into a specific format.
sequenceCheck( sequence, method = "stop", outputType = "string", nonstandardResidues = NA, suppressAAWarning = FALSE, suppressOutputMessage = FALSE )
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
method |
Required Setting.
|
outputType |
Required Setting. "string" By default.
|
nonstandardResidues |
Optional setting.
Expands the amino acid alphabet.
NA or Character vector required.
Default values are "ACDEFGHIKLMNPQRSTVWY". Additional letters added here.
|
suppressAAWarning |
If using nonstandardResidues,
a warning will be issued.
set |
suppressOutputMessage |
Set |
A message and sequence are returned.
If suppressOutputMessage = TRUE, the message is not returned.
If outputType = "none", the sequence is not returned.
Otherwise, outputType will determine the format of the returned sequence.
If the sequence contains an error, it will be reported
based on the value of method.
The Sequence will be assigned to the value "Sequence" if sequenceName
is not specified. Otherwise the sequence is assigned to the value of
sequenceName. This allows the sequences to be called by the user.
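The core of the validation, checking each residue against the standard alphabet plus any letters supplied through nonstandardResidues, can be sketched in base R. The helper find_invalid_residues is hypothetical and only reports offending letters; it does not reproduce the file loading, the method argument, or the output coercion of sequenceCheck.

```r
# Standard amino acid alphabet named in the nonstandardResidues description
standard_aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical helper: return the unique residues outside the allowed alphabet
find_invalid_residues <- function(sequence, nonstandardResidues = NA) {
  # Accept either a single string or a vector of single characters
  residues <- if (length(sequence) == 1) {
    strsplit(sequence, "")[[1]]
  } else {
    sequence
  }
  alphabet <- standard_aa
  if (!all(is.na(nonstandardResidues))) {
    alphabet <- c(alphabet, nonstandardResidues)
  }
  unique(residues[!residues %in% alphabet])
}

# "U" is reported unless the alphabet is expanded
find_invalid_residues("ACDU")
find_invalid_residues("ACDU", nonstandardResidues = c("O", "U"))
```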
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
##The path to the file as a character string
## Not run:
sequenceCheck(aaString)
sequenceCheck(aaVector)
#To allow O and U
sequenceCheck(aaString,
              nonstandardResidues = c("O", "U"),
              suppressAAWarning = TRUE)
#To turn off output message
sequenceCheck(aaString, suppressOutputMessage = TRUE)
#To change string to be a vector
sequenceCheck(aaString, outputType = "vector")
#To not return a sequence but check the input
sequenceCheck(aaVector, outputType = "none")
## End(Not run)
This is a graphical function used to visualize data along an
amino acid sequence.
The purpose of this function is to show the entire sequence and color
residues based on properties. This may help identify important residues
along a protein. This was designed with the goal of visualizing discrete
values, but has since been expanded to visualize numeric/continuous values.
sequenceMap( sequence, property, nbResidues = 30, labelType = "both", everyN = c(1, 10), labelLocation = c("on", "below"), rotationAngle = c(0, 0), customColors = NA )
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
property |
a vector with length equal to sequence length. This is what is visualized on the function. Can be discrete or continuous values. |
nbResidues |
numeric value, 30 by default. The number of residues to display on each row of the plot. It is not recommended to be over 50 or under 10 for standard sequences. Optimal value may vary between sequences of extreme lengths. |
labelType |
character string, "both" by default.
accepted values are |
everyN |
numeric value or vector of numeric values with length = 2.
This is used to show every Nth amino acid and/or residue number.
To show every value, set |
labelLocation |
character string or vector of character strings
with length = 2. When |
rotationAngle |
numeric value or vector of numeric values with length = 2. This value is used to rotate text. Especially useful when printing many residue numbers. |
customColors |
vector of colors as character strings. NA by default. Used to support custom plot colors. If property is a discrete scale, a character vector of colors with length = number of unique discrete observations is required. If property is a continuous scale, a character vector of the colors for c("highColor","lowColor","midColor"). Set NA to skip custom colors. |
A ggplot.
sequenceMapCoordinates for mapping coordinates
#Get a data frame returned from another function
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
## As a continuous property
exampleDF_cont <- chargeCalculationGlobal(sequence = aaVector)
head(exampleDF_cont)
## Or as a discrete property
exampleDF_disc <- structuralTendency(sequence = aaVector)
head(exampleDF_disc)
sequenceMap(sequence = exampleDF_cont$AA,
            property = exampleDF_cont$Charge,
            nbResidues = 3,
            labelType = "both")
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            nbResidues = 3,
            labelType = "both")
#Change the layout of labels
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            nbResidues = 3,
            labelType = "AA") #Only AA residue labels
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            nbResidues = 3,
            labelType = "number") #Only residue number labels
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            nbResidues = 3,
            labelType = "none") #No labels
#The text can also be rotated for ease of reading,
## especially helpful for larger sequences.
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            labelType = "number",
            labelLocation = "on",
            rotationAngle = 90)
#Specify colors for continuous values
sequenceMap(sequence = exampleDF_cont$AA,
            property = exampleDF_cont$Charge,
            customColors = c("purple", "pink", "grey90"))
#or discrete values
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            customColors = c("#999999", "#E69F00", "#56B4E9"))
#change the number of residues on each line with nbResidues
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            nbResidues = 1)
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            nbResidues = 3)
sequenceMap(sequence = exampleDF_disc$AA,
            property = exampleDF_disc$Tendency,
            nbResidues = 10)
#Use sequenceMapCoordinates for additional annotations
gg <- sequenceMap(sequence = exampleDF_disc$AA,
                  property = exampleDF_disc$Tendency,
                  nbResidues = 3,
                  labelType = "both")
#Change the nbResidues to correspond to the sequenceMap setting
mapCoordDF <- sequenceMapCoordinates(aaVector, nbResidues = 3)
head(mapCoordDF)
#subsetting for positive residues
mapCoordDF_subset <- mapCoordDF$AA %in% c("K", "R", "H")
mapCoordDF_subset <- mapCoordDF[mapCoordDF_subset,]
library(ggplot2)
gg <- gg + geom_point(inherit.aes = FALSE,
                      data = mapCoordDF_subset,
                      aes(x = col + 0.5, #to center on the residue
                          y = row + 0.2), #to move above on the residue
                      color = "purple",
                      size = 3,
                      shape = 3)
plot(gg)
This is a function used to create a coordinate grid for the
sequenceMap function. It is based on the length of the
sequence being mapped, and how many residues per line are specified.
The function wraps the sequence to have a number of rows that is
the sequence length / number of residues per row, rounded up.
This is intended for use within the sequenceMap function, however, this
can also be used to identify the coordinates of residues within the ggplot
coordinate plane for additional annotations.
sequenceMapCoordinates(sequence, nbResidues = 30)
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
nbResidues |
numeric value, 30 by default. The number of residues to display on each row of the plot. It is not recommended to be over 50 or under 10 for standard sequences. Optimal value may vary between sequences of extreme lengths. |
A data frame with rows containing the amino acid sequence, residue position within the sequence, as well as the row and column of each residue within the ggplot output of sequenceMap().
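The wrapping arithmetic behind such a grid can be sketched in base R. The helper map_coordinates is hypothetical; the column names AA, row, and col follow the examples for this function, while Position and the top-down row numbering starting at 1 are assumptions (the package may orient or offset rows differently for plotting).

```r
# Hypothetical helper: wrap residue positions onto a grid with
# nbResidues columns per row
map_coordinates <- function(sequence, nbResidues = 30) {
  # Accept either a single string or a vector of single characters
  residues <- if (length(sequence) == 1) {
    strsplit(sequence, "")[[1]]
  } else {
    sequence
  }
  position <- seq_along(residues)
  data.frame(
    AA = residues,
    Position = position,
    row = ceiling(position / nbResidues),   # which line the residue falls on
    col = (position - 1) %% nbResidues + 1  # place within that line
  )
}

coords <- map_coordinates("ACDEFGHIKLMNPQRSTVWY", nbResidues = 3)
head(coords)
# Number of rows used: the sequence length / nbResidues, rounded up
max(coords$row)
```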
sequenceMapCoordinates for mapping coordinates
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
##The path to the file as a character string
exampleDF <- sequenceMapCoordinates(aaString, nbResidues = 10)
head(exampleDF)
exampleDF <- sequenceMapCoordinates(aaVector, nbResidues = 10)
head(exampleDF)
#Getting a data frame for plotting with sequenceMap
tendencyDF <- structuralTendency(sequence = aaVector)
#Making a sequenceMap ggplot to annotate
gg <- sequenceMap(sequence = tendencyDF$AA,
                  property = tendencyDF$Tendency,
                  nbResidues = 3,
                  labelType = "both")
#Change the nbResidues to correspond to the sequenceMap setting
mapCoordDF <- sequenceMapCoordinates(aaVector, nbResidues = 3)
head(mapCoordDF)
#subsetting for positive residues
mapCoordDF_subset <- mapCoordDF$AA %in% c("K", "R", "H")
mapCoordDF_subset <- mapCoordDF[mapCoordDF_subset,]
#use mapCoordDF to annotate positive residues with a plus
library(ggplot2)
gg <- gg + geom_point(inherit.aes = FALSE,
                      data = mapCoordDF_subset,
                      aes(x = col + 0.5, #to center on the residue
                          y = row + 0.2), #to move above on the residue
                      color = "purple",
                      size = 3,
                      shape = 3)
plot(gg)
This is a graphical function used to visualize numeric data along an amino acid sequence.
sequencePlot( position, property, hline = NA, propertyLimits = NA, dynamicColor = NA, customColors = NA, midpoint = hline, customTitle = NA )
position |
numeric vector of residue positions. Typically c(1 : sequenceLength), i.e., a sequence with 215 amino acids has a vector of values 1 to 215. This is the X axis. |
property |
vector of values, typically numeric. Equal in length to position. This is the Y axis. |
hline , propertyLimits
|
optional, numeric values or numeric vectors. Prints horizontal lines. Set to NA to skip (default). *hline* specifies the location for a dashed, grey line to be printed underneath the plot's data line. Good for separating cutoff values. *propertyLimits* specifies the location for a solid, black line to be printed. Good for showing maximum and minimum values. |
dynamicColor |
optional vector. Typically numeric. Equal in length to position. Can be used to set colors based on values. Can be categorical (discrete) or continuous. Set to NA to skip (default). |
customColors |
optional vector of colors as character strings.
Used to support custom plot colors. If property is a discrete scale, a
character vector of colors with length = number of unique discrete
observations is required. If property is a continuous scale, a character
vector of the colors for c("highColor","lowColor","midColor").
Set NA to skip custom colors (default). Ignored if
|
midpoint |
needed for proper scales of customColors. The default value is equal to hline (if provided). If there is no hline, the average of propertyLimits is the midpoint (if provided). If neither is provided, the value will be NA. The user can explicitly assign the midpoint to avoid this or to overwrite the defaults. |
customTitle |
optional, character string. Allows adding custom title. Set to NA to skip (default). |
a ggplot
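The fallback order described for the midpoint argument (hline first, then the average of propertyLimits, otherwise NA) can be written out as a small base R sketch. The function resolve_midpoint is hypothetical and exists only to make that order explicit; it is not taken from the package source.

```r
# Hypothetical helper mirroring the documented midpoint defaults
resolve_midpoint <- function(hline = NA, propertyLimits = NA,
                             midpoint = hline) {
  # An explicit midpoint, or the hline default, wins when available
  if (!all(is.na(midpoint))) {
    return(midpoint)
  }
  # Otherwise fall back to the average of the property limits
  if (!all(is.na(propertyLimits))) {
    return(mean(propertyLimits))
  }
  NA
}

resolve_midpoint(hline = 0)                      # uses hline
resolve_midpoint(propertyLimits = c(0, 1))       # averages the limits
resolve_midpoint()                               # nothing provided, NA
resolve_midpoint(hline = 0, midpoint = 0.25)     # explicit override
```

Setting midpoint explicitly, as in the last call, is the way to avoid an NA midpoint when neither hline nor propertyLimits is supplied.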
#Get a data frame returned from another function
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
exampleDF <- chargeCalculationGlobal(sequence = aaVector)
head(exampleDF)
#Making a sequence plot
sequencePlot(position = exampleDF$Position,
             property = exampleDF$Charge)
#Change the horizontal lines
sequencePlot(position = exampleDF$Position,
             property = exampleDF$Charge,
             hline = 0.0,
             propertyLimits = c(-1.0, 1.0))
#Adding dynamic colors based on the property values
sequencePlot(position = exampleDF$Position,
             property = exampleDF$Charge,
             hline = 0.0,
             propertyLimits = c(-1.0, 1.0),
             dynamicColor = exampleDF$Charge,
             customColors = c("red", "blue", "grey50"),
             customTitle = "Charge of Each Residue / Terminus")
Each amino acid residue has a tendency to impact the order / disorder of the amino acid sequence. Some residues are disorder promoting, meaning they tend to favor disorder over ordered structures. These are typically hydrophilic, charged, or small residues. Order promoting residues tend to be aliphatic, hydrophobic, aromatic, or form tertiary structures. Disorder neutral residues neither favor order nor disordered structures.
structuralTendency( sequence, disorderPromoting = c("P", "E", "S", "Q", "K", "A", "G"), disorderNeutral = c("D", "T", "R"), orderPromoting = c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C"), printCitation = FALSE )
sequence |
amino acid sequence as a single character string, a vector of single characters, or an AAString object. It also supports a single character string that specifies the path to a .fasta or .fa file. |
disorderPromoting , disorderNeutral , orderPromoting
|
character vectors of individual residues to be matched with the input sequence. Defaults:
It is not recommended to change these. These definitions are from Uversky (2013). |
printCitation |
logical, FALSE by default.
When |
a data frame containing each residue from the sequence
matched with its structural tendency, defined by disorderPromoting,
disorderNeutral, and orderPromoting.
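The matching step can be illustrated with a short base R sketch that uses the default residue sets from the usage above. The helper classify_tendency is hypothetical; the AA and Tendency column names follow the examples, while Position and the exact label strings are assumptions based on the names used elsewhere in this documentation.

```r
# Hypothetical helper: label each residue by its structural tendency
classify_tendency <- function(
    sequence,
    disorderPromoting = c("P", "E", "S", "Q", "K", "A", "G"),
    disorderNeutral = c("D", "T", "R"),
    orderPromoting = c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C")) {
  # Accept either a single string or a vector of single characters
  residues <- if (length(sequence) == 1) {
    strsplit(sequence, "")[[1]]
  } else {
    sequence
  }
  # Residues outside all three sets stay NA
  tendency <- rep(NA_character_, length(residues))
  tendency[residues %in% disorderPromoting] <- "Disorder Promoting"
  tendency[residues %in% disorderNeutral] <- "Disorder Neutral"
  tendency[residues %in% orderPromoting] <- "Order Promoting"
  data.frame(
    Position = seq_along(residues),
    AA = residues,
    Tendency = tendency
  )
}

tendencyDF <- classify_tendency("ACDEFGHIKLMNPQRSTVWY")
head(tendencyDF)
# Counts per class for a sequence containing each standard residue once
table(tendencyDF$Tendency)
```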
For convenient plotting see structuralTendencyPlot.
Uversky, V. N. (2013). A decade and a half of protein intrinsic disorder:
Biology still waits for physics. Protein Science, 22(6), 693-724.
https://doi.org/10.1002/pro.2261.
Kulkarni, Prakash, and Vladimir N. Uversky. "Intrinsically
disordered proteins: the dark horse of the dark proteome."
Proteomics 18.21-22 (2018): 1800061.
https://doi.org/10.1002/pmic.201800061.
Other structural tendency: structuralTendencyPlot()
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
##The path to the file as a character string
exampleDF <- structuralTendency(aaString)
head(exampleDF)
exampleDF <- structuralTendency(aaVector)
head(exampleDF)
#This example shows if a user changes the default definition of residues.
##These residues are labeled as such from Dunker et al (2001),
##"Intrinsically disordered protein."
exampleDF <- structuralTendency(aaString,
    disorderPromoting = c("A", "R", "G", "Q", "S", "P", "E", "K"),
    disorderNeutral = c("H", "M", "T", "D"),
    orderPromoting = c("W", "C", "F", "I", "Y", "V", "L", "N"))
head(exampleDF)
Convenient graphing for the structuralTendency function.
structuralTendencyPlot( sequence, graphType = "pie", summarize = FALSE, proteinName = NA, alphabetical = FALSE, disorderPromoting = c("P", "E", "S", "Q", "K", "A", "G"), disorderNeutral = c("D", "T", "R"), orderPromoting = c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C"), ... )
sequence |
amino acid sequence (or path to a fasta file) as a character string. Supports multiple sequences / files, as a character vector of strings. |
graphType |
character string, required.
graphType must be set to c("pie", "bar", "none").
When |
summarize |
logical value, FALSE by default.
When |
proteinName |
optional character string. NA by default. Used to add the name of the protein to the plot title. |
alphabetical |
logical value, FALSE by default.
Order of residues on plot axis. Only relevant when
|
disorderPromoting , disorderNeutral , orderPromoting
|
character vectors of individual residues to be matched with the input sequence. Defaults:
It is not recommended to change these. |
... |
additional arguments to be passed to
|
a data frame containing each residue from the sequence matched with its structural tendency, defined by disorderPromoting, disorderNeutral, and orderPromoting.
For users who wish to keep a common aesthetic, the following colors are
used when graphType = "bar" or "pie"
Disorder Neutral = "#F0B5B3"
Disorder Promoting = "darkolivegreen3" or "#A2CD5A"
Order Promoting = "darkorchid1" or "#BF3EFF"
Uversky, V. N. (2013). A decade and a half of protein intrinsic disorder:
Biology still waits for physics. Protein Science, 22(6), 693-724.
https://doi.org/10.1002/pro.2261.
Kulkarni, Prakash, and Vladimir N. Uversky. "Intrinsically
disordered proteins: the dark horse of the dark proteome."
Proteomics 18.21-22 (2018): 1800061.
https://doi.org/10.1002/pmic.201800061.
Other structural tendency: structuralTendency()
#Amino acid sequences can be character strings
aaString <- "ACDEFGHIKLMNPQRSTVWY"
#Amino acid sequences can also be character vectors
aaVector <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
              "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
#Alternatively, .fasta files can also be used by providing
##The path to the file as a character string
structuralTendencyPlot(aaString)
structuralTendencyPlot(aaVector)
#The plot can be a pie chart (default)
structuralTendencyPlot(aaString, graphType = "pie")
#Or the plot can be a bar graph
structuralTendencyPlot(aaString, graphType = "bar")
#To display general tendency rather than residues, set summarize = T
structuralTendencyPlot(aaString, graphType = "pie", summarize = TRUE)
structuralTendencyPlot(aaString, graphType = "bar", summarize = TRUE)
#If you wish to export this as a dataframe, set graphType = "none"
exampleDF <- structuralTendencyPlot(aaString, graphType = "none")
head(exampleDF)
#If using a different definition of disordered residues
##These residues are labeled as such from Dunker et al (2001),
##"Intrinsically disordered protein."
structuralTendencyPlot(aaString,
    disorderPromoting = c("A", "R", "G", "Q", "S", "P", "E", "K"),
    disorderNeutral = c("H", "M", "T", "D"),
    orderPromoting = c("W", "C", "F", "I", "Y", "V", "L", "N"),
    graphType = "bar",
    alphabetical = TRUE)
This is a vector of sequences as character strings. It contains the amino acid sequence of Human Cellular tumor antigen p53 (UniProt ID: P04637) and the sequences of several homologs. The sequences in TP53Sequences were selected due to their highly similar identity on UniProt (The UniProt Consortium, 2019).
TP53Sequences GorillaTP53
An object of class character of length 9.
An object of class character of length 1.
P02340
P04637
P10361
Q29537
Q00366
O09185
Q9TTA1
Q95330
A0A2I2Y7Z8*
* The Gorilla p53 sequence is not within the TP53Sequences vector, and is its own object named GorillaTP53. This is because the Gorilla p53 is unreviewed on UniProt, so we chose to exclude it from the list of otherwise SwissProt sequences.
UniProt Consortium. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515. https://doi.org/10.1093/nar/gky1049