Title: | Host-Pathogen Interaction Prediction |
---|---|
Description: | HPiP (Host-Pathogen Interaction Prediction) uses an ensemble learning algorithm to predict host-pathogen protein-protein interactions (HP-PPIs) from structural and physicochemical descriptors computed from the amino acid composition of host and pathogen proteins. The package can help address data shortages and data unavailability in HP-PPI network reconstruction. Moreover, establishing computational frameworks of this kind can reveal mechanistic insights into infectious diseases and suggest potential HP-PPI targets, thus narrowing down the range of candidates for subsequent wet-lab experimental validation. |
Authors: | Matineh Rahmatbakhsh [aut, trl, cre], Mohan Babu [led] |
Maintainer: | Matineh Rahmatbakhsh <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.13.0 |
Built: | 2024-12-29 05:43:51 UTC |
Source: | https://github.com/bioc/HPiP |
This function calculates Amino Acid Composition (AAC) descriptor for the data input.
calculateAAC(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateAAC
A length 20 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Dey, L., Chakraborty, S., and Mukhopadhyay, A. (2020). Machine learning techniques for sequence-based prediction of viral–host interactions between SARS-CoV-2 and human proteins. Biomed. J. 43, 438–450.
See calculateDC and calculateTC for Dipeptide Composition and Tripeptide Composition descriptors.
data(UP000464024_df)
x_df <- calculateAAC(UP000464024_df)
head(x_df, n = 2L)
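To illustrate what an amino acid composition descriptor contains, the following base-R sketch computes residue fractions for a single sequence. The helper aac_sketch and the toy sequence are hypothetical and not part of HPiP; calculateAAC may differ in input handling and output layout.

```r
# Hypothetical helper (not an HPiP function): fraction of each of the
# 20 standard residues in one protein sequence.
aac_sketch <- function(sequence) {
  amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  residues <- strsplit(toupper(sequence), "")[[1]]
  # factor() with fixed levels keeps a zero count for absent residues
  counts <- table(factor(residues, levels = amino_acids))
  setNames(as.numeric(counts) / length(residues), amino_acids)
}

aac <- aac_sketch("MKVLAAGA")  # toy sequence, for illustration only
length(aac)                    # one value per standard amino acid
aac[["A"]]                     # share of alanine in the toy sequence
```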
This function calculates autocorrelation descriptors:
moran - Moran autocorrelation (Dim: length(target.props) * nlag).
geary - Geary autocorrelation (Dim: length(target.props) * nlag).
moreaubroto - Moreau-Broto autocorrelation (Dim: length(target.props) * nlag).
calculateAutocor( x, target.props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, type = c("moran", "geary", "moreaubroto") )
x |
A data.frame containing gene/protein names and their fasta sequences. |
target.props |
A character vector specifying the accession numbers of the target properties. Eight properties are used by default: CIDH920105, BHAR880101, CHAM820101, CHAM820102, CHOC760101, BIGC670101, CHAM810101, and DAYM780201. |
nlag |
Maximum value of the lag parameter. Default is 30. |
type |
The autocorrelation type: moran, geary, or moreaubroto. |
calculateAutocor
A length nlag named vector for data input.
Matineh Rahmatbakhsh <[email protected]>, Nan Xiao
AAindex: Amino acid index database. http://www.genome.ad.jp/dbget/aaindex.html
Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.
Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.
Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.
data(UP000464024_df)
x_df <- calculateAutocor(UP000464024_df, type = 'moran')
head(x_df, n = 2L)
This function transforms each residue in a peptide into 20 coding values.
calculateBE(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateBE
A length 400 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Al-Barakati, H. J., Saigo, H., and Newman, R. H. (2019). RF-GlutarySite: a random forest based predictor for glutarylation sites. Mol. Omi. 15, 189–204.
This function calculates Composition (C) descriptor for data input.
calculateCTDC(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateCTDC
A length 21 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Dubchak, I., Muchnik, I., Holbrook, S. R., and Kim, S.-H. (1995). Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. 92, 8700–8704.
See calculateCTDT and calculateCTDD for Transition and Distribution descriptors.
data(UP000464024_df)
x_df <- calculateCTDC(UP000464024_df)
head(x_df, n = 2L)
This function calculates Distribution (D) descriptor for data input.
calculateCTDD(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateCTDD
A length 105 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Dubchak, I., Muchnik, I., Holbrook, S. R., and Kim, S.-H. (1995). Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. 92, 8700–8704.
See calculateCTDC and calculateCTDT for Composition and Transition descriptors.
data(UP000464024_df)
x_df <- calculateCTDD(UP000464024_df)
head(x_df, n = 1L)
This function calculates Transition (T) descriptor for data input.
calculateCTDT(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateCTDT
A length 21 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Dubchak, I., Muchnik, I., Holbrook, S. R., and Kim, S.-H. (1995). Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. 92, 8700–8704.
See calculateCTDC and calculateCTDD for Composition and Distribution descriptors.
data(UP000464024_df)
x_df <- calculateCTDT(UP000464024_df)
head(x_df, n = 2L)
This function calculates Conjoint Triad descriptor for data input.
calculateCTriad(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateCTriad
A length 343 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., et al. (2007). Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 104, 4337–4341.
data(UP000464024_df)
x_df <- calculateCTriad(UP000464024_df)
head(x_df, n = 2L)
This function calculates Dipeptide Composition (DC) descriptor for data input.
calculateDC(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateDC
A length 400 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Bhasin, M., and Raghava, G. P. S. (2004). Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem. 279, 23262–23266.
See calculateAAC and calculateTC for Amino Acid Composition and Tripeptide Composition descriptors.
data(UP000464024_df)
x_df <- calculateDC(UP000464024_df)
head(x_df, n = 2L)
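The 400-element dipeptide composition can be pictured as counting overlapping residue pairs. The base-R sketch below assumes the counts are normalized by the number of pairs in the sequence; dc_sketch and the toy sequence are hypothetical, and calculateDC may normalize or order its output differently.

```r
# Hypothetical helper (not an HPiP function): fraction of each of the
# 400 ordered residue pairs among all overlapping pairs in a sequence.
dc_sketch <- function(sequence) {
  amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  # All 400 ordered pairs of standard residues
  dipeptides <- as.vector(outer(amino_acids, amino_acids, paste0))
  residues <- strsplit(toupper(sequence), "")[[1]]
  n <- length(residues)
  stopifnot(n >= 2)
  # Overlapping windows of width 2: residue i paired with residue i + 1
  observed <- paste0(residues[-n], residues[-1])
  counts <- table(factor(observed, levels = dipeptides))
  # Assumed normalization: divide by the number of windows (n - 1)
  setNames(as.numeric(counts) / (n - 1), dipeptides)
}

dc <- dc_sketch("MKVLAAGA")  # toy sequence, for illustration only
dc[dc > 0]                   # only the pairs that actually occur
```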
This function calculates F1 or F2 descriptors:
F1 - sum of squared lengths of Single Amino Acid Repeats (SARs) in the entire protein sequence.
F2 - maximum of the sum of Single Amino Acid Repeats (SARs) in a window of 6 residues.
calculateF(x, type = c("F1", "F2"))
x |
A data.frame containing gene/protein names and their fasta sequences. |
type |
The descriptor type: F1 or F2. |
calculateF
A length 20 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Alguwaizani, S., Park, B., Zhou, X., Huang, D.-S., and Han, K. (2018). Predicting interactions between virus and host proteins using repeat patterns and composition of amino acids. J. Healthc. Eng. 2018.
data(UP000464024_df)
x_df <- calculateF(UP000464024_df, type = "F1")
head(x_df, n = 2L)
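The F1 definition above can be sketched with run-length encoding. This base-R sketch assumes that every maximal run of identical residues counts as a repeat, including runs of length one; calculateF and Alguwaizani et al. (2018) may apply a different minimum run length. The helper f1_sketch and the toy sequence are hypothetical.

```r
# Hypothetical helper (not an HPiP function): for each standard amino acid,
# the sum of squared lengths of its runs of consecutive identical residues.
f1_sketch <- function(sequence) {
  amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  residues <- strsplit(toupper(sequence), "")[[1]]
  runs <- rle(residues)  # maximal runs: $values (residue), $lengths (run size)
  # Assumption: runs of length 1 are counted as repeats too
  vapply(amino_acids,
         function(a) sum(runs$lengths[runs$values == a]^2),
         numeric(1))
}

f1 <- f1_sketch("AAAKKA")  # runs: AAA (3), KK (2), A (1)
f1[c("A", "K")]
```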
This function calculates the k-spaced Amino Acid Pairs (KSAAP) descriptor for data input. It is adapted from the CkSAApair function in the ftrCOOL package.
calculateKSAAP(x, spc = 3)
x |
A data.frame containing gene/protein names and their fasta sequences. |
spc |
The number of positions separating the two residues of a pair, which can be any number up to two less than the length of the peptide; defaults to 3. |
calculateKSAAP
A length 400 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Kao, H.-J., Nguyen, V.-N., Huang, K.-Y., Chang, W.-C., and Lee, T.-Y. (2020). SuccSite: incorporating amino acid composition and informative k-spaced amino acid pairs to identify protein succinylation sites. Genomics Proteomics Bioinformatics 18, 208–219.
data(UP000464024_df)
x_df <- calculateKSAAP(UP000464024_df)
head(x_df, n = 2L)
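The idea of k-spaced pairs can be sketched in base R by pairing each residue with the residue spc + 1 positions downstream. The helper ksaap_sketch, the toy sequence, and the normalization by the number of pairs are assumptions for illustration; calculateKSAAP and ftrCOOL's CkSAApair may differ in these details.

```r
# Hypothetical helper (not an HPiP function): composition of residue pairs
# separated by exactly `spc` intervening positions.
ksaap_sketch <- function(sequence, spc = 3) {
  amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
  pair_names <- as.vector(outer(amino_acids, amino_acids, paste0))
  residues <- strsplit(toupper(sequence), "")[[1]]
  n <- length(residues)
  offset <- spc + 1  # distance between the two residues of a pair
  stopifnot(n > offset)
  first <- residues[seq_len(n - offset)]
  second <- residues[seq_len(n - offset) + offset]
  counts <- table(factor(paste0(first, second), levels = pair_names))
  # Assumed normalization: divide by the number of pairs formed
  setNames(as.numeric(counts) / (n - offset), pair_names)
}

ks <- ksaap_sketch("ACDEFACDEF", spc = 3)  # toy sequence
ks[ks > 0]
```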
This function calculates Quadruples Composition (QC) descriptor from biochemical similarity classes.
calculateQD_Sm(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateQD_Sm
A length 1296 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Ahmed, I., Witbooi, P., and Christoffels, A. (2018). Prediction of human-Bacillus anthracis protein–protein interactions using multi-layer neural network. Bioinformatics 34, 4159–4164.
data(UP000464024_df)
x_df <- calculateQD_Sm(UP000464024_df)
head(x_df, n = 2L)
This function calculates Tripeptide Composition (TC) descriptor for data input.
calculateTC(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateTC
A length 8,000 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Liao, B., Jiang, J.-B., Zeng, Q.-G., and Zhu, W. (2011). Predicting apoptosis protein subcellular location with PseAAC by incorporating tripeptide composition. Protein Pept. Lett. 18, 1086–1092.
See calculateAAC, calculateDC and calculateTC_Sm for Amino Acid Composition, Dipeptide Composition and Tripeptide Composition (TC) Descriptor from Biochemical Similarity Classes.
data(UP000464024_df)
x_df <- calculateTC(UP000464024_df)
head(x_df, n = 2L)
This function calculates Tripeptide Composition (TC) descriptor from biochemical similarity classes.
calculateTC_Sm(x)
x |
A data.frame containing gene/protein names and their fasta sequences. |
calculateTC_Sm
A length 216 named vector for the data input.
Matineh Rahmatbakhsh, [email protected]
Ahmed, I., Witbooi, P., and Christoffels, A. (2018). Prediction of human-Bacillus anthracis protein–protein interactions using multi-layer neural network. Bioinformatics 34, 4159–4164.
Cui, G., Fang, C., and Han, K. (2012). Prediction of protein-protein interactions between viruses and human by an SVM model. BMC bioinformatics, 1–10.
See calculateTC for Tripeptide Composition descriptor.
data(UP000464024_df)
x_df <- calculateTC_Sm(UP000464024_df)
head(x_df, n = 2L)
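A length of 216 corresponds to 6 x 6 x 6 class triplets. The base-R sketch below assumes six residue classes ({IVLM}, {FYW}, {HKR}, {DE}, {QNTP}, {ACGS}); this grouping, the numeric class labels, the helper tc_sm_sketch, and the normalization are illustrative assumptions, and the classes used by calculateTC_Sm may differ.

```r
# Hypothetical helper (not an HPiP function): tripeptide composition after
# mapping each residue to one of six assumed similarity classes.
tc_sm_sketch <- function(sequence) {
  # Assumed grouping; the labels 1-6 are arbitrary
  classes <- c(I = 1, V = 1, L = 1, M = 1,
               F = 2, Y = 2, W = 2,
               H = 3, K = 3, R = 3,
               D = 4, E = 4,
               Q = 5, N = 5, T = 5, P = 5,
               A = 6, C = 6, G = 6, S = 6)
  residues <- strsplit(toupper(sequence), "")[[1]]
  codes <- unname(classes[residues])
  n <- length(codes)
  stopifnot(n >= 3, !anyNA(codes))
  # Overlapping windows of three consecutive class labels
  triads <- paste0(codes[1:(n - 2)], codes[2:(n - 1)], codes[3:n])
  grid <- expand.grid(a = 1:6, b = 1:6, c = 1:6)
  triad_levels <- paste0(grid$a, grid$b, grid$c)  # 216 possible triplets
  counts <- table(factor(triads, levels = triad_levels))
  setNames(as.numeric(counts) / (n - 2), triad_levels)
}

tc_sm <- tc_sm_sketch("MKVLAAGA")  # toy sequence
tc_sm[tc_sm > 0]
```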
A graphical display of a correlation matrix.
corr_plot(cormat, method = "number", cex = 0.9)
cormat |
A correlation matrix. |
method |
The visualization method of correlation matrix;
defaults to number.
See |
cex |
The size of x/y axis label. |
corr_plot
A correlation plot.
Matineh Rahmatbakhsh, [email protected].
data('example_data')
x <- na.omit(example_data)
# perform feature selection
s <- FSmethod(x, type = 'both',
  cor.cutoff = 0.7, resampling.method = "repeatedcv",
  iter = 5, repeats = 3, metric = "ROC", verbose = TRUE)
corr_plot(s$cor.result$corProfile, method = 'square', cex = 0.5)
Input data for enrichplot
data(enrich.df)
To construct this dataset, predicted interactions generated by pred_ensembel were used as data input for enrichplot.
This function uses the gost function in the gprofiler2 package to perform functional enrichment analysis for predicted modules.
enrichfind_cpx( predcpx, threshold = 0.05, sources = c("GO", "KEGG"), p.corrction.method = "bonferroni", org = "hsapiens" )
predcpx |
Predicted modules resulting from
|
threshold |
Custom p-value threshold for significance. |
sources |
A vector of data sources to use.
See |
p.corrction.method |
The algorithm used for multiple testing correction; defaults to 'bonferroni'.
See |
org |
An organism name; defaults to 'hsapiens'.
See |
enrichfind_cpx
A data.frame with the enrichment analysis results.
Matineh Rahmatbakhsh, [email protected]
This function uses the gost function in the gprofiler2 package to perform functional enrichment analysis for all predicted host proteins in the high-confidence network.
enrichfind_hp( ppi, threshold = 0.05, sources = c("GO", "KEGG"), p.corrction.method = "bonferroni", org = "hsapiens" )
ppi |
A data.frame containing pathogen proteins in the first column and host proteins in the second column. |
threshold |
Custom p-value threshold for significance. |
sources |
A vector of data sources to use.
See |
p.corrction.method |
The algorithm used for multiple testing correction; defaults to 'bonferroni'.
See |
org |
An organism name; defaults to 'hsapiens'.
See |
enrichfind_hp
A data.frame with the enrichment analysis results.
Matineh Rahmatbakhsh, [email protected]
See enrichplot for plotting enrichment analysis.
data('predicted_PPIs')
# perform enrichment
enrich.df <- enrichfind_hp(predicted_PPIs,
  threshold = 0.05,
  sources = c("GO", "KEGG"),
  p.corrction.method = "bonferroni",
  org = "hsapiens")
This function uses the gost function in the gprofiler2 package to perform functional enrichment analysis for pathogen interactors in the high-confidence network.
enrichfindP( ppi, threshold = 0.05, sources = c("GO", "KEGG"), p.corrction.method = "bonferroni", org = "hsapiens" )
ppi |
A data.frame containing pathogen proteins in the first column and host proteins in the second column. |
threshold |
Custom p-value threshold for significance. |
sources |
A vector of data sources to use.
See |
p.corrction.method |
The algorithm used for multiple testing correction; defaults to 'bonferroni'.
See |
org |
An organism name; defaults to 'hsapiens'.
See |
enrichfindP
A data.frame with the enrichment analysis results.
Matineh Rahmatbakhsh, [email protected]
See enrichplot for plotting enrichment analysis.
data('predicted_PPIs')
# perform enrichment
enrich.df <- enrichfindP(predicted_PPIs,
  threshold = 0.05,
  sources = c("GO", "KEGG"),
  p.corrction.method = "bonferroni",
  org = "hsapiens")
This function plots the enrichment result.
enrichplot(x, low = "blue", high = "red", cex.size = 15)
x |
A data.frame with the enrichment analysis results. |
low |
Colours for low. |
high |
Colours for high. |
cex.size |
Text size. |
enrichplot
An enrichment plot.
Matineh Rahmatbakhsh, [email protected]
See enrichfindP for functional enrichment analysis.
data('enrich.df')
# select enrichment for one of the examples (e.g., E protein)
enrich.df <- enrich.df[enrich.df$id == "E:P0DTC4", ]
enrichplot(enrich.df, low = "blue", high = "red", cex.size = 10)
Input data for pred_ensembel
data(example_data)
A data.frame containing unlabeled or labeled HP-PPIs and pre-computed numerical features.
Given an input matrix, compute the missingness rate for each feature and keep only the features whose missing rate does not exceed the user-defined percentage.
filter_missing_values(x, max_miss_rate = 20)
x |
A numeric matrix as input. |
max_miss_rate |
Maximal missing rate allowed for a feature; default is 20. |
filter_missing_values
A data.frame containing only the features whose missingness rate does not exceed the user-defined threshold.
Matineh Rahmatbakhsh, [email protected]
x <- matrix(1:10, ncol = 2)
x[, 2] <- NA
filter_missing_values(x, 30)
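The filtering step can be pictured with a few lines of base R. This is a sketch under the assumption that the missing rate is a per-column percentage and that features at or below max_miss_rate are kept; filter_missing_sketch and the column names f1 and f2 are hypothetical and not part of HPiP.

```r
# Hypothetical helper (not an HPiP function): drop columns whose percentage
# of missing values exceeds max_miss_rate.
filter_missing_sketch <- function(x, max_miss_rate = 20) {
  miss_rate <- colMeans(is.na(x)) * 100  # percent missing per feature
  x[, miss_rate <= max_miss_rate, drop = FALSE]
}

m <- matrix(1:10, ncol = 2, dimnames = list(NULL, c("f1", "f2")))
m[, 2] <- NA                       # f2 is 100% missing
kept <- filter_missing_sketch(m, 30)
colnames(kept)                     # only the feature within the allowed rate
```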
This function plots the frequency of interactions of each pathogen protein with host proteins.
FreqInteractors(ppi, cex.size = 12)
ppi |
A data.frame containing pathogen proteins in the first column and host proteins in the second column. |
cex.size |
Text size. |
FreqInteractors
A frequency plot.
Matineh Rahmatbakhsh, [email protected]
ppi <- data.frame(
  node1 = c("A", "A", "A", "B", "B", "B", "B"),
  node2 = c("C", "E", "D", "F", "G", "H", "I")
)
FreqInteractors(ppi)
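The quantity being plotted is, in essence, the number of host partners per pathogen protein. A base-R sketch of that count, assuming each row is one interaction (the variable names are illustrative and not part of HPiP):

```r
# Count host interactors per pathogen protein (first column of the table).
ppi_toy <- data.frame(
  node1 = c("A", "A", "A", "B", "B", "B", "B"),
  node2 = c("C", "E", "D", "F", "G", "H", "I")
)
# split() groups host proteins by pathogen protein; length() counts them
interactor_counts <- vapply(split(ppi_toy$node2, ppi_toy$node1),
                            length, integer(1))
interactor_counts
```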
This function performs feature selection via two approaches:
filter.corr - compute the correlation matrix between features and filter using a threshold.
rfeFS - perform recursive feature elimination (RFE) wrapped with a Random Forest (RF) algorithm for feature importance evaluation.
FSmethod( x, type = c("cor", "rfe", "both"), cor.cutoff = 0.7, resampling.method = "cv", iter = 2, repeats = 3, metric = "Accuracy", verbose = TRUE )
x |
A data.frame containing protein-protein interactions, class labels and features. |
type |
The feature selection type, one or two of
|
cor.cutoff |
Correlation coefficient cutoff used for filtering.
See |
resampling.method |
The resampling method for RFE: 'boot', 'boot632', 'optimism_boot', 'boot_all', 'cv', 'repeatedcv', 'LOOCV', 'LGOCV'; defaults to cv. See |
iter |
Number of partitions for cross-validation;
defaults to 2. See |
repeats |
For repeated k-fold cross validation only; defaults to 3. See |
metric |
A string that specifies what summary metric will be used to select the optimal features; defaults to Accuracy. See |
verbose |
Make the output verbose. See |
FSmethod
If the type is set to filter.corr, the output includes the following elements:
corProfile - A correlation matrix.
corSelectedFeatures - Names of the features retained after the correlation analysis.
cordf - The filtered data.frame.
If the type is set to rfeFS, the output includes the following elements:
rfProfile - A list of elements. See rfe for more details.
rfSelectedFeatures - Names of the features retained in the feature selection process.
rfdf - The filtered data.frame.
If the type is set to both, the output includes the following elements:
rfdf - The final data.frame that includes the features retained after both filter.corr and rfeFS analysis.
Matineh Rahmatbakhsh, [email protected].
data('example_data')
x <- na.omit(example_data)
s <- FSmethod(x, type = 'both',
  cor.cutoff = 0.7, resampling.method = "repeatedcv",
  iter = 5, repeats = 3, metric = "ROC", verbose = TRUE)
Construct true negative protein-protein interactions from the positive interactions. In the context of PPI prediction, a negative interaction is a pair of proteins that is unlikely to interact. Since there are no experimentally verified non-interacting pairs, negative sampling can be used to construct the negative reference set. The negative set can be constructed from a set of host proteins, a set of pathogen proteins, and a list of positive reference interactions between members of the host and pathogen protein sets (Eid et al., 2016).
get_negativePPI(prot1, prot2, TPset)
prot1 |
A character vector containing pathogen proteins. |
prot2 |
A character vector containing host proteins. |
TPset |
A character vector containing positive reference interactions. |
get_negativePPI
A data.frame containing true negative interactions.
Matineh Rahmatbakhsh, [email protected]
Eid, F.-E., ElHefnawi, M., and Heath, L. S. (2016). DeNovo: virus-host sequence-based protein–protein interaction prediction. Bioinformatics 32, 1144–1150.
See get_positivePPI for generating positive protein-protein interactions.
prot1 <- c("P0DTC4", "P0DTC5", "P0DTC9")
prot2 <- c("Q9Y679", "Q9NW15", "Q9NXF8")
TPset <- c("P0DTC4~P31948", "P0DTC8~Q13438")
TN_PPI <- get_negativePPI(prot1, prot2, TPset)
head(TN_PPI)
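The negative sampling idea described above can be sketched in base R: enumerate every pathogen-host pair and drop those listed in the positive set. The helper negative_ppi_sketch, the toy identifiers, and the column names are hypothetical; the "~" separator follows the TPset format in the example, and get_negativePPI may sample or format its output differently.

```r
# Hypothetical helper (not an HPiP function): all pathogen-host pairs that
# are absent from the positive reference set.
negative_ppi_sketch <- function(prot1, prot2, TPset) {
  all_pairs <- expand.grid(pathogen = prot1, host = prot2,
                           stringsAsFactors = FALSE)
  all_pairs$PPI <- paste(all_pairs$pathogen, all_pairs$host, sep = "~")
  # Keep only pairs not reported as positive interactions
  all_pairs[!all_pairs$PPI %in% TPset, , drop = FALSE]
}

pathogen_ids <- c("P0DTC4", "P0DTC5")
host_ids <- c("Q9Y679", "P31948")
positives <- c("P0DTC4~P31948")
neg <- negative_ppi_sketch(pathogen_ids, host_ids, positives)
neg$PPI
```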
This function retrieves positive reference host-pathogen protein-protein interactions directly from BioGRID database.
get_positivePPI( organism.taxID, access.key, filename = "PositiveInt.RData", path = "PositiveInt" )
organism.taxID |
Taxonomy identifier for the pathogen. |
access.key |
Access key for using BioGRID webpage. To retrieve interactions from the BioGRID database, the users are first required to register for access key at https://webservice.thebiogrid.org/. |
filename |
A character string, indicating the output filename as an RData object to store the retrieved interactions. |
path |
A character string indicating the path to the project directory that contains the interaction data. If the directory is missing, it will be stored in the current directory. Default is PositiveInt. |
get_positivePPI
A data.frame containing true positive protein-protein interactions for the selected pathogen.
Matineh Rahmatbakhsh, [email protected]
See get_negativePPI for generating negative protein-protein interactions.
local <- tempdir()
try(get_positivePPI(organism.taxID = 2697049,
  access.key = 'XXXX',
  filename = "PositiveInt.RData",
  path = local))
This function retrieves protein sequences in FASTA format directly from the UniProt database via UniProt protein IDs. This function also checks if the amino-acid composition of protein sequences is in the 20 default types.
getFASTA(uniprot.id, filename = "FASTA.RData", path = "FASTASeq")
uniprot.id |
A character vector of UniProt identifiers. |
filename |
A character string, indicating the output filename as an RData object to store the retrieved sequences. |
path |
A character string indicating the path to the project directory that contains the interaction data. If the directory is missing, it will be stored in the current directory. Default is FASTASeq. |
getFASTA
A list containing protein FASTA sequences.
Matineh Rahmatbakhsh, [email protected]
# get FASTA sequences for three proteins of SARS-CoV-2
local <- tempdir()
uniprot.id <- c("P0DTC4", "P0DTC5", "P0DTC9")
fasta_df <- getFASTA(uniprot.id, filename = 'FASTA.RData', path = local)
head(fasta_df)
This function calculates Host-Pathogen Protein-Protein Interaction (HP-PPI) descriptors via two approaches:
combine - combine the two descriptor matrices; the result has (p1 + p2) columns.
kron.prod - if A is an m x n matrix and B is a p x q matrix, then the Kronecker product is the (pm × qn) block matrix.
getHPI(pathogenData, hostData, type = c("combine", "kron.prod"))
pathogenData |
The pathogen descriptor matrix. |
hostData |
The host descriptor matrix. |
type |
The interaction type, one or two of
|
getHPI
A matrix containing the Host-Pathogen Protein-Protein Interaction (HP-PPI) descriptors.
Matineh Rahmatbakhsh [email protected]
x <- matrix(c(1, 2, 3, 1), nrow = 2, ncol = 2, byrow = TRUE)
y <- matrix(c(0, 3, 2, 1), nrow = 2, ncol = 2, byrow = TRUE)
getHPI(x, y, "combine")
getHPI(x, y, "kron.prod")
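The two ways of joining pathogen and host descriptors can be illustrated with base R alone. The sketch below assumes that each row of the two matrices describes one interacting pair and that the Kronecker product is taken row by row; whether getHPI works exactly this way is an assumption, and the variable names are illustrative.

```r
# Toy descriptor matrices: one row per protein pair, two features each
pathogen <- matrix(c(1, 2, 3, 1), nrow = 2, byrow = TRUE)
host <- matrix(c(0, 3, 2, 1), nrow = 2, byrow = TRUE)

# "combine": place the descriptors side by side, giving p1 + p2 columns
combined <- cbind(pathogen, host)

# "kron.prod" (assumed row-wise): every pathogen feature multiplied by
# every host feature, giving p1 * p2 columns per pair
kron_rows <- t(vapply(
  seq_len(nrow(pathogen)),
  function(i) as.vector(kronecker(pathogen[i, ], host[i, ])),
  numeric(ncol(pathogen) * ncol(host))
))

dim(combined)
kron_rows
```

For comparison, kronecker(pathogen, host) applied to the full matrices returns a 4 x 4 block matrix, consistent with the dimension rule stated for kron.prod.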
This dataset consists of experimentally validated human-SARS-CoV-2 interactions (positive set) and non-interacting pairs (negative set). The following data consists of:
PPI: SARS-CoV-2-human protein-protein interactions (PPIs)
Official Symbol Interactor A: SARS-CoV-2 gene names
Official Symbol Interactor B: human host gene names
Pathogen_Protein: UniProt identifiers for SARS-CoV-2 virus
Host_Protein: UniProt identifiers for human proteins
class: labeled examples (both positive and negative)
data(Gold_ReferenceSet)
A data.frame containing 500 validated pairs (i.e., positive set) and 500 non-interacting pairs (i.e., negative set).
To construct this dataset, validated interactions (positive set) were retrieved from the BioGRID database and further filtered to include only those interactions reported by Samavarchi-Tehrani et al. (2020). In that study, the authors mapped interactions between 27 SARS-CoV-2 proteins and human proteins via the proximity-dependent biotinylation (BioID) approach. 500 SARS-CoV-2-host interaction pairs were then randomly selected from all pairs to serve as positive examples. To construct negative examples, negative sampling was performed using get_negativePPI.
https://www.biorxiv.org/content/10.1101/2020.09.03.282103v1
Samavarchi-Tehrani,P. et al. (2020) A SARS-CoV-2-host proximity interactome. BioRxiv.
SummarizedExperiment object of numerical features for host proteins.
data(host_se)
To construct this object, protein sequences were first converted to numerical features using the CTD descriptors provided in the HPiP package, and each numerical feature matrix was then converted to a SummarizedExperiment object. The objects were then merged into one object using 'cbind()'.
Given an input matrix, impute the missing values via three approaches including mean, median or zero.
impute_missing_data(x, method = c("mean", "median", "zero"))
x |
A numeric matrix as input. |
method |
Imputation method for missing values (mean, median or zero). |
impute_missing_data
Imputed matrix.
Matineh Rahmatbakhsh [email protected]
x <- matrix(1:10, ncol = 2)
x[1:3, 2] <- NA
row.names(x) <- c("A", "B", "C", "D", "E")
colnames(x) <- c("col1", "col2")
impute_missing_data(x, method = "mean")
impute_missing_data(x, method = "median")
impute_missing_data(x, method = "zero")
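The three imputation options can be sketched in base R. This sketch assumes imputation is done column by column, using the observed values of each feature; impute_sketch is a hypothetical helper, and impute_missing_data may compute its fill values differently.

```r
# Hypothetical helper (not an HPiP function): replace NA values in each
# column by that column's mean, median, or zero.
impute_sketch <- function(x, method = c("mean", "median", "zero")) {
  method <- match.arg(method)
  fill_value <- switch(method,
    mean = function(v) mean(v, na.rm = TRUE),
    median = function(v) median(v, na.rm = TRUE),
    zero = function(v) 0
  )
  for (j in seq_len(ncol(x))) {
    is_missing <- is.na(x[, j])
    x[is_missing, j] <- fill_value(x[, j])
  }
  x
}

toy <- matrix(as.numeric(1:10), ncol = 2)
toy[1:3, 2] <- NA                      # observed values in column 2: 9, 10
by_mean <- impute_sketch(toy, "mean")
by_zero <- impute_sketch(toy, "zero")
by_mean[, 2]
```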
Plot the predicted PPIs. This function uses the plot function of the igraph package.
plotPPI( ppi, edge.name = "ensemble_score", node.color = "grey", edge.color = "orange", cex.node = 4, node.label.dist = 1.5 )
ppi |
A data.frame containing protein-protein interactions with edge score. |
edge.name |
A character string giving an edge attribute name. |
node.color |
The fill color of the node. |
edge.color |
The color of the edge. |
cex.node |
The size of the node. |
node.label.dist |
The distance of the label from the center of the node. |
plotPPI
A PPI plot.
Matineh Rahmatbakhsh, [email protected]
df <- data.frame(
  node1 = c("A", "B", "C", "D", "E"),
  node2 = c("C", "E", "E", "E", "A"),
  edge.scores = c(0.5, 0.4, 0.3, 0.2, 0.7)
)
plotPPI(df, edge.name = "edge.scores")
This function uses an ensemble of classifiers to predict interactions from the sequence-based dataset. This ensemble algorithm combines different results generated from individual classifiers within the ensemble via average to enhance prediction.
pred_ensembel( features, gold_standard, classifier = c("avNNet", "svmRadial", "ranger"), resampling.method = "cv", ncross = 2, repeats = 2, verboseIter = TRUE, plots = TRUE, filename = "plots.pdf" )
features |
A data frame with host-pathogen protein-protein interactions (HP-PPIs) in the first column, and features to be passed to the classifier in the remaining columns. |
gold_standard |
A data frame with gold_standard HP-PPIs and class label indicating if such PPIs are positive or negative. |
classifier |
The type of classifier to use. See |
resampling.method |
The resampling method: 'boot', 'boot632', 'optimism_boot', 'boot_all', 'cv', 'repeatedcv', 'LOOCV', 'LGOCV'; defaults to cv. See |
ncross |
Number of partitions for cross-validation; defaults to 2. See |
repeats |
For repeated k-fold cross validation only; defaults to 2. See |
verboseIter |
Logical value, indicating whether to print the status of the training process; defaults to TRUE. |
plots |
Logical value, indicating whether to plot the performance of the ensemble learning algorithm compared to the individual classifiers; defaults to TRUE. If the argument is set to TRUE, plots will be saved in the current working directory. These plots are:
|
filename |
A character string, indicating the output filename of the PDF file. |
pred_ensembel
Ensemble_training_output
prediction score - Prediction scores for whole dataset from each individual classifier.
Best - Selected hyper parameters.
Parameter range - Tested hyper parameters.
prediction_score_test - Scores probabilities for test data from each individual classifier.
class_label - Class probabilities for test data from each individual classifier.
classifier_performance
cm - A confusion matrix.
ACC - Accuracy.
SE - Sensitivity.
SP - Specificity.
PPV - Positive Predictive Value.
F1 - F1-score.
MCC - Matthews correlation coefficient.
Roc_Object - A list of elements. See roc for more details.
PR_Object - A list of elements. See pr.curve for more details.
predicted_interactions - The input data frame of pairwise interactions, including classifier scores averaged across all models.
Matineh Rahmatbakhsh, [email protected]
data('example_data')
features <- example_data[, -2]
gd <- example_data[, c(1, 2)]
gd <- na.omit(gd)
ppi <- pred_ensembel(features, gd,
    classifier = c("avNNet", "svmRadial", "ranger"),
    resampling.method = "cv", ncross = 2, verboseIter = FALSE,
    plots = FALSE, filename = "plots.pdf")
# extract predicted interactions
pred_interaction <- ppi[["predicted_interactions"]]
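The predicted_interactions element holds classifier scores averaged across all models. A minimal base-R sketch of that averaging step follows, assuming an unweighted mean; the scores, the "~"-separated pair identifiers, and the 0.5 cutoff are hypothetical and chosen only for illustration.

```r
# Hypothetical per-classifier probabilities for four candidate HP-PPIs.
scores <- data.frame(
  PPI       = c("P1~H1", "P1~H2", "P2~H3", "P2~H4"),
  avNNet    = c(0.90, 0.20, 0.60, 0.40),
  svmRadial = c(0.80, 0.30, 0.70, 0.10),
  ranger    = c(0.70, 0.10, 0.80, 0.40),
  stringsAsFactors = FALSE
)

# Unweighted mean across classifiers (assumed combination rule).
scores$ensemble <- rowMeans(scores[, c("avNNet", "svmRadial", "ranger")])

# Keep pairs above an illustrative cutoff.
predicted <- scores[scores$ensemble > 0.5, c("PPI", "ensemble")]
```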
Input data for enrichfindP
data(predicted_PPIs)
This function contains five module detection algorithms: the fast-greedy algorithm (FC), the walktrap algorithm (RW), the multi-level community algorithm (ML), the label propagation algorithm (clp), and Markov clustering (MCL).
run_clustering( ppi, method = c("FC", "RW", "ML", "clp", "MCL"), expan = 2, infla = 5, iter = 50 )
ppi |
A data.frame containing pathogen proteins in the first column, host proteins in the second column, and edge weights in the third column. |
method |
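Module detection algorithm, one of: FC (fast-greedy), RW (walktrap), ML (multi-level community), clp (label propagation), or MCL (Markov clustering).
|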
expan |
Numeric value > 1 for the expansion parameter.
See |
infla |
Numeric value > 0 for the inflation power coefficient.
See |
iter |
An integer, the maximum number of iterations for the MCL.
See |
run_clustering
A data.frame with the enrichment analysis results.
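To show how the expan, infla, and iter arguments enter Markov clustering, here is a compact base-R sketch of the expansion/inflation loop on a toy pathogen-host network. It is an illustration under stated assumptions, not the implementation that run_clustering calls: the function mcl_sketch, the toy identifiers and weights, the added self-loops, the convergence tolerance, and the simplified cluster read-out are all hypothetical, and expan is assumed to be an integer.

```r
# Toy bipartite PPI table (hypothetical identifiers and weights).
ppi <- data.frame(
  pathogen = c("V1", "V1", "V2", "V2"),
  host     = c("H1", "H2", "H3", "H4"),
  weight   = c(0.9, 0.8, 0.7, 0.95),
  stringsAsFactors = FALSE
)

# Weighted, symmetric adjacency matrix over all proteins.
nodes <- unique(c(ppi$pathogen, ppi$host))
A <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
A[cbind(ppi$pathogen, ppi$host)] <- ppi$weight
A <- A + t(A)   # undirected network
diag(A) <- 1    # self-loops, commonly added before MCL

# Rescale columns to sum to 1 (column-stochastic flow matrix).
normalize <- function(M) sweep(M, 2, colSums(M), "/")

mcl_sketch <- function(A, expan = 2, infla = 5, iter = 50) {
  M <- normalize(A)
  for (i in seq_len(iter)) {
    E <- M
    for (k in seq_len(expan - 1)) E <- E %*% M  # expansion: matrix power
    M_new <- normalize(E^infla)                 # inflation: element-wise power
    converged <- max(abs(M_new - M)) < 1e-8
    M <- M_new
    if (converged) break                        # stop once the flow is stable
  }
  M
}

M <- mcl_sketch(A)

# Simplified read-out: each node joins the row that holds most of its flow.
attractor <- rownames(M)[apply(M, 2, which.max)]
clusters  <- split(colnames(M), attractor)
```

Larger infla values sharpen the flow faster and tend to give smaller modules, while expan controls how far flow spreads in each iteration.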
Matineh Rahmatbakhsh, [email protected]
This dataset consists of interactions between SARS-CoV-2 and human proteins, identified by AP-MS (affinity purification mass spectrometry).
data(unlabel_data)
A data.frame containing 700 SARS-CoV-2-Human protein-protein interactions (PPIs) with pre-computed numerical features using CTD (composition/transition/distribution) descriptors.
To construct this dataset, SARS-CoV-2-human PPIs (supplementary table 1) were retrieved from Gordon et al. (2020), and 700 pairs were randomly selected from the full set. The protein sequences of the host and viral proteins were then converted to numerical features, and the computed features were concatenated to represent each host-pathogen PPI.
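The construction steps above (sample pairs, compute per-protein descriptors, concatenate) can be sketched in base R as follows. This is an illustration only: it swaps in plain amino acid composition instead of the CTD descriptors used for the real dataset, and the sequences, pair list, sample size, and "~" separator are hypothetical.

```r
# Toy sequences (hypothetical; real inputs would be full UniProt sequences).
host_seq  <- c(H1 = "MKTAYIAKQR", H2 = "GAVLIMFWPS", H3 = "TCYNQDEKRH")
viral_seq <- c(V1 = "MFVFLVLLPL", V2 = "MSDNGPQNQR")

# Amino acid composition: fraction of each of the 20 standard residues.
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
aac <- function(s) {
  counts <- table(factor(strsplit(s, "")[[1]], levels = aa))
  as.numeric(counts) / nchar(s)
}

# Known interacting pairs (hypothetical), then a random subset
# (2 here; 700 in the real dataset).
pairs <- data.frame(viral = c("V1", "V1", "V2", "V2"),
                    host  = c("H1", "H2", "H2", "H3"),
                    stringsAsFactors = FALSE)
set.seed(42)
pairs <- pairs[sample(nrow(pairs), 2), ]

# Concatenate viral and host descriptors into one feature row per pair.
features <- t(mapply(function(v, h) c(aac(viral_seq[[v]]), aac(host_seq[[h]])),
                     pairs$viral, pairs$host))
colnames(features) <- c(paste0("viral_", aa), paste0("host_", aa))
rownames(features) <- paste(pairs$viral, pairs$host, sep = "~")
```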
https://www.nature.com/articles/s41586-020-2286-9#Sec36
Gordon, D.E. et al. (2020) A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature, 583, 459–468.
This data includes one protein sequence per SARS-CoV-2 gene, retrieved directly from the UniProt database using getFASTA.
data(UP000464024_df)
A data.frame with two columns: (1) UniprotKBID, the UniProt identifier; (2) FASTASEQ, the sequence for each SARS-CoV-2 gene.
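For readers who want to build a table with the same two-column layout from FASTA text they already have, the base-R sketch below parses FASTA records into UniprotKBID and FASTASEQ columns. It does not call getFASTA or contact UniProt; the records hold only the first residues of each sequence, and the parsing assumes UniProt-style headers of the form ">db|accession|name".

```r
# Two toy FASTA records (first residues only, for brevity).
fasta <- c(
  ">sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein",
  "MFVFLVLLPL",
  "VSSQCVNLTT",
  ">sp|P0DTC9|NCAP_SARS2 Nucleoprotein",
  "MSDNGPQNQR"
)

is_header <- startsWith(fasta, ">")
record_id <- cumsum(is_header)  # record index for every line

# The accession is the second '|'-separated field of the header.
ids <- vapply(strsplit(fasta[is_header], "|", fixed = TRUE),
              `[`, character(1), 2)

# Join the sequence lines that belong to the same record.
seqs <- tapply(fasta[!is_header], record_id[!is_header], paste, collapse = "")

df <- data.frame(UniprotKBID = ids, FASTASEQ = as.character(seqs),
                 stringsAsFactors = FALSE)
```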
https://www.uniprot.org/uniprot/?query=proteome:UP000464024
A graphical display of variable importance of selected features.
var_imp(x, cex.x = 1, cex.y = 2)
x |
A list of elements returned from RFE analysis.
See |
cex.x |
The size of x axis label. |
cex.y |
The size of y axis label. |
var_imp
Variable Importance Plot.
Matineh Rahmatbakhsh, [email protected].
data('example_data')
x <- na.omit(example_data)
# perform feature selection
s <- FSmethod(x, type = 'both', cor.cutoff = 0.7,
    resampling.method = "repeatedcv", iter = 5, repeats = 3,
    metric = "ROC", verbose = TRUE)
var_imp(s$rf.result$rfProfile, cex.x = 10, cex.y = 10)
SummarizedExperiment object of numerical features for SARS-CoV-2 proteins.
data(viral_se)
To construct this object, protein sequences were first converted to numerical features using the CTD descriptors provided in the HPiP package. Each numerical feature matrix was then converted to a SummarizedExperiment object, and the resulting objects were merged into one using 'cbind()'.