Infectious disease imposes a significant threat to human health and pose substantial healthcare costs. Infectious diseases result from the cross-talks between hosts and pathogens, which are mainly mediated by protein-protein interactions between hosts and pathogen proteins (HP-PPIs). The potential (HP-PPIs) represents the crucial elements of the infection mechanism as they decide the outcome, leading to either pathogen clearance or spread of the pathogen in the host due to evading the host immune response (Rahmatbakhsh et al., 2021). Therefore, the study of the host-pathogen interactome is increasingly vital to uncover the molecular attributes of infectious diseases and potentially discover novel pharmacological targets or laying a strong foundation for repurposing of existing drugs.
In the past decades, several high throughput experimental approaches have been developed to chart HP-PPIs on a large scale (e.g., yeast two-hybrid (Y2H) system (Ito et al., 2001) or affinity purification (AP) coupled to mass spectrometry (MS) (Puig et al., 2001)). However, such high-throughput experimental screens are typically laborious, time-consuming, and challenging to capture the complete interactome, resulting in limited number of experimentally validated interactome in a database of HP-PPIs (Hart et al., 2006). In-silico prediction of HP-PPIs can complement wet-lab experiments by suggesting candidate interaction partners for all the host proteins and excluding partners with low interacting probability scores to reduce the range of possible PPI candidates that need to be validated via wet-lab experiments. Specialized computational approaches to predict HP-PPIs are therefore of significant importance. While many computational tools have been developed to predict intra-species PPIs (i.e., PPIs within the same species) (Wu et al., 2006; Shen et al., 2007; Nourani et al., 2015), the availability of computational tools to predict inter-species PPIs such as HP-PPIs are rare.
For this purpose, we describe HPiP (host-pathogen interaction prediction), an R package for automated prediction of HP-PPIs using structural and physicochemical descriptors computed from amino acid-composition of host and pathogen proteins. Briefly, HPiP extracts gold-standard of experimentally verified HP-PPIs (i.e., positive interactions) from public repository, construct negative interactions via negative sampling, retrieve and convert protein sequences to numerical representation via various descriptors, applies multivariate feature selection based on correlation and recursive feature elimination (RFE)-embedded, and finally applies ensemble averaging to predict interactions. Taken together, we hope that the HPiP package not only contributes a useful predictor to accelerate the exploration of host-pathogen PPIs, but also provides some meaningful insights into host-pathogen relationships.
Briefly, HPiP downloads the gold-standard data sets of experimentally verified host-pathogen PPIs from the BioGRID database (Stark et al., 2006). Such interactions serve as a positive set. In the absence of ground truth negative examples, HPiP uses negative sampling to construct a negative set. Following the construction of gold-standard data, HPiP retrieves the FASTA sequences of associated proteins. HPiP then represents protein sequences into a fixed-length feature vector using a series of structural and physicochemical descriptors. Host-pathogen feature vectors and the accompanying gold standard reference set also called the training set, are fed into the hybrid filter-wrapper feature selection method to select the most relevant features in inferring the target variable. In the following step, HPiP uses a training set to train a series of individual machine learning models (base learners) provided in the caret package (Kuhn et al., 2020). For each applied model, hyperparameters are tweaked throughout training via resampling techniques (e.g., k-fold cross-validation), and the best set of hyperparameters are selected based on the accuracy performance measure. The optimized models will then be applied to host-pathogen feature vectors with an unknown class label to return a classification result for each pair. The HPiP then uses ensemble averaging to average classification results over an ensemble of classifiers for each possible interaction. Finally, HPiP compares the algorithmic performance of the ensemble model with individual base learners through resampling technique (e.g., k-fold cross-validation) and various performance metrics (e.g., accuracy).
In the following sections, we explain the main components of the HPiP package, including dataset preparation (i.e., construction of the gold-standard set, FASTA sequence retrieval), feature extraction, data processing steps (i.e., imputation of missing values, feature selection), ensemble model generation and evaluation, prediction of HP-PPIs, network visualization and external validation of the predicted network using functional enrichment analysis. Furthermore, we guide users through each step by applying the HPiP to sample data derived from public databases.
In this tutorial, we use the data provided by Samavarchi-Tehrani et al. (2020) as our
benchmark dataset. In this study, the authors mapped interaction between
27 SARS-CoV-2 and host proteins via the proximity-dependent
biotinylation (BioID) approach. We then randomly selected 500
SARS-CoV-2-host interaction pairs from all pairs as the positive
samples. Since ground truth negatives are not available, here negative
examples are generated from the positive set using negative sampling
(Eid et al., 2016). In this
approach, negative instances are sampled from all the possible pairwise
combinations of host and viral proteins, as long as the possible pairs
do not occur in the positive reference set. To prevent statistical
differences, the same scale is assumed for the negative and positive
instances (i.e., the ratio of positive to negative 1:1) (Zhou et al., 2018). The
gold-reference
data set can be loaded with the following
command:
# Loading packages required for data handling & data manipulation
library(dplyr)
library(tibble)
library(stringr)
# Loading HPiP package
library(HPiP)
# Loading data
data(Gold_ReferenceSet)
dim(Gold_ReferenceSet)
## [1] 1000 6
As stated by dim()
the gold reference set includes 1000
HP-PPIs interaction between 27 SARS-CoV-2 and 784 host proteins.
In addition, users can use get_positivePPI
in the HPiP
package to construct positive set
from the BioGRID database
(Stark et al., 2006).
This function takes the following parameters:
organism.taxID
Taxonomy identifier for the
pathogen.access.key
Access key for using BioGRID webpage. To
retrieve interactions from the BioGRID database, the users are first
required to register for access key.filename
A character string, indicating the output
filename as an RData object to store the retrieved interactions.path
A character string indicating the path to the
project directory that contains the interaction data. If the directory
is missing, it will be stored in the current directory.local = tempdir()
#Get positive interactions from BioGrid
TP <- get_positivePPI(organism.taxID = 2697049,
access.key = 'XXXX',
filename = "PositiveInt.RData",
path = local)
Subsequently, we can construct negative set via negative sampling using the following command:
#pathogen proteins
prot1 <- unique(TP$`Official Symbol Interactor A`)
#host proteins
prot2 <- unique(TP$`Official Symbol Interactor B`)
#true positive PPIs
TPset <- TP$PPI
TN <- get_negativePPI(prot1 , prot2, TPset)
dim(TN)
## [1] 2600 1
To compute different features from protein sequences, we must first
extract their sequences (in FASTA format). The getFASTA
function in the HPiP package can retrieve the sequences for any organism
from the UniProt database.
To apply a learning algorithm on a host or pathogen protein sequence, it is needed to encode sequence information as numerical features. However, one of the critical challenges in inferring protein-protein interactions from the protein sequences is finding an appropriate way to encode protein sequences’ important information fully. Also, the amino-acid sequences of different lengths should be converted to fixed-length feature vectors, which is crucial in classifying training data using machine-learning techniques as such techniques require fixed-length patterns. The HPiP offers multiple functions for generating various numerical features from protein sequences.
These feature coding schemes listed in HPiP include amino acid composition (AAC) , dipeptide composition (DC), tripeptide composition (TC), tripeptide composition (TC) from biochemical similarity classes, quadruplets composition (QC), F1, F2, CTD (composition/transition/distribution), conjoint triad, autocorrelation, k-spaced amino acid pairs, and binary encoding.
The amino acid composition (AAC) has low complexity and has been widely used to predict protein-protein interactions (PPIs) (Beltran et al., 2019; Dey et al., 2020).The AAC explains the fraction of a type of amino acid found within a protein sequence (Dey et al., 2020). This gives 20-dimensional feature vectors. For example, the fraction of all 20 natural amino acids is computed as follow:
$$ f_{(i)}=\frac{n_i}{L} \text{ }\ (i = 1,2,3,....,20) $$
where ni is the number of amino acid type and L is the sequence length. The ACC descriptor from the protein sequences can be loaded with the following command:
# Convert the list of sequences obtained in the previous section to data.frame
fasta_df <- do.call(rbind, fasta_list)
fasta_df <- data.frame(UniprotID = row.names(fasta_df),
sequence = as.character(fasta_df))
#calculate AAC
acc_df <- calculateAAC(fasta_df)
#only print out the result for the first row
acc_df[1,-1]
## A C D E F G
## 0.062058130 0.031421838 0.048703849 0.037706206 0.060487038 0.064414768
## H I K L M N
## 0.013354281 0.059701493 0.047918303 0.084838963 0.010997643 0.069128044
## P Q R S T V
## 0.045561665 0.048703849 0.032992930 0.077769049 0.076197958 0.076197958
## W Y
## 0.009426551 0.042419482
The dipeptide composition (DC) is simply the fraction of the different adjacent pairs of amino acids within a protein sequence (Bhasin and Raghava, 2004). Also, this descriptor encapsulates the properties of neighboring amino acids. Dipeptide composition converts a protein sequence to a vector of 400 dimensions. The composition of all 400 natural amino acids can be calculated using the following equation:
$$ f_{(m,k)}=\frac{n_{m,k}}{L-1} \text{ }\ (m,k = 1,2,3,....,20) $$
where nm,k corresponds to the number of dipeptide compositions characterized by amino acid type m and type k, while L is the sequence length.The DC descriptor from the protein sequences can be loaded with the following command:
# using data.frame provided by getFASTA function as data input
dc_df <- calculateDC(fasta_df)
#only print out the first 30 elements for the first row
dc_df[1, c(2:31)]
## AA AC AD AE AF AG
## 0.0047169811 0.0000000000 0.0047169811 0.0031446541 0.0000000000 0.0062893082
## AH AI AK AL AM AN
## 0.0007861635 0.0055031447 0.0007861635 0.0062893082 0.0007861635 0.0023584906
## AP AQ AR AS AT AV
## 0.0031446541 0.0039308176 0.0015723270 0.0062893082 0.0031446541 0.0031446541
## AW AY CA CC CD CE
## 0.0015723270 0.0039308176 0.0023584906 0.0031446541 0.0015723270 0.0007861635
## CF CG CH CI CK CL
## 0.0007861635 0.0031446541 0.0007861635 0.0000000000 0.0007861635 0.0023584906
The tripeptide composition explains the occurrence of adjacent triune residues in a protein sequence (Liao et al., 2011). Tripeptide composition converts a protein sequence to a vector of 8,000 dimensions. The composition of all 8,000-dimensional descriptor can be calculated using the following equation: $$ f_{(m,k,j)}=\frac{n_{m,k,j}}{L-2} \text{ }\ (m,k,j = 1,2,3,....,20) $$ where nm,k,j corresponds to the number of tripeptide compositions characterized by amino acid type m, k and j, while L is the sequence length.The TC descriptor from the protein sequences can be loaded with the following command:
# using data.frame provided by getFASTA function as data input
tc_df <- calculateTC(fasta_df)
#only print out the first 30 elements for the first row
tc_df[1, c(2:31)]
## AAA AAC AAD AAE AAF AAG
## 0.0007867821 0.0000000000 0.0000000000 0.0007867821 0.0000000000 0.0000000000
## AAH AAI AAK AAL AAM AAN
## 0.0000000000 0.0000000000 0.0000000000 0.0007867821 0.0000000000 0.0000000000
## AAP AAQ AAR AAS AAT AAV
## 0.0000000000 0.0000000000 0.0007867821 0.0000000000 0.0007867821 0.0000000000
## AAW AAY ACA ACC ACD ACE
## 0.0000000000 0.0007867821 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## ACF ACG ACH ACI ACK ACL
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
In order to reduce the dimension of length-8,000 TC descriptor, the sequence alphabet is reduced from 20 amino acids to six classes based on biochemical similarity. The classes are [{IVLM}, {FYW}, {HKR}, {DE}, {QNTP}, and {ACGS} (Ahmed et al., 2018)]. This classification of amino acids converts a protein sequence to a vector of 216 (i.e., 6 * 6 * 6) different combinations of possible substrings of length 3. The frequency of triplet for each encoded class in the protein sequence can be calculated as follows:
$$ q_{(i)}=\frac{f_i - M_0}{M_1-M_0} $$ M0 = min(f1, f2, ..., f216) and M1 = max(f1, f2, ..., f216)
Here fi is the frequency of ith triplet in the sequence i=1,2,…,216. To get 216-dimensional descriptor from the protein sequences, the following command can be used:
# using data.frame provided by getFASTA function as data input
TC_Sm_df <- calculateTC_Sm(fasta_df)
#only print out the first 30 elements for the first row
TC_Sm_df[1, c(2:31)]
## 1 2 3 4 5 6 7 8
## 0.7053571 0.2946429 0.4732143 0.4910714 0.8392857 0.8660714 0.2857143 0.1517857
## 9 10 11 12 13 14 15 16
## 0.2142857 0.1875000 0.2946429 0.3482143 0.4107143 0.1964286 0.2053571 0.1160714
## 17 18 19 20 21 22 23 24
## 0.2589286 0.3571429 0.3035714 0.1428571 0.2321429 0.1517857 0.2410714 0.3928571
## 25 26 27 28 29 30
## 0.7321429 0.3035714 0.4285714 0.3303571 0.7500000 0.6250000
To compute these features, the sequence alphabet is first reduced to
six classes reported above (section 3.3.2.4
). This
reduction converts a protein sequence to a vector of 1296 (i.e., 6 * 6 *
6 * 6) different combinations of possible substrings of length 4 (Ahmed et al., 2018). The frequency of
quadruplets for each encoded class in the protein sequence can be
calculated similarly to the equation mentioned above:
$$ q_{(i)}=\frac{f_i - M_0}{M_1-M_0} $$ M0 = min(f1, f2, ..., f1296) and M1 = max(f1, f2, ..., f1296) To get 1296-dimensional descriptor from the protein sequences, the following command can be used:
# using data.frame provided by getFASTA function as data input
QD_df <- calculateQD_Sm(fasta_df)
#only print out the first 30 elements for the first row
QD_df[1, c(2:31)]
## 1 2 3 4 5 6 7
## 0.42424242 0.18181818 0.27272727 0.48484848 0.48484848 0.66666667 0.15151515
## 8 9 10 11 12 13 14
## 0.21212121 0.15151515 0.09090909 0.24242424 0.27272727 0.45454545 0.06060606
## 15 16 17 18 19 20 21
## 0.24242424 0.18181818 0.21212121 0.57575758 0.24242424 0.15151515 0.36363636
## 22 23 24 25 26 27 28
## 0.24242424 0.27272727 0.51515152 0.66666667 0.42424242 0.54545455 0.21212121
## 29 30
## 0.63636364 0.48484848
F1 composition gives 20-dimensional description, defined as:
F1(SAR) = ∑SAR ϵ sequencelength(SAR)2
Where SAR is the sum of squared length of single amino acid repeats (SARs) in the entire protein sequence. Since F1 includes SAR of length 1, the F1 descriptor reveals global composition of amino acids and amino acid repeats (Alguwaizani et al., 2018).
Figure 1: Example of calculating F1 (repeats of S) in the protein sequence.
While, to calculate feature F2, the sequence alphabet is first split into substrings of length 6 residues (Alguwaizani et al., 2018). There are two main differences between feature F2 and F1:
F2 composition gives 20-dimensional description, defined as:
F1(SAR) = maxwindows ϵ sequence sumSAR ϵ sequencelength(SAR)2
Where SAR is the sum of squared length of single amino acid repeats (SARs) in the entire protein sequence.
# using data.frame provided by getFASTA function as data input
F1_df <- calculateF(fasta_df, type = "F1")
#only print out the result the first row
F1_df[1,-1]
## A C D E F G H I K L M N P Q R S T V W Y
## 93 48 66 50 79 88 17 82 65 130 14 102 64 66 44 117 109 113 12 60
# using data.frame provided by getFASTA function as data input
F2_df <- calculateF(fasta_df, type = "F2")
#only print out the result the first row
F2_df[1,-1]
## A C D E F G H I K L M N P Q R S T V W Y
## 9 4 4 4 4 4 1 4 4 9 1 4 4 4 4 9 4 9 1 4
To calculate CTD descriptors developed by (Dubchak et al., 1995, 1999), the 20
standard amino acids is first clustered into three classes according to
its attribute. Then, each amino acid in the protein sequence is encoded
by one of the indices 1,2,3 depending on its grouping. The corresponding
divisions for each group are shown in Table 1.
According to Hydrophobicity grouping mentioned in Table
1, the protein sequence CLVIMFWGASTPHYRKEDQN
is
replaced by 11111112222222333333
. Next, for a given
attribute, three types of descriptors, composition (C), transition (T),
and distribution (D) can be calculated, which will be explained in the
following sections.
Sl | Property | Group 1 | Group 2 | Group 3 |
---|---|---|---|---|
1 | Charge | Neutral | Negatively charged | Positively charged |
Amino acids | A,C,F,G,H,I,L,M,N,P,Q,S, | |||
T,V,W,Y | D,E | K,R | ||
2 | Hydrophobicity | Hydrophobicity | Neutral | Polar |
Amino acids | C,F,I,L,M,V,W | A,G,H,P,S,T,Y | D,E,K,N,Q,R | |
3 | Normalized vander Waals | 0-2.78 | 2.95-4.0 | 4.03-8.08 |
Amino acids | A,C,D,G,P,S,T | E,I,L,N,Q,V | F,H,K,M,R,W,Y | |
4 | Polarity | 4.9-6.2 | 8.0-9.2 | 10.4-13.0 |
Amino acids | C,F,I,L,M,V,W,Y | A,G,P,S,T | D,E,H,K,N,Q,R | |
5 | Polariizability | 0 - .108 | 0.128-0.186 | 0.219-0.409 |
Amino acids | A,D,G,S,T | C,E,I,L,N,P,Q,V | F,H,K,M,R,W,Y | |
6 | Secondary Structure | Coil | Helix | Strand |
Amino acids | D,G,N,P,S | A,E,H,K,L,M,Q,R | C,F,I,T,V,W,Y | |
7 | Solvent Accessibility | Buried | Intermediate | Exposed |
Amino acids | A,C,F,G,I,L,V,W | H,M,P,S,T,Y | D,E,K,N,R,Q |
The composition represents the number of amino acids of a particular
property (e.g., hydrophobicity) for each encoded class divided by the
protein sequence length (You et al.,
2014). In the above example using the hydrophobicity attribute,
the number for encoded classes 1
, 2
,
3
are 7,7,6 respectively. Therefore, the compositions for
each class are 7/20 =35%, 7/20 =35%,
and 6/20 =30%, respectively. Composition descriptor can
be defined as:
$$ C_{(i)}=\frac{n_i}{L} \text{ }\ (i = 1,2,3) $$
where ni is the number of amino acid type i and L is the sequence length. The C descriptor from the protein sequences can be loaded with the following command:
# using data.frame provided by getFASTA function as data input
CTDC_df <- calculateCTDC(fasta_df)
CTDC_df[1, c(-1)]
## G1.charge G1.hydrophobicity G1.normwaalsvolume G1.polarity
## 0.08091123 0.28515318 0.40612726 0.37549097
## G1.polarizability G1.secondarystruct G1.solventaccess G2.charge
## 0.32914375 0.33857031 0.44854674 0.83267871
## G2.hydrophobicity G2.normwaalsvolume G2.polarity G2.polarizability
## 0.38177533 0.37627651 0.32600157 0.45326002
## G2.secondarystruct G2.solventaccess G3.charge G3.hydrophobicity
## 0.35585232 0.28515318 0.08641005 0.33307148
## G3.normwaalsvolume G3.polarity G3.polarizability G3.secondarystruct
## 0.21759623 0.29850746 0.21759623 0.30557738
## G3.solventaccess
## 0.26630008
Transition (T) characterizes the percent frequency from a type of
amino acids to another type (Wang et al., 2017). For instance, a
transition from class 1
to 2
or 2
to 1
is the percent frequency with which class
1
is followed by class 2
or vice versa (Xiao et al., 2015). The frequency of
these transitions can be computed as follow:
$$ T_{(rs)}=\frac{n_{rs} + n_{sr}}{L-1} \text{ }\ (rs = 12,13,23) $$
where nrs,nsr are the number
of dipeptide encoded as rs
and sr
in the
sequence and and L is the sequence length.The T descriptor from
the
protein sequences can be calculated with the following command:
# using data.frame provided by getFASTA function as data input
CTDT_df <- calculateCTDT(fasta_df)
#only print out the result for the first row
CTDT_df[1, -1]
## charge.Tr1221.prob charge.Tr1331.prop
## 0.13207547 0.01572327
## charge.Tr2332.prob hydrophobicity.Tr1221.prob
## 0.14465409 0.21462264
## hydrophobicity.Tr1331.prop hydrophobicity.Tr2332.prob
## 0.20047170 0.26415094
## normwaalsvolume.Tr1221.prob normwaalsvolume.Tr1331.prop
## 0.31367925 0.16823899
## normwaalsvolume.Tr2332.prob polarity.Tr1221.prob
## 0.16902516 0.25000000
## polarity.Tr1331.prop polarity.Tr2332.prob
## 0.23977987 0.19261006
## polarizability.Tr1221.prob polarizability.Tr1331.prop
## 0.30660377 0.14072327
## polarizability.Tr2332.prob secondarystruct.Tr1221.prob
## 0.19654088 0.24371069
## secondarystruct.Tr1331.prop secondarystruct.Tr2332.prob
## 0.20676101 0.23663522
## solventaccess.Tr1221.prob solventaccess.Tr1331.prop
## 0.25943396 0.23899371
## solventaccess.Tr2332.prob
## 0.15566038
The distribution measures the chain length within which the first,
25%, 50%, 75%, and 100% of the amino acids of a particular property
(e.g., hydrophobicity) for a certain encoded class are located,
respectively (Dubchak et al.,
1995). For example, as shown in Figure 3, there
are 8 residues as 1
, the position for the first residue
1
, the 2nd residue 1
(25% * 8 = 2), the 5th
1
residue (50% * 8 = 4), the 7th 1
(75% * 8=
6) and the 10th residue 2
(100% * 8 =8) in the encoded
sequence are 1, 2, 13, 17, 22 respectively, so that the distribution
descriptors for residue 1
are : (1/22) ×100% = 4.55%,
(2/22)×100% = 9.09%, (13/22) ×100% = 59.09%, (17/22)×100% = 77.27%,
(22/22)×100% = 100%, respectively. Likewise, the distribution descriptor
for 2
and 3
is (18.18%, 18.18%, 27.27%,
63.64%, 95.45%) and (13.64%, 31.82%, 45.45%, 54.55%, 86.36%),
respectively.
Figure 2:The sequence of hypothetical protein showing the
construction of CTD descriptors of a protein. The index 1, 2
and 3 indicates the position of amino acid for each encoded class. 1-2
transitions indicated the position of 12
or 21
pairs in the sequence. Similarly, 1-3 and 2-3 transitions are defined in
the same way.
The D descriptor from the protein sequences can be calculated with the following command:
# using data.frame provided by getFASTA function as data input
CTDD_df <- calculateCTDD(fasta_df)
#only print out the first 30 elements for the first row
CTDD_df[1, c(2:31)]
## charge.g1.residue0 charge.g1.residue100
## 1.6496465 99.6857816
## charge.g1.residue25 charge.g1.residue50
## 23.5663786 43.7549097
## charge.g1.residue75 charge.g2.residue0
## 75.7266300 0.0785546
## charge.g2.residue100 charge.g2.residue25
## 100.0000000 24.9018068
## charge.g2.residue50 charge.g2.residue75
## 49.9607227 74.1555381
## charge.g3.residue0 charge.g3.residue100
## 3.1421838 99.1358995
## charge.g3.residue25 charge.g3.residue50
## 26.7085625 55.1453260
## charge.g3.residue75 hydrophobicity.g1.residue0
## 80.9897879 1.0997643
## hydrophobicity.g1.residue100 hydrophobicity.g1.residue25
## 99.6857816 25.7659073
## hydrophobicity.g1.residue50 hydrophobicity.g1.residue75
## 50.3534957 75.8051846
## hydrophobicity.g2.residue0 hydrophobicity.g2.residue100
## 0.7069914 100.0000000
## hydrophobicity.g2.residue25 hydrophobicity.g2.residue50
## 24.7446976 49.4893951
## hydrophobicity.g2.residue75 hydrophobicity.g3.residue0
## 71.4846819 0.0785546
## hydrophobicity.g3.residue100 hydrophobicity.g3.residue25
## 99.7643362 23.8020424
## hydrophobicity.g3.residue50 hydrophobicity.g3.residue75
## 50.5106049 76.9835035
The conjoint triad is one of the popular sequence-based approaches for protein-protein interactions prediction (Shen et al., 2007). This method encodes a protein sequence as a feature vector by calculating the frequency of amino acid triplets as follows (Figure 2) :
Similar to section 3.3.2.4
, it encodes 20 amino
acids to seven classes based on their dipoles and volumes of the side
chains. These seven classes are [{AGV}, {DE}, {FILP}, {KR}, {MSTY}, and
{C} (Shen et al., 2007)]
A given protein sequence is then represented using three consecutive amino acids (i.e., amino acid triple).
It uses 343-dimensional feature vectors to represent a given protein sequence, where then each feature vector v is then mapped to frequency vector di (i= 1,2,…343), which is defined as follow:
$$ d_i = \frac{f_i - \min\{\,f_1, f_2 , \ldots, f_{343}\,\}}{\max\{\,f_1, f_2, \ ldots, f_{343}\,\}} $$
Where fi is the frequency of i-th triplet type in the protein sequence. The numerical value of di of each protein ranges between 0 to 1, which therefore allows the comparison between proteins.
Figure 3: Schematic diagram for constructing conjoint triad method. V is the vector space of feature vectors that includes a fixed number of features; each feature (vi) includes amino acid triplet; F represents the frequency vector corresponding to V, and the value of i-th dimension of F(fi) corresponds to the frequency of that vi-triad observed in the sequence.
The conjoint triad Descriptor descriptor from the protein sequences can be calculated with the following command:
# using data.frame provided by getFASTA function as data input
CTriad_df <- calculateCTriad(fasta_df)
#only print out the first 30 elements for the first row
CTriad_df[1, c(2:31)]
## 1 2 3 4 5 6 7
## 0.78409091 0.78409091 0.61363636 0.53409091 0.38636364 0.56818182 0.09090909
## 8 9 10 11 12 13 14
## 0.72727273 0.79545455 0.77272727 0.42045455 0.35227273 0.42045455 0.10227273
## 15 16 17 18 19 20 21
## 0.68181818 0.76136364 0.63636364 0.43181818 0.34090909 0.30681818 0.11363636
## 22 23 24 25 26 27 28
## 0.46590909 0.46590909 0.39772727 0.20454545 0.35227273 0.14772727 0.07954545
## 29 30
## 0.29545455 0.44318182
Autocorrelation descriptors, also known as molecular connectivity indices, explain the magnitude of the correlation between protein or peptide sequences based on their particular structural or physiochemical information, which are defined according to the distribution of amino acid properties along the protein sequence (Ong et al., 2007). Eight default properties (Xiao et al., 2015) are used here for deriving the autocorrelation descriptors: normalized average hydrophobicity scales (AccNo. CIDH920105), average flexibility indices (AccNo. BHAR88010), polarizability parameter (AccNo. CHAM820101), free energy of solution in water(AccNo. CHAM820102), residue accessible surface area in tripeptide (AccNo. CHOC760101), residue volume (AccNo. BIGC670101), steric parameter (AccNo. CHAM810101), and relative mutability (AccNo. DAYM780201). Autocorrelation descriptors includes three types of descriptors (Morean-Broto, Moran, and Geary) which are described below. Prior to integrating any of the physiochemical attributes into the autocorrelation formula, these attributes need to be normalized by the following equation:
$$ P_r = \frac{P_r - \bar{P}}{\sigma} $$ where P̄ is the mean value of the eight physiochemical attributes, and sigma represents the standard deviation, in which both can be defined as:
$$ \bar{P} = \frac{\sum_{r=1}^{20} P_r}{20} \quad \textrm{and} \quad \sigma = \sqrt{\frac{1}{2} \sum_{r=1}^{20} (P_r - \bar{P})^2} $$
The first type of autocorrelation is known as Moreau-Broto autocorrelation (Broto et al., 1984). Application of Moreau-Broto autocorrelation to protein sequence is calculated by the following equation:
$$ AC(d) = \sum_{i=1}^{L-d} P_i P_{i + d} \quad d = 1, 2, \ldots, \textrm{nlag} $$
where Pi and Pi + d represent the amino acid property at position i and i + d and d is termed the lag of the autocorrelation along the protein sequence; Pi and Pi + d. While, nlag is the maximum value of the lag. This equation can be normalized based on peptide length to get normalized Moreau-Broto autocorrelation:
$$ ATS(d) = \frac{AC(d)}{L-d} \quad d = 1, 2, \ldots, \textrm{nlag} $$
The second type of autocorrelation, named the Moran autocorrelation (Moran, 1950), can be defined as:
$$ I(d) = \frac{\frac{1}{L-d} \sum_{i=1}^{L-d} (P_i - \bar{P}') (P_{i+d} - \bar{P}')}{\frac{1}{L} \sum_{i=1}^{L} (P_i - \bar{P}')^2} \quad d = 1, 2, \ldots, 30 $$
where d, Pi, and Pi + d are described in the same fashion as that for Moreau-Broto autocorrelation; P̄′ is the mean of the given amino acid property P across the protein sequence, i.e.,
$$ \bar{P}' = \frac{\sum_{i=1}^L P_i}{L} $$
d, P, Pi and Pi + d, nlag are defined as above. The main difference between Moran and Moreau-Broto autocorrelation is that, unlike Moreau-Broto, the Moran autocorrelation utilizes the mean value of the given amino acid property instead of the actual value of the property (Al-Barakati et al., 2019).
The last type of autocorrelation , known as the Geary autocorrelation, can be calculated as: $$ C(d) = \frac{\frac{1}{2(L-d)} \sum_{i=1}^{L-d} (P_i - P_{i+d})^2}{\frac{1}{L-1} \sum_{i=1}^{L} (P_i - \bar{P}')^2} \quad d = 1, 2, \ldots, 30 $$
where d, P, Pi, Pi + d, and nlag are defined above. The key difference between Geary and the other two autocorrelations is that the Geary autocorrelation utilizes the square difference of the property values (Al-Barakati et al., 2019).
Computing autocorrelation with HPiP is simple as the following commands:
# using data.frame provided by getFASTA function as data input
moran_df <- calculateAutocor(fasta_df,type = "moran", nlag = 10)
#only print out the first 30 elements for the first row
moran_df[1, c(2:31)]
## CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4
## -2.277533e-02 -6.586333e-02 -6.065475e-03 7.793094e-03
## CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8
## -1.965556e-02 -2.291896e-02 3.609957e-02 2.404702e-02
## CIDH920105.lag9 CIDH920105.lag10 BHAR880101.lag1 BHAR880101.lag2
## -7.224129e-03 -1.962279e-02 5.838844e-04 -3.936188e-02
## BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6
## -4.993847e-04 1.610445e-02 7.256299e-03 2.432986e-03
## BHAR880101.lag7 BHAR880101.lag8 BHAR880101.lag9 BHAR880101.lag10
## 2.951910e-02 1.431268e-02 -9.954176e-03 -6.734817e-03
## CHAM820101.lag1 CHAM820101.lag2 CHAM820101.lag3 CHAM820101.lag4
## 2.246027e-02 1.045731e-04 1.598308e-02 -2.012865e-02
## CHAM820101.lag5 CHAM820101.lag6 CHAM820101.lag7 CHAM820101.lag8
## -3.081543e-05 -2.027915e-02 2.091813e-02 1.198716e-02
## CHAM820101.lag9 CHAM820101.lag10
## -3.860567e-03 -2.132450e-02
# using data.frame provided by getFASTA function as data input
mb_df <- calculateAutocor(fasta_df,type = "moreaubroto", nlag = 10)
#only print out the first 30 elements for the first row
mb_df[1, c(2:31)]
## CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4
## -0.0188245843 -0.0605085976 -0.0026823499 0.0107362310
## CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8
## -0.0157906557 -0.0189568440 0.0381003015 0.0264468353
## CIDH920105.lag9 CIDH920105.lag10 BHAR880101.lag1 BHAR880101.lag2
## -0.0037862857 -0.0157775749 0.0082438732 -0.0239249619
## BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6
## 0.0073538778 0.0207404479 0.0136327763 0.0097268396
## BHAR880101.lag7 BHAR880101.lag8 BHAR880101.lag9 BHAR880101.lag10
## 0.0315019066 0.0192685645 -0.0002740419 0.0023150775
## CHAM820101.lag1 CHAM820101.lag2 CHAM820101.lag3 CHAM820101.lag4
## 0.0646786378 0.0473994319 0.0596316658 0.0317531550
## CHAM820101.lag5 CHAM820101.lag6 CHAM820101.lag7 CHAM820101.lag8
## 0.0472702127 0.0316032556 0.0633505635 0.0564643542
## CHAM820101.lag9 CHAM820101.lag10
## 0.0442242141 0.0307261016
# using data.frame provided by getFASTA function as data input
geary_df <- calculateAutocor(fasta_df,type = "geary", nlag = 10)
#only print out the first 30 elements for the first row
geary_df[1, c(2:31)]
## CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4
## 1.0226568 1.0657160 1.0059053 0.9920253
## CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8
## 1.0194498 1.0227587 0.9637505 0.9757426
## CIDH920105.lag9 CIDH920105.lag10 BHAR880101.lag1 BHAR880101.lag2
## 1.0068772 1.0192243 0.9991465 1.0391630
## BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6
## 1.0003447 0.9837570 0.9926833 0.9974919
## BHAR880101.lag7 BHAR880101.lag8 BHAR880101.lag9 BHAR880101.lag10
## 0.9703168 0.9853979 1.0097805 1.0066186
## CHAM820101.lag1 CHAM820101.lag2 CHAM820101.lag3 CHAM820101.lag4
## 0.9774946 0.9999835 0.9841586 1.0203869
## CHAM820101.lag5 CHAM820101.lag6 CHAM820101.lag7 CHAM820101.lag8
## 1.0004284 1.0207839 0.9794036 0.9882033
## CHAM820101.lag9 CHAM820101.lag10
## 1.0041758 1.0217759
The k-spaced amino acid pairs (KSAAP) feature describes the number of
occurrences of all possible amino acid pairs by a distance of k, which
can be any number of residues up to two less than the protein length
(Al-Barakati et al., 2019). For
instance, given 400 (20 x 20) amino acid pairs and four values for k (k
= 1-4), there would be 1600 attributes resulted from the KSAAP feature,
and the frequency of each amino acid pair with k spaces is calculated by
sliding through protein sequence one by once. Table 2 illustrates the
outputs of using KSAAP features with various values of k for protein
sequence ARAQRTAAADARAKAAKAGCAARRAAATANYN
.
K = 1 | Amino Acid pairs | AXA | AXC | AXD | AXE | —- | YXY |
Numbe of occurances | 8 | 1 | 1 | 0 | —- | 1 | |
K =2 | Amino Acid pairs | AXXA | AXXC | AXXD | AXXE | —- | YXXY |
Numbe of occurances | 7 | 0 | 1 | 0 | —- | 0 | |
K = 3 | Amino Acid pairs | AXXXA | AXXXC | AXXXD | AXXXE | —- | YXXXY |
Numbe of occurances | 8 | 1 | 1 | 0 | —- | 0 | |
K = 4 | Amino Acid pairs | AXXXXA | AXXXXC | AXXXXD | AXXXXE | —- | YXXXXY |
Numbe of occurances | 7 | 1 | 0 | 0 | —- | 0 | |
The KSAAP descriptor from the protein sequences can be calculated with the following command:
# using data.frame provided by getFASTA function as data input
KSAAP_df <- calculateKSAAP(fasta_df)
#only print out the first 30 elements for the first row
KSAAP_df[1, c(2:31)]
## AsssA RsssA NsssA DsssA CsssA EsssA
## 0.0067643743 0.0018320180 0.0026775648 0.0028184893 0.0028184893 0.0031003382
## QsssA GsssA HsssA IsssA LsssA KsssA
## 0.0011273957 0.0036640361 0.0047914318 0.0066234498 0.0025366404 0.0040868095
## MsssA FsssA PsssA SsssA TsssA WsssA
## 0.0022547914 0.0016910936 0.0016910936 0.0039458850 0.0059188275 0.0056369786
## YsssA VsssA AsssR RsssR NsssR DsssR
## 0.0009864713 0.0036640361 0.0025366404 0.0012683202 0.0016910936 0.0009864713
## CsssR EsssR QsssR GsssR HsssR IsssR
## 0.0007046223 0.0023957159 0.0015501691 0.0012683202 0.0021138670 0.0029594138
Binary encoding (BE) can be used to transform each residue in a protein sequence into 20 coding values (Al-Barakati et al., 2019). For example, ALa is described as (10000000000000000000) while Cys is defined as (01000000000000000000), etc. Thus, the total length of this feature is 400(20 * 20) vectors.
Alternatively, we can directly read the FASTA sequences into R using
Biostrings
package (Pagès et al., 2019),
followed by converting the protein sequences into numerical
features.
#loading the package
library(Biostrings)
#Read fasta sequences provided by HPiP package using Biostrings
fasta <-
readAAStringSet(system.file("extdata/UP000464024.fasta", package="HPiP"),
use.names=TRUE)
#Convert to df
fasta_bios = data.frame(ID=names(fasta),sequences=as.character(fasta))
#Extract the UniProt identifier
fasta_bios$ID <- sub(".*[|]([^.]+)[|].*", "\\1", fasta_bios$ID)
# for example, run ACC
acc_bios <- calculateAAC(fasta_bios)
fasta_bios
can be used as data
input for all the descriptors listed in section
3.3.2
.SummerizedExperiment
objects can be used to store and merge rectangular matrices of different
outputs, as long as they have similar rownames
or
colnames
. As illustrated in section 3.3.2
, all
the computed data.frames have the same rownames
but
different features; therefore, we can easily use the cbind
functions to merge multiple SummerizedExperiment
objects to
one object. The HPiP package provides two example SummarizedExperiment
objects: viral_se
and host_se
.
viral_se
includes pre-computed (CTD)
numerical features per viral proteins present in the
Gold_ReferenceSet
. Similarly,host_se
includes
(CTD) pre-computed numerical features per host proteins
in the Gold_ReferenceSet
.
## class: SummarizedExperiment
## dim: 13 147
## metadata(0):
## assays(1): counts
## rownames(13): P0DTD1 P0DTC7 ... P0DTC9 P0DTC2
## rowData names(0):
## colnames(147): G1.charge G1.hydrophobicity ...
## solventaccess.Tr1331.prop solventaccess.Tr2332.prob
## colData names(1): X
## class: SummarizedExperiment
## dim: 785 147
## metadata(0):
## assays(1): counts
## rownames(785): Q9Y2Z0 Q6NUM9 ... Q02880 Q7KZF4
## rowData names(0):
## colnames(147): G1.charge G1.hydrophobicity ...
## solventaccess.Tr1331.prop solventaccess.Tr2332.prob
## colData names(1): X
The numerical features from each SummarizedExperiment object can be
easily retrieved using the assays()$counts
, where each row
represent the viral or host proteins and each column represents one of
the numerical features.
As an example, construction of SummarizedExperiment
for
viral proteins using CTD descriptors is as follows:
#generate descriptors
CTDC_df <- calculateCTDC(fasta_df)
CTDC_m <- as.matrix(CTDC_df[, -1])
row.names(CTDC_m) <- CTDC_df$identifier
CTDT_df <- calculateCTDT(fasta_df)
CTDT_m <- as.matrix(CTDT_df[, -1])
row.names(CTDT_m) <- CTDT_df$identifier
CTDD_df <- calculateCTDD(fasta_df)
CTDD_m <- as.matrix(CTDD_df[, -1])
row.names(CTDD_m) <- CTDD_df$identifier
#convert matrix to se object
ctdc_se <- SummarizedExperiment(assays = list(counts = CTDC_m),
colData = paste0(colnames(CTDC_df[,-1]),
"CTDC"))
ctdt_se <- SummarizedExperiment(assays = list(counts = CTDT_m),
colData = paste0(colnames(CTDT_df[,-1]),
"CTDT"))
ctdd_se <- SummarizedExperiment(assays = list(counts = CTDD_m),
colData = paste0(colnames(CTDD_df[,-1]),
"CTDD"))
#combine all se objects to one
viral_se <- cbind(ctdc_se,ctdd_se,ctdt_se)
Descriptor name | Length vectors | Function |
---|---|---|
Amino Acid Composition | 20 | calculateAAC() |
Dipeptide Composition | 400 | calculateDC() |
Tripeptide Composition | 8000 | calculateTC() |
Tripeptide Composition (TC) from BS Classes | 216 | calculateTC_Sm() |
Quadruples Composition (QC) | 1296 | calculateQD_Sm() |
F1 | 20 | calculateF1() |
F2 | 20 | calculateF2() |
Composition | 21 | calculateCTDC() |
Transition | 21 | calculateCTDT() |
Distribution | 105 | calculateCTDD() |
Conjoint Triad | 343 | calculateCTriad() |
Geary (Default nlag = 30) | 240 | calculateGeary() |
Moran (Default nlag = 30) | 240 | calculateMoran() |
Normalized (Default nlag = 30) | 240 | calculateMoreauBroto() |
k-Spaced Amino Acid Pairs | 400 | calculateKSAAP() |
Binary encoding | 400 | calculateBE() |
3.2.1-3.2.14
).To generate host-pathogen protein-protein interaction descriptors,
sequence-based descriptors can be combined into one vector space using
getHPI()
, which provides two types of interactions,
controlled by argument type
. To illustrate the usage of
getHPI
, we will continue our example from section
3.2.16
1.Extraction of numerical features from viral_se
and host_se
objects
#extract features from viral_se
counts_v <- assays(viral_se)$counts
#extract row.names from viral_Se
rnames_v <- row.names(counts_v)
#extract features from host_se
counts_h <- assays(host_se)$counts
#extract row.names from viral_Se
rnames_h <- row.names(counts_h)
2.Map the features to the gold-standard data:
#Loading gold-standard data
gd <- Gold_ReferenceSet
x1_viral <- matrix(NA, nrow = nrow(gd), ncol = ncol(counts_v))
for (i in 1:nrow(gd))
x1_viral[i, ] <- counts_v[which(gd$Pathogen_Protein[i] == rnames_v), ]
x1_host <- matrix(NA, nrow = nrow(gd), ncol = ncol(counts_h))
for (i in 1:nrow(gd))
x1_host[i, ] <- counts_h[which(gd$Host_Protein[i] == rnames_h), ]
3.Generate host-pathogen interaction descriptors using
getHPI
:
It is crucial to pre-process the data (i.e., remove the noise) before feeding it into the machine learning model as the quality of data and valuable information that can be extracted from it directly affect the model’s performance. The pre-processing steps are as follow:
Handling missing values: in any
real-world data set, there are always missing values. The easiest option
is to remove rows or columns including missing values; however, such an
approach results in losing valuable information. The alternative method
is to impute missing values using neighboring information (e.g., average
or median) or replace the missing values with zeros. HPiP package
provides two functions to deal with the missing values. The
filter_missing_values
allows the user to drop the missing
values above a certain threshold, controlled by argument
max_miss_rate
, while the impute_missing_data
function replaces the null values with mean/median or zero, controlled
by argument method
.
Feature selection: some of the
sequence-based features are high dimensional, including hundreds to
thousands of features. Unfortunately, such high-dimensional data
includes many redundant features that reduce the predictive model’s
accuracy and increase the training time. The
FSmethod
function in the HPiP package combines two feature
selection (FS) methods, controlled by type()
argument, to
eliminate redundant features. The first FS method is based on
correlation analysis that computes the correlation between features
using Pearson correlation measure and removes highly correlated features
above the user-defined threshold. The second FS method uses the
Recursive Feature Elimination (RFE) algorithm (wrapped up with a random
forest (rf) machine learning algorithm) to perform feature selection.
RFE works by fitting the rf algorithm with all features in the
training data set, ranking features by importance, removing the least
important features, and re-fitting the model until the desired number of
features remains. The feature importance can be computed using
rf model-independent metric (e.g., ROC curve analysis or
accuracy), which is controlled by argument
metric()
.
The complete set of arguments for FSmethod
function
are:
x
A data.frame containing protein-protein interactions,
class labels and features.type
The feature selection typecor.cutoff
Correlation coefficient cutoff used for
filtering.resampling.method
The re-sampling method (e.g.,
k-fold cross-validation) for RFE.iter
Number of partitions for cross-validation.repeats
For repeated k-fold cross validation
only.metric
A string that specifies what summary metric will
be used to select the optimal feature.verbose
Make the output verbose.Continuing our example from section 3.3, feature selection using both correlation analysis and RFE approach can be performed using the following command:
#to use correlation analysis, make sure to drop the columns with sd zero
xx <- Filter(function(x) sd(x) != 0, x[,-c(1,2)])
xx <- cbind(x$PPI, x$class, xx)
colnames(xx)[1:2] <- c("PPI", "class")
#perform feature selection using both correlation analysis and RFE approach
set.seed(101)
x_FS <- FSmethod(xx, type = c("both"),
cor.cutoff = 0.8,resampling.method = "cv",
iter = 2,repeats =NULL, metric = "Accuracy",
verbose = FALSE)
We can also visualize the results from the FSmethod
analysis. For instance, the correlation matrix of unfiltered data can be
visualized using the corr_plot
. This will present us with a
heatmap showing the correlation between all the features prior to
filtration.
In addition, the variable importance of retained features after the RFE
feature selection approach can also be plotted using the
var_imp
function.
Sequence features and a list of gold-standard HP-PPIs can be fed into
an ensemble classifier to rank the potential HP-PPIs interaction. This
is accomplished via the pred_ensmebel
function. This
function uses the ensemble averaging approach, to combine any base
classifiers provided in the caret
package to predict
HP-PPIs. To score interactions, the pred_ensmebel
function
uses the the training data (i.e., labelled HP-PPIs with sequence
features) as well as unlabeled HP-PPIs data set to learn features that
allow reliable prediction of HP-PPIs, utilizing multiple ML algorithms.
Following training, the trained models will be used to make predictions
on unlabeled data. Then,ensemble averaging will be applied to average
predictions over an ensemble of classifiers, each with different
cross-validation splits. Finally, through resamplign techniques (e.g.,
k-fold cross-validation), this function also compares and
evaluates the performance of ensemble model with individual machine
learning models via commonly used measurements such as Recall
(Sensitivity), Specificity, Accuracy , Precision, F1-score, and Matthews
correlation coefficient (MCC). The corresponding formulae are as
follows:
$$ Recall=Sensitivity=TPR=\frac{TP}{TP+FN} $$
$$ Specificity=1-FPR=\frac{TN}{TN+FP} $$
$$ Accuracy=\frac{TP+TN}{TP+TN+FP+FN} $$
$$ Precision=\frac{TP}{TP+FP} $$
$$ F1=2 \text{ × } \frac{Precision \text{ × } Recall}{Precision + Recall} $$
$$ MCC=\frac{TP \text{ × } TN - FP \text{ × } FN}{\sqrt{(TP+FP)\text{ × } (TP+FN)\text{ × } (TN+FP)\text{ × } (TN+FN)}} $$
The pred_ensemble
takes the following parameters:
features
A data frame with host-pathogen
protein-protein interactions (HP-PPIs) in the first column, and features
to be passed to the classifier in the remaining columns.gold_standard
A data frame with gold_standard HP-PPIs
and class label indicating if such PPIs are positive or negative.classifier
The type of classifier to use. See
caret
package for all the available classifiers.resampling.method
The re-sampling technique (i.e.,
k-fold cross-validation).ncross
Number of partitions for cross-validation.plots
Logical value, indicating whether to plot the
performance of ensemble learning algorithm as compared to individual
classifiers.If the argument set to TRUE, plots will be saved in current
working directory.verboseIter
Make the output verbose.filename
A character string, indicating the output
filename as an pdf object.As an example, we will use three base learners , support vector
machine (svmRadial
), Fitting Generalized Linear Models
(glm
), and random forest (ranger
), controlled
by argument classifier
, to rank potential interaction. For
the sake of time, we use only five-fold cross-validation
(ncross = 5
). In order to perform prediction, we will use
unlabel_data
, retrieved from Supplementary Table 1
presented in (Gordon et al.,
2020) , which includes unlabeled HP-PPIs along with pre-computed
CTD
features, as well as constructed data containing
labeled HP-PPIs from section 3.4
.
#load the unlabeled HP-PPIs
data('unlabel_data')
#Constructed labeled HP-PPIs
labeled_dat <- x_FS$rf.result$rfdf
labeled_dat <- labeled_dat[,-1]
#select important features
unlabel_data <-
unlabel_data[names(unlabel_data) %in% names(x_FS$rf.result$rfdf)]
#merge them
ind_data <- rbind(unlabel_data,labeled_dat)
Now we can predict interactions using pred_ensembel
:
set.seed(102)
ppi <- pred_ensembel(ind_data,
gd,
classifier = c("svmRadial", "glm", "ranger"),
resampling.method = "cv",
ncross = 5,
verboseIter = TRUE,
plots = TRUE,
filename=file.path(tempdir(), "plots.pdf"))
To retrieve predicted interactions from the result generated by
pred_ensembel
function, we can just type:
## PPI Pathogen_protein Host_protein ensemble
## 1 M:P0DTC5~Q9UBM1 M:P0DTC5 Q9UBM1 0.7008682
## 2 nsp6:P0DTD1~Q8WVX9 nsp6:P0DTD1 Q8WVX9 0.6857571
## 3 ORF6:P0DTC6~P62861 ORF6:P0DTC6 P62861 0.6921793
## 4 nsp4:P0DTD1~Q6P2Q9 nsp4:P0DTD1 Q6P2Q9 0.6889793
## 5 M:P0DTC5~O75489 M:P0DTC5 O75489 0.6986460
## 6 nsp6:P0DTD1~Q8TEM1 nsp6:P0DTD1 Q8TEM1 0.6923126
Finally, users can subset their list of high-confidence interactions for further analysis, using a stringent classifier confidence ensemble score cutoff of 0.7:
## [1] 100 4
When the plots
argument set to TRUE, the
pred_ensembel
function generates one pdf file indicating
the performance of the ensemble classier as compared to individual base
learners
The first plot shows the Receiver Operating Characteristic (ROC) curve.
Figure 4: ROC_Curve curve.
The second plot shows the Precision-Recall (PR) curve
Figure 5: Precision-Recall (PR) curve.
The third plot shows the accuracy (ACC), F1-score ,positive predictive value (PPV),sensitivity (SE),and Matthews correlation coefficient (MCC) of ensemble classifier vs selected individual classifiers.
Figure 6: Point plot.
Following PPI prediction, users can visualize the predicted PPI
network using plotPPI
and FreqInteractors
functions.
plotPPI
function, which is based on the
igraph plotting function (Csardi,
2013), provide visualization on interacting host protein partners
of pathogen proteins. For instance, to get the PPI network of
SARS-CoV2-ORF8-human, we can run the following command:S_interc <- filter(pred_interactions,
str_detect(Pathogen_protein, "^ORF8:"))
#drop the first column
ppi <- S_interc[,-1]
plotPPI(ppi, edge.name = "ensemble",
node.color ="red",
edge.color = "grey",
cex.node = 10,
node.label.dist= 2)
FreqInteractors
function, shows the degree
distribution of pathogen proteins in the HP-PPI network:To identify significantly enriched annotation terms in predicted
interacting host protein partners of each pathogen protein, we can use
the enrichfindP
function based on the g:Profiler
tool (Kolberg et al., 2020).
Additionally, we can use the erichfind_hp
function to
analyze functional characteristics of all predicted human proteins in
the predicted network.
For instance, the following command can be used to performs functional enrichment analysis of host protein partners of each pathogen protein:
for this analysis, we utilizes the predicted HP-PPIS network
generated in section 3.5
as input data for complex
prediction. For the community detection analysis,
run_clustering
function provided in the HPiP package
includes the five most popular complex detection algorithms in including
fast-greedy, walktrap, label propagation, multi-level community, and
markov clustering. For example, to detect complexes using fast-greedy
algorithm, we can run the following command:
Additionally, we can analyze the functional characteristics of these
predicted modules via enrichfind_cpx
function.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] caret_6.0-94 lattice_0.22-6
## [3] ggplot2_3.5.1 HPiP_1.13.0
## [5] tibble_3.2.1 stringr_1.5.1
## [7] SummarizedExperiment_1.35.5 Biobase_2.67.0
## [9] GenomicRanges_1.57.2 MatrixGenerics_1.17.1
## [11] matrixStats_1.4.1 Biostrings_2.75.0
## [13] GenomeInfoDb_1.41.2 XVector_0.45.0
## [15] IRanges_2.39.2 S4Vectors_0.43.2
## [17] BiocGenerics_0.53.0 dplyr_1.1.4
## [19] readr_2.1.5 knitr_1.48
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 pROC_1.18.5 gridExtra_2.3
## [4] rlang_1.1.4 magrittr_2.0.3 e1071_1.7-16
## [7] compiler_4.4.1 gprofiler2_0.2.3 reshape2_1.4.4
## [10] vctrs_0.6.5 pkgconfig_2.0.3 crayon_1.5.3
## [13] fastmap_1.2.0 labeling_0.4.3 PRROC_1.3.1
## [16] utf8_1.2.4 rmarkdown_2.28 prodlim_2024.06.25
## [19] tzdb_0.4.0 UCSC.utils_1.1.0 bit_4.5.0
## [22] purrr_1.0.2 xfun_0.48 randomForest_4.7-1.2
## [25] zlibbioc_1.51.2 cachem_1.1.0 jsonlite_1.8.9
## [28] recipes_1.1.0 highr_0.11 DelayedArray_0.31.14
## [31] parallel_4.4.1 R6_2.5.1 bslib_0.8.0
## [34] stringi_1.8.4 ranger_0.16.0 parallelly_1.38.0
## [37] rpart_4.1.23 lubridate_1.9.3 jquerylib_0.1.4
## [40] Rcpp_1.0.13 iterators_1.0.14 future.apply_1.11.3
## [43] igraph_2.1.1 Matrix_1.7-1 splines_4.4.1
## [46] nnet_7.3-19 timechange_0.3.0 tidyselect_1.2.1
## [49] abind_1.4-8 yaml_2.3.10 timeDate_4041.110
## [52] codetools_0.2-20 listenv_0.9.1 plyr_1.8.9
## [55] withr_3.0.2 evaluate_1.0.1 future_1.34.0
## [58] survival_3.7-0 proxy_0.4-27 kernlab_0.9-33
## [61] pillar_1.9.0 BiocManager_1.30.25 corrplot_0.95
## [64] foreach_1.5.2 plotly_4.10.4 generics_0.1.3
## [67] RCurl_1.98-1.16 vroom_1.6.5 hms_1.1.3
## [70] munsell_0.5.1 scales_1.3.0 globals_0.16.3
## [73] class_7.3-22 glue_1.8.0 lazyeval_0.2.2
## [76] maketools_1.3.1 tools_4.4.1 sys_3.4.3
## [79] data.table_1.16.2 ModelMetrics_1.2.2.2 gower_1.0.1
## [82] buildtools_1.0.0 grid_4.4.1 tidyr_1.3.1
## [85] ipred_0.9-15 colorspace_2.1-1 nlme_3.1-166
## [88] GenomeInfoDbData_1.2.13 cli_3.6.3 protr_1.7-4
## [91] fansi_1.0.6 expm_1.0-0 viridisLite_0.4.2
## [94] ggthemes_5.1.0 S4Arrays_1.5.11 lava_1.8.0
## [97] gtable_0.3.6 MCL_1.0 sass_0.4.9
## [100] digest_0.6.37 SparseArray_1.5.45 htmlwidgets_1.6.4
## [103] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
## [106] hardhat_1.4.0 httr_1.4.7 bit64_4.5.2
## [109] MASS_7.3-61