Title | A Deep-Learning Framework Based on Pre-trained Sequence Embeddings for Predicting Host-Viral Protein-Protein Interactions
Description | Emerging infectious diseases, exemplified by the zoonotic COVID-19 pandemic caused by SARS-CoV-2, are grave global threats. Understanding protein-protein interactions (PPIs) between host and viral proteins is essential for therapeutic targets and insights into pathogen replication and immune evasion. While experimental methods like yeast two-hybrid screening and mass spectrometry provide valuable insights, they are hindered by experimental noise and costs, yielding incomplete interaction maps. Computational models, notably DeProViR, predict PPIs from amino acid sequences, incorporating semantic information with GloVe embeddings. DeProViR employs a Siamese neural network, integrating convolutional and Bi-LSTM networks to enhance accuracy. It overcomes the limitations of feature engineering, offering an efficient means to predict host-virus interactions, which holds promise for antiviral therapies and advancing our understanding of infectious diseases.
Authors | Matineh Rahmatbakhsh [aut, trl, cre]
Maintainer | Matineh Rahmatbakhsh <[email protected]>
License | MIT + file LICENSE
Version | 1.3.0
Built | 2024-10-30 06:40:33 UTC
Source | https://github.com/bioc/DeProViR
This function first encodes amino acids as a sequence of 20 unique integers through a tokenizer. A padding token is added to the front of shorter sequences to ensure a fixed-length vector of defined size L (here, 1000). An embedding matrix is then constructed to map amino acid tokens to pre-trained embedding weights, in which rows correspond to the amino acid tokens created earlier and columns correspond to 100-dimensional weight vectors derived from the GloVe word-vector map.
encodeHostSeq(trainingSet, embeddings_index)
trainingSet | a data.frame containing training information
embeddings_index | embedding index returned by gloveImport()
A list containing the embedding matrix and the tokenized sequences
# Download and load the index
embeddings_index <- gloveImport()
# load training set
dt <- loadTrainingSet()
# encoding
encoded_host_seq <- encodeHostSeq(dt, embeddings_index)
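The encoding performed by encodeHostSeq() and encodeViralSeq() can be illustrated with generic keras utilities. The lines below are a minimal sketch of the same idea (integer-tokenize residues, pre-pad to a fixed length of 1000, and build a 100-dimension embedding matrix from a GloVe index), reusing the embeddings_index from the example above; the toy sequences, and the assumption that embeddings_index behaves like a named list of 100-dimensional numeric vectors, are illustrative and not part of the DeProViR API.

library(keras)

# Toy host sequences, space-separated residues (illustrative only)
host_seqs <- c("M K V L A A G", "M S T N P K P Q R")

# Integer-tokenize the residues; the full amino acid alphabet maps to integers 1..20
tok <- text_tokenizer()
fit_text_tokenizer(tok, host_seqs)
seqs <- texts_to_sequences(tok, host_seqs)

# Pre-pad shorter sequences so every vector has the fixed length L = 1000
x <- pad_sequences(seqs, maxlen = 1000, padding = "pre")

# Build a (vocabulary + 1) x 100 embedding matrix from a GloVe index,
# assuming `embeddings_index` is a named list of 100-d numeric vectors;
# tokens missing from the index keep all-zero rows, and row 1 is
# reserved for the padding token (index 0 in keras).
embedding_matrix <- matrix(0, nrow = length(tok$word_index) + 1, ncol = 100)
for (aa in names(tok$word_index)) {
  vec <- embeddings_index[[aa]]
  if (!is.null(vec)) embedding_matrix[tok$word_index[[aa]] + 1, ] <- vec
}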
This function first encodes amino acids as a sequence of 20 unique integers through a tokenizer. A padding token is added to the front of shorter sequences to ensure a fixed-length vector of defined size L (here, 1000). An embedding matrix is then constructed to map amino acid tokens to pre-trained embedding weights, in which rows correspond to the amino acid tokens created earlier and columns correspond to 100-dimensional weight vectors derived from the GloVe word-vector map.
encodeViralSeq(trainingSet, embeddings_index)
trainingSet | a data.frame containing training information
embeddings_index | embedding index returned by gloveImport()
A list containing the embedding matrix and the tokenized sequences
# Download and load the index
embeddings_index <- gloveImport()
# load training set
dt <- loadTrainingSet()
# encoding
encoded_seq <- encodeViralSeq(dt, embeddings_index)
This function caches and loads pre-trained GloVe vectors (100d).
gloveImport(url_path = "https://nlp.stanford.edu/data")
url_path | URL path to the GloVe embedding. Defaults to "https://nlp.stanford.edu/data".
GloVe embedding index
options(timeout = 240)
embeddings_index <- gloveImport(url_path = "https://nlp.stanford.edu/data")
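For readers curious about what the cached index contains, the lines below are a minimal sketch of how 100-dimensional GloVe vectors can be parsed into a named list; they assume glove.6B.100d.txt has already been downloaded and unzipped into the working directory and are not the package's internal implementation.

# Parse glove.6B.100d.txt into a named list of numeric vectors
glove_file <- "glove.6B.100d.txt"
lines <- readLines(glove_file)
parts <- strsplit(lines, " ", fixed = TRUE)

embeddings_index <- lapply(parts, function(p) as.numeric(p[-1]))  # 100-d weight vectors
names(embeddings_index) <- vapply(parts, `[[`, character(1), 1)   # first field is the token

length(embeddings_index[["protein"]])  # 100, if the token is present in the vocabulary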
This function loads the pre-trained model weights constructed previously using modelTraining().
loadPreTrainedModel(
  input_dim = 20,
  output_dim = 100,
  filters_layer1CNN = 32,
  kernel_size_layer1CNN = 16,
  filters_layer2CNN = 64,
  kernel_size_layer2CNN = 7,
  pool_size = 30,
  layer_lstm = 64,
  units = 8,
  metrics = "AUC",
  filepath = system.file("extdata", "Pre_trainedModel", package = "DeProViR")
)
input_dim | Integer. Size of the vocabulary, i.e., amino acid tokens. Defaults to 20.
output_dim | Integer. Dimension of the dense embedding, i.e., GloVe. Defaults to 100.
filters_layer1CNN | Integer, the dimensionality of the output space (i.e., the number of output filters in the first convolution). Defaults to 32.
kernel_size_layer1CNN | An integer or tuple/list of 2 integers specifying the height and width of the convolution window in the first layer. Can be a single integer to specify the same value for all spatial dimensions. Defaults to 16.
filters_layer2CNN | Integer, the dimensionality of the output space (i.e., the number of output filters in the second convolution). Defaults to 64.
kernel_size_layer2CNN | An integer or tuple/list of 2 integers specifying the height and width of the convolution window in the second layer. Can be a single integer to specify the same value for all spatial dimensions. Defaults to 7.
pool_size | Down-samples the input representation by taking the maximum value over a spatial window of size pool_size. Defaults to 30.
layer_lstm | Number of units in the Bi-LSTM layer. Defaults to 64.
units | Number of units in the MLP layer. Defaults to 8.
metrics | Vector of metric names to be evaluated by the model during training and testing. Defaults to "AUC".
filepath | A character string indicating the path containing the pre-trained model weights, i.e., inst/extdata/Pre_trainedModel.
Pre-trained model.
Loading_trainedModel <- loadPreTrainedModel()
This function loads the demo training set.
loadTrainingSet(
  training_dir = system.file("extdata", "training_Set", package = "DeProViR")
)
training_dir | Directory containing the training data.frame (.csv). Defaults to "extdata/training_Set".
data.frame
dt <- loadTrainingSet()
This function first transforms protein sequences into amino acid tokens, wherein tokens are indexed by positive integers, then represents each amino acid token by the pre-trained co-occurrence embedding vectors learned by GloVe, followed by an embedding layer. It then employs a Siamese-like neural network architecture on top of a densely connected neural net to predict interactions between host and viral proteins (an illustrative sketch of such an architecture is shown after the argument descriptions below).
modelTraining(
  url_path = "https://nlp.stanford.edu/data",
  training_dir = system.file("extdata", "training_Set", package = "DeProViR"),
  input_dim = 20,
  output_dim = 100,
  filters_layer1CNN = 32,
  kernel_size_layer1CNN = 16,
  filters_layer2CNN = 64,
  kernel_size_layer2CNN = 7,
  pool_size = 30,
  layer_lstm = 64,
  units = 8,
  metrics = "AUC",
  cv_fold = 10,
  epochs = 100,
  batch_size = 128,
  plots = TRUE,
  tpath = tempdir(),
  save_model_weights = TRUE,
  filepath = tempdir()
)
url_path | URL path to the GloVe embedding. Defaults to "https://nlp.stanford.edu/data".
training_dir | Directory containing the viral-host training set. See loadTrainingSet().
input_dim | Integer. Size of the vocabulary, i.e., amino acid tokens. Defaults to 20.
output_dim | Integer. Dimension of the dense embedding, i.e., GloVe. Defaults to 100.
filters_layer1CNN | Integer, the dimensionality of the output space (i.e., the number of output filters in the first convolution). Defaults to 32.
kernel_size_layer1CNN | An integer or tuple/list of 2 integers specifying the height and width of the convolution window in the first layer. Can be a single integer to specify the same value for all spatial dimensions. Defaults to 16.
filters_layer2CNN | Integer, the dimensionality of the output space (i.e., the number of output filters in the second convolution). Defaults to 64.
kernel_size_layer2CNN | An integer or tuple/list of 2 integers specifying the height and width of the convolution window in the second layer. Can be a single integer to specify the same value for all spatial dimensions. Defaults to 7.
pool_size | Down-samples the input representation by taking the maximum value over a spatial window of size pool_size. Defaults to 30.
layer_lstm | Number of units in the Bi-LSTM layer. Defaults to 64.
units | Number of units in the MLP layer. Defaults to 8.
metrics | Vector of metric names to be evaluated by the model during training and testing. Defaults to "AUC".
cv_fold | Number of partitions for cross-validation. Defaults to 10.
epochs | Number of epochs to train the model. Defaults to 100.
batch_size | Number of samples per gradient update. Defaults to 128.
plots | If TRUE, a PDF file containing performance measures is generated. Defaults to TRUE.
tpath | A character string indicating the path to the project directory. If the directory is missing, the PDF file containing performance measures is stored in the temp directory.
save_model_weights | If TRUE, saves the trained weights. Defaults to TRUE.
filepath | A character string indicating the path to save the model weights. Defaults to tempdir().
Trained model and performance measures.
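The defaults above correspond to a conventional Siamese CNN + Bi-LSTM design. The lines below are a minimal sketch, using the R keras API, of how such an architecture could be assembled from those defaults; the embedding matrices (viral_embedding_matrix, host_embedding_matrix) are assumed inputs such as those produced by encodeViralSeq()/encodeHostSeq(), and the sketch is not guaranteed to reproduce the exact layer configuration inside modelTraining().

library(keras)

build_branch <- function(embedding_matrix, maxlen = 1000) {
  input <- layer_input(shape = c(maxlen))
  output <- input %>%
    layer_embedding(input_dim = nrow(embedding_matrix),   # amino acid tokens (+ padding row)
                    output_dim = 100,                      # GloVe dimension
                    weights = list(embedding_matrix),
                    trainable = FALSE) %>%
    layer_conv_1d(filters = 32, kernel_size = 16, activation = "relu") %>%
    layer_conv_1d(filters = 64, kernel_size = 7, activation = "relu") %>%
    layer_max_pooling_1d(pool_size = 30) %>%
    bidirectional(layer_lstm(units = 64))
  keras_model(input, output)
}

# Two weight-independent branches, one per protein in the pair;
# viral_embedding_matrix and host_embedding_matrix are assumed to be
# (vocabulary + 1) x 100 matrices built as in the encoding sketch above.
viral_branch <- build_branch(viral_embedding_matrix)
host_branch  <- build_branch(host_embedding_matrix)

# Merge the branches and finish with a small dense head
merged <- layer_concatenate(list(viral_branch$output, host_branch$output)) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

merge_model <- keras_model(
  inputs  = list(viral_branch$input, host_branch$input),
  outputs = merged
)

merge_model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = "AUC"
)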
This function plots model performance.
performancePlots(pred_label, y_label, tpath = tempdir())
pred_label | Predicted labels
y_label | Ground-truth labels
tpath | A character string indicating the path to the project directory. If the directory is missing, the PDF file is stored in the temp directory.
PDF file containing performance plots
pred_label <- seq(0, 1, length.out = 100)
truth_label <- rep(c(0, 1), each = 50)
perf <- performancePlots(pred_label, truth_label, tpath = tempdir())
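If a single numeric summary is wanted alongside the plots, the AUC can be cross-checked independently. The lines below are a small sketch using the pROC package (not a DeProViR dependency), reusing the toy labels from the example above.

library(pROC)

# Toy predictions and ground-truth labels, as in the example above
pred_label  <- seq(0, 1, length.out = 100)
truth_label <- rep(c(0, 1), each = 50)

# ROC curve and area under it
roc_obj <- roc(response = truth_label, predictor = pred_label)
auc(roc_obj)  # 1 for this perfectly separable toy example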
This function first constructs an embedding matrix from the viral or host protein sequences and then predicts scores for unknown interactions. Pairs with scores greater than 0.5 are more likely to interact.
predInteractions(url_path = "https://nlp.stanford.edu/data", Testingset, trainedModel)
url_path | URL path to the GloVe embedding. Defaults to "https://nlp.stanford.edu/data".
Testingset | A data.frame containing unknown interactions. For a demo, use the file in extdata/test_Set.
trainedModel | Pre-trained model stored in extdata/Pre_trainedModel, or the "$merge_model" component of the model returned by modelTraining().
Probability scores for unknown interactions
trainedModel <- loadPreTrainedModel()
# load test set (i.e., unknown interactions)
testing_set <- data.table::fread(
  system.file("extdata", "test_Set", "test_set_unknownInteraction.csv",
              package = "DeProViR")
)
# now predict interactions
options(timeout = 240)
pred_scores <- predInteractions(
  url_path = "https://nlp.stanford.edu/data",
  testing_set,
  trainedModel
)
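As noted above, scores greater than 0.5 point to likely interactions. The lines below are an illustrative follow-up, assuming pred_scores from the example above is a numeric vector (or one-column matrix) of probabilities aligned with the rows of testing_set.

# Attach the scores to the candidate pairs and keep the likely interactions
testing_set$score <- as.numeric(pred_scores)
likely_interactions <- testing_set[testing_set$score > 0.5, ]
head(likely_interactions)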