Title: Tokenizing Text of Gene Set Enrichment Analysis
Description: Functional enrichment analysis methods such as gene set enrichment analysis (GSEA) have been widely used for analyzing gene expression data. GSEA is a powerful method that infers results of gene expression data at the level of gene sets by calculating enrichment scores for predefined sets of genes. GSEA depends on the availability and accuracy of gene sets. Terms of gene sets or categories often overlap, because multiple terms may exist for a single biological process, and this can lead to redundancy within enriched terms. In other words, the sets of related terms are overlapping. Using deep learning, this package aims to predict enrichment scores for unique tokens or words from the text in gene set names, in order to resolve this overlapping-set issue. Furthermore, we can coin a new term by combining tokens and find its enrichment score by predicting the score of the combined tokens.
Authors: Dongmin Jung [cre, aut]
Maintainer: Dongmin Jung <[email protected]>
License: Artistic-2.0
Version: 1.15.0
Built: 2024-10-31 06:23:35 UTC
Source: https://github.com/bioc/ttgsea
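The functions documented below form a small workflow: tokenize gene set names, encode them as integer sequences, fit a model on enrichment scores, and predict scores for new or coined terms. As a quick orientation, here is a minimal sketch of that flow drawn from the examples further below (it assumes keras and Python are available; the argument values are only illustrative):

library(ttgsea)
library(fgsea)
data(examplePathways)
data(exampleRanks)
names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))
fgseaRes <- fgsea(examplePathways, exampleRanks)

# fit a bidirectional GRU on gene set names ("pathway") and normalized enrichment scores ("NES")
ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
                       model = bi_gru(num_tokens = 1000, embedding_dim = 50,
                                      length_seq = 30, num_units = 32),
                       num_tokens = 1000, length_seq = 30,
                       epochs = 1, batch_size = 32, use_generator = FALSE)

# predict the enrichment score of a new (possibly coined) term
predict_model(ttgseaRes, "Cell Cycle")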
A predefined function that builds a bidirectional GRU model for use in "ttgsea". This is a simple model, but you can define your own model instead. The loss function is "mean_squared_error", the optimizer is "adam", and Pearson correlation is used as a metric.
Usage: bi_gru(num_tokens, embedding_dim, length_seq, num_units)
Arguments:
num_tokens: maximum number of tokens
embedding_dim: a non-negative integer, the dimension of the dense embedding
length_seq: length of input sequences, the input length of "layer_embedding"
num_units: dimensionality of the output space of the GRU layer
Value: model
Author: Dongmin Jung
See also: keras::keras_model, keras::layer_input, keras::layer_embedding, keras::layer_gru, keras::bidirectional, keras::layer_dense, keras::compile
Examples:
library(reticulate)
if (keras::is_keras_available() & reticulate::py_available()) {
  num_tokens <- 1000
  length_seq <- 30
  embedding_dim <- 50
  num_units <- 32
  model <- bi_gru(num_tokens, embedding_dim, length_seq, num_units)

  # stacked gru
  num_units_1 <- 32
  num_units_2 <- 16
  stacked_gru <- function(num_tokens, embedding_dim, length_seq, num_units_1, num_units_2) {
    model <- keras::keras_model_sequential() %>%
      keras::layer_embedding(input_dim = num_tokens,
                             output_dim = embedding_dim,
                             input_length = length_seq,
                             mask_zero = TRUE) %>%
      keras::layer_gru(units = num_units_1, activation = "relu", return_sequences = TRUE) %>%
      keras::layer_gru(units = num_units_2, activation = "relu") %>%
      keras::layer_dense(1)
    model %>%
      keras::compile(loss = "mean_squared_error", optimizer = "adam",
                     metrics = custom_metric("pearson_correlation", metric_pearson_correlation))
  }
}
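Assuming keras is available and a model has been built as in the example above, the resulting architecture can be inspected with summary() or with the plot_model() function documented further below:

summary(model)     # layer-by-layer summary of the compiled model
plot_model(model)  # flowchart of layer names and input/output shapes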
A predefined function that builds a bidirectional LSTM model for use in "ttgsea". This is a simple model, but you can define your own model instead. The loss function is "mean_squared_error", the optimizer is "adam", and Pearson correlation is used as a metric.
Usage: bi_lstm(num_tokens, embedding_dim, length_seq, num_units)
Arguments:
num_tokens: maximum number of tokens
embedding_dim: a non-negative integer, the dimension of the dense embedding
length_seq: length of input sequences, the input length of "layer_embedding"
num_units: dimensionality of the output space of the LSTM layer
Value: model
Author: Dongmin Jung
See also: keras::keras_model, keras::layer_input, keras::layer_embedding, keras::layer_lstm, keras::bidirectional, keras::layer_dense, keras::compile
Examples:
library(reticulate)
if (keras::is_keras_available() & reticulate::py_available()) {
  num_tokens <- 1000
  length_seq <- 30
  embedding_dim <- 50
  num_units <- 32
  model <- bi_lstm(num_tokens, embedding_dim, length_seq, num_units)

  # stacked lstm
  num_units_1 <- 32
  num_units_2 <- 16
  stacked_lstm <- function(num_tokens, embedding_dim, length_seq, num_units_1, num_units_2) {
    model <- keras::keras_model_sequential() %>%
      keras::layer_embedding(input_dim = num_tokens,
                             output_dim = embedding_dim,
                             input_length = length_seq,
                             mask_zero = TRUE) %>%
      keras::layer_lstm(units = num_units_1, activation = "relu", return_sequences = TRUE) %>%
      keras::layer_lstm(units = num_units_2, activation = "relu") %>%
      keras::layer_dense(1)
    model %>%
      keras::compile(loss = "mean_squared_error", optimizer = "adam",
                     metrics = custom_metric("pearson_correlation", metric_pearson_correlation))
  }
}
From the result of GSEA, we can predict enrichment scores for unique tokens or words from the text in gene set names by using deep learning. The function "text_token" is used for tokenizing the text and the function "token_vector" is used for encoding. The encoded sequences are then fed to the embedding layer of the model.
Usage: fit_model(gseaRes, text, score, model, ngram_min = 1, ngram_max = 2, num_tokens, length_seq, epochs, batch_size, use_generator = TRUE, ...)
Arguments:
gseaRes: a table with the GSEA result, with rows for gene sets and columns for text and scores
text: column name for the text data
score: column name for the enrichment score
model: deep learning model; the input dimension and input length of its embedding layer must equal "num_tokens" and "length_seq", respectively
ngram_min: minimum size of an n-gram (default: 1)
ngram_max: maximum size of an n-gram (default: 2)
num_tokens: maximum number of tokens; it must be equal to the input dimension of "layer_embedding" in the "model"
length_seq: length of input sequences; it must be equal to the input length of "layer_embedding" in the "model"
epochs: number of epochs
batch_size: batch size
use_generator: if "use_generator" is TRUE, the function "sampling_generator" is used with "fit_generator". Otherwise, "fit" is used without a generator.
...: additional parameters for "fit" or "fit_generator"
Value:
model: trained model
tokens: information for tokens
token_pred: prediction for every token; each row has a token and its predicted score
token_gsea: list of the GSEA result only for the corresponding token
num_tokens: maximum number of tokens
length_seq: length of input sequences
Author: Dongmin Jung
See also: keras::fit_generator, keras::layer_embedding, keras::pad_sequences, textstem::lemmatize_strings, text2vec::create_vocabulary, text2vec::prune_vocabulary
Examples:
library(reticulate)
if (keras::is_keras_available() & reticulate::py_available()) {
  library(fgsea)
  data(examplePathways)
  data(exampleRanks)
  names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))
  set.seed(1)
  fgseaRes <- fgsea(examplePathways, exampleRanks)

  num_tokens <- 1000
  length_seq <- 30
  batch_size <- 32
  embedding_dims <- 50
  num_units <- 32
  epochs <- 1

  ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
                         model = bi_gru(num_tokens, embedding_dims, length_seq, num_units),
                         num_tokens = num_tokens, length_seq = length_seq,
                         epochs = epochs, batch_size = batch_size,
                         use_generator = FALSE)
}
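Assuming, as the Value section above suggests, that the fitted result is a list with the documented components, the token-level predictions can be inspected directly (illustrative usage only):

head(ttgseaRes$token_pred)  # each row: a token and its predicted enrichment score
ttgseaRes$tokens            # token information used for the encoding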
The Pearson correlation coefficient can be seen as one of the model performance metrics. It measures how close the predicted values are to the true values. A value close to 1 indicates a good fit, while a value close to 0 indicates a poor one; a value of 0 corresponds to a random prediction.
Author: Dongmin Jung
See also: keras::k_mean, keras::k_sum, keras::k_square, keras::k_sqrt
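For reference, a Pearson correlation metric of this kind could be written with the keras backend functions listed above. The following is only a minimal sketch under that assumption, not necessarily the package's exact implementation:

pearson_correlation_sketch <- keras::custom_metric("pearson_correlation",
  function(y_true, y_pred) {
    # center predictions and targets
    xm <- y_true - keras::k_mean(y_true)
    ym <- y_pred - keras::k_mean(y_pred)
    # covariance divided by the product of standard deviations
    num <- keras::k_sum(xm * ym)
    den <- keras::k_sqrt(keras::k_sum(keras::k_square(xm))) *
           keras::k_sqrt(keras::k_sum(keras::k_square(ym)))
    num / (den + keras::k_epsilon())
  })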
Examples:
library(reticulate)
if (keras::is_keras_available() & reticulate::py_available()) {
  num_tokens <- 1000
  length_seq <- 30
  embedding_dims <- 50
  num_units_1 <- 32
  num_units_2 <- 16
  stacked_gru <- function(num_tokens, embedding_dims, length_seq, num_units_1, num_units_2) {
    model <- keras::keras_model_sequential() %>%
      keras::layer_embedding(input_dim = num_tokens,
                             output_dim = embedding_dims,
                             input_length = length_seq) %>%
      keras::layer_gru(units = num_units_1, activation = "relu", return_sequences = TRUE) %>%
      keras::layer_gru(units = num_units_2, activation = "relu") %>%
      keras::layer_dense(1)
    model %>%
      keras::compile(loss = "mean_squared_error", optimizer = "adam",
                     metrics = custom_metric("pearson_correlation", metric_pearson_correlation))
  }
}
You can create a visualization of your model architecture. The resulting flowchart displays the name, input shape, and output shape of each layer in the model.
Usage: plot_model(x)
Arguments:
x: deep learning model
Value: plot of the model architecture
Author: Dongmin Jung
See also: purrr::map, purrr::map_chr, purrr::pluck, purrr::imap_dfr, DiagrammeR::grViz
Examples:
library(reticulate)
if (keras::is_keras_available() & reticulate::py_available()) {
  inputs1 <- layer_input(shape = c(1000))
  inputs2 <- layer_input(shape = c(1000))

  predictions1 <- inputs1 %>%
    layer_dense(units = 128, activation = 'relu') %>%
    layer_dense(units = 64, activation = 'relu') %>%
    layer_dense(units = 32, activation = 'softmax')

  predictions2 <- inputs2 %>%
    layer_dense(units = 128, activation = 'relu') %>%
    layer_dense(units = 64, activation = 'relu') %>%
    layer_dense(units = 32, activation = 'softmax')

  combined <- layer_concatenate(c(predictions1, predictions2)) %>%
    layer_dense(units = 16, activation = 'softmax')

  model <- keras_model(inputs = c(inputs1, inputs2), outputs = combined)
  plot_model(model)
}
From the result of the function "fit_model", we can predict enrichment scores. For each new term, the lemmatized text, predicted enrichment score, Monte Carlo p-value, and adjusted p-value are provided. The function "token_vector" is used for encoding, as was done for training; of course, the mapping from tokens to integers must be the same.
Usage: predict_model(object, new_text, num_simulations = 1000, adj_p_method = "fdr")
Arguments:
object: result of "fit_model"
new_text: new text data
num_simulations: number of simulations for the Monte Carlo p-value (default: 1000)
adj_p_method: p-value correction method (default: "fdr")
Value: table with the lemmatized text, predicted enrichment score, MC p-value, and adjusted p-value
Author: Dongmin Jung
See also: stats::p.adjust
Examples:
library(reticulate)
if (keras::is_keras_available() & reticulate::py_available()) {
  library(fgsea)
  data(examplePathways)
  data(exampleRanks)
  names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))
  set.seed(1)
  fgseaRes <- fgsea(examplePathways, exampleRanks)

  num_tokens <- 1000
  length_seq <- 30
  batch_size <- 32
  embedding_dims <- 50
  num_units <- 32
  epochs <- 1

  ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
                         model = bi_gru(num_tokens, embedding_dims, length_seq, num_units),
                         num_tokens = num_tokens, length_seq = length_seq,
                         epochs = epochs, batch_size = batch_size,
                         use_generator = FALSE)

  set.seed(1)
  predict_model(ttgseaRes, "Cell Cycle")
}
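In general, a Monte Carlo p-value is the fraction of simulated null scores at least as extreme as the observed one, usually with a +1 correction in the numerator and denominator. The generic sketch below only illustrates that idea; it is not the package's exact null scheme:

# generic Monte Carlo p-value from an observed score and simulated null scores
mc_pvalue <- function(observed, simulated) {
  (1 + sum(abs(simulated) >= abs(observed))) / (1 + length(simulated))
}
mc_pvalue(2.1, rnorm(1000))  # toy example with a hypothetical observed score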
This is a generator function that yields batches of training data; the generator is then passed to the "fit_generator" function.
Usage: sampling_generator(X_data, Y_data, batch_size)
Arguments:
X_data: inputs
Y_data: targets
batch_size: batch size
Value: generator for "fit_generator"
Author: Dongmin Jung
Examples:
X_data <- matrix(rnorm(200), ncol = 2)
Y_data <- matrix(rnorm(100), ncol = 1)
sampling_generator(X_data, Y_data, 32)
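For intuition, a batch-sampling generator of this kind is typically a closure returning a function that yields one random batch per call. The sketch below shows one possible shape under that assumption; it is not necessarily identical to the package's implementation:

sampling_generator_sketch <- function(X_data, Y_data, batch_size) {
  function() {
    # draw a random batch of rows on every call
    rows <- sample(seq_len(nrow(X_data)), batch_size, replace = TRUE)
    list(X_data[rows, , drop = FALSE], Y_data[rows, , drop = FALSE])
  }
}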
An n-gram is used for tokenization. This function can also be used to limit the total number of tokens.
Usage: text_token(text, ngram_min = 1, ngram_max = 1, num_tokens)
Arguments:
text: text data
ngram_min: minimum size of an n-gram (default: 1)
ngram_max: maximum size of an n-gram (default: 1)
num_tokens: maximum number of tokens
Value:
token: result of tokenizing text
ngram_min: minimum size of an n-gram
ngram_max: maximum size of an n-gram
Author: Dongmin Jung
See also: tm::removeWords, stopwords::stopwords, textstem::lemmatize_strings, text2vec::create_vocabulary, text2vec::prune_vocabulary
Examples:
library(fgsea)
data(examplePathways)
data(exampleRanks)
names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))
set.seed(1)
fgseaRes <- fgsea(examplePathways, exampleRanks)
tokens <- text_token(data.frame(fgseaRes)[,"pathway"], num_tokens = 1000)
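To make the n-gram arguments concrete, the following shows which tokens a 1- to 2-gram setting yields for a single gene set name. It uses tokenizers::tokenize_ngrams purely for illustration; text_token additionally applies steps such as stop word removal and lemmatization (see its "See also" entries), so its output can differ:

# unigrams and bigrams for one gene set name:
# "cell", "cycle", "mitotic", "cell cycle", "cycle mitotic"
tokenizers::tokenize_ngrams("Cell Cycle Mitotic", n = 2, n_min = 1)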
A vectorization of words or tokens of text is necessary for machine learning. Vectorized sequences are padded or truncated.
Usage: token_vector(text, token, length_seq)
Arguments:
text: text data
token: result of tokenization (output of "text_token")
length_seq: length of input sequences
Value: sequences of integers
Author: Dongmin Jung
See also: tm::removeWords, stopwords::stopwords, textstem::lemmatize_strings, tokenizers::tokenize_ngrams, keras::pad_sequences
Examples:
library(reticulate)
if (keras::is_keras_available() & reticulate::py_available()) {
  library(fgsea)
  data(examplePathways)
  data(exampleRanks)
  names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))
  set.seed(1)
  fgseaRes <- fgsea(examplePathways, exampleRanks)
  tokens <- text_token(data.frame(fgseaRes)[,"pathway"], num_tokens = 1000)
  sequences <- token_vector("Cell Cycle", tokens, 10)
}
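Padding and truncation are handled by keras::pad_sequences. The short illustration below, with made-up integer encodings, shows how a sequence shorter than the target length is zero-padded and a longer one is truncated (both at the front by default):

keras::pad_sequences(list(c(5, 12)), maxlen = 10)  # zero-padded at the front to length 10
keras::pad_sequences(list(1:15), maxlen = 10)      # truncated to the last 10 values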