Title: | Inference of gene regulatory networks from gene expression data |
---|---|
Description: | Reconstructing gene regulatory networks and transcription factor activity is crucial to understanding biological processes and holds potential for developing personalized treatments. Yet it remains an open problem, as state-of-the-art algorithms often cannot handle large amounts of data. Furthermore, many current methods predict numerous false positives and cannot integrate other sources of information, such as previously known interactions. Here we introduce KBoost, an algorithm that uses kernel PCA regression, boosting and Bayesian model averaging for fast and accurate reconstruction of gene regulatory networks. KBoost can also use a prior network built on previously known transcription factor targets. We have benchmarked KBoost on three different datasets against other high-performing algorithms. The results show that our method compares favourably to other methods across datasets. |
Authors: | Luis F. Iglesias-Martinez [aut, cre] , Barbara de Kegel [aut], Walter Kolch [aut] |
Maintainer: | Luis F. Iglesias-Martinez <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.15.0 |
Built: | 2024-11-29 06:09:51 UTC |
Source: | https://github.com/bioc/KBoost |
Function to add names to network for the user.
add_names(grn, gen_names)
grn |
a GRN object from KBoost. |
gen_names |
a vector with the gene names. |
grn: a GRN object whose elements carry the user-defined gene names.
data(D4_multi_1)
Net = kboost(D4_multi_1)
g_names = matrix("G",100,1)
for (i in seq_along(g_names)){
  g_names[i] = paste(g_names[i],toString(i), sep = "")
}
Net = add_names(Net,g_names)
Function to calculate the AUROC and AUPR of a known network. This function was made to test the R implementation of the KBoost Package.
AUPR_AUROC_matrix(Net, G_mat, auto_remove, TFs, upper_limit)
Net |
An inferred gene regulatory network |
G_mat |
A matrix with the gold standard network. |
auto_remove |
TRUE if the auto-regulation is to be discarded. |
TFs |
the indexes of the rows of Net that are TFs. |
upper_limit |
Top number of edges to use. |
list object with AUPR and AUROC of gold standard in matrix format.
data(D4_multi_1)
Net = kboost(D4_multi_1)
g_mat1 = tab_2_matrix_D4(KBoost::G_D4_multi_1,100)
aupr_auroc = AUPR_AUROC_matrix(Net$GRN,g_mat1,auto_remove = TRUE, seq_len(100))
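To make the two metrics concrete, the following base R sketch ranks candidate edges by score and integrates the ROC and precision-recall curves. It is an illustration only, not the package's implementation: `auc_sketch` is a hypothetical helper, ties between scores are not treated specially, and AUPR is approximated by step-wise average precision.

```r
# Hypothetical helper (not part of KBoost): AUROC and AUPR of a score matrix
# against a 0/1 gold standard matrix of the same shape.
auc_sketch <- function(scores, gold, auto_remove = TRUE) {
  keep <- matrix(TRUE, nrow(scores), ncol(scores))
  if (auto_remove) diag(keep) <- FALSE     # drop self-regulation entries
  s <- scores[keep]
  y <- gold[keep]
  y <- y[order(s, decreasing = TRUE)]      # rank edges by decreasing score
  tp <- cumsum(y == 1)
  fp <- cumsum(y == 0)
  tpr <- tp / sum(y == 1)
  fpr <- fp / sum(y == 0)
  prec <- tp / (tp + fp)
  # Trapezoidal rule for the ROC curve, starting from (0, 0)
  auroc <- sum(diff(c(0, fpr)) * (tpr + c(0, head(tpr, -1))) / 2)
  # Average precision as a step-wise approximation of the area under the PR curve
  aupr <- sum(diff(c(0, tpr)) * prec)
  list(auroc = auroc, aupr = aupr)
}

# Toy gold standard and a score matrix that ranks every true edge first
gold <- matrix(c(0, 1, 0, 0, 0, 1, 1, 0, 0), 3, 3)
scores <- gold * 0.9 + 0.05
res <- auc_sketch(scores, gold)
```

With a ranking that places all true edges first, both values should equal 1; a random ranking gives an AUROC near 0.5.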
Function to obtain the AUPR and AUROC in the DREAM4 Multifactorial Challenge.
d4_mfac(v, g, ite, write_res)
v |
a number between 0 and 1 that is the shrinkage parameter |
g |
a number larger than 0, width parameter for the RBF Kernel |
ite |
an integer with number of iterations. |
write_res |
a logical to indicate if the tables should be written. |
list with auroc and auprs of the DREAM4 multifactorial challenge.
res = d4_mfac()
Each column is a gene and each row is a simulated experiment.
D4_multi_1
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(D4_multi_1)
Each column is a gene and each row is a simulated experiment.
D4_multi_2
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(D4_multi_2)
Each column is a gene and each row is a simulated experiment.
D4_multi_3
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(D4_multi_3)
Each column is a gene and each row is a simulated experiment.
D4_multi_4
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(D4_multi_4)
Each column is a gene and each row is a simulated experiment.
D4_multi_5
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(D4_multi_5)
Gold standard network in table format. The first column is the TF, the second column is the gene, and the third indicates whether there is an interaction.
G_D4_multi_1
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(G_D4_multi_1)
Gold standard network in table format. The first column is the TF, the second column is the gene, and the third indicates whether there is an interaction.
G_D4_multi_2
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(G_D4_multi_2)
Gold standard network in table format. The first column is the TF, the second column is the gene, and the third indicates whether there is an interaction.
G_D4_multi_3
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(G_D4_multi_3)
Gold standard network in table format. The first column is the TF, the second column is the gene, and the third indicates whether there is an interaction.
G_D4_multi_4
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(G_D4_multi_4)
Gold standard network in table format. The first column is the TF, the second column is the gene, and the third indicates whether there is an interaction.
G_D4_multi_5
matrix
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed
data(G_D4_multi_5)
A gene regulatory network inferred from the ENCODE ChIP-Seq dataset.
Table with two columns: the first column is a transcription factor and the second is a gene.
Gerstein_Prior_ENET_2
matrix
Gerstein, M.B., et al. Architecture of the human regulatory network derived from ENCODE data. Nature 2012;489(7414):91-100.
data(Gerstein_Prior_ENET_2)
Function to build a prior using a previously built Network on ChIP-Seq.
get_prior_Gerstein(gen_names, TFs, pos_weight, neg_weight)
gen_names |
the gene names in Symbol nomenclature. |
TFs |
the indexes of gene names which are TFs. |
pos_weight |
the prior weight for edges previously found in Gerstein et al. (2012). |
neg_weight |
the prior weight for edges not found in Gerstein et al. (2012). |
a matrix with the prior probabilities of the TF-target edges.
gen_names = c("TP53","MDM2","FOXM1","ESR1","CTCF","YY1")
tfs = get_tfs_human(gen_names)
prior = get_prior_Gerstein(gen_names,tfs,0.6,0.4)
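The idea of turning a table of known TF-gene edges into a prior matrix can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: `prior_sketch` and the small `edges` table are hypothetical, and the genes-in-rows, TFs-in-columns orientation is assumed from the GxK convention used elsewhere in this manual.

```r
# Hypothetical helper (not part of KBoost): build a G x K prior matrix from a
# two-column edge table (TF, gene), like the layout described for
# Gerstein_Prior_ENET_2.
prior_sketch <- function(gen_names, tfs, edges, pos_weight, neg_weight) {
  # Every TF-gene pair starts with the weight for previously unseen edges
  prior <- matrix(neg_weight, length(gen_names), length(tfs),
                  dimnames = list(gen_names, gen_names[tfs]))
  # Keep only edges whose TF and gene both occur in this dataset
  known <- edges[, 1] %in% gen_names[tfs] & edges[, 2] %in% gen_names
  # Character matrix indexing: (gene, TF) pairs receive the positive weight
  prior[cbind(edges[known, 2], edges[known, 1])] <- pos_weight
  prior
}

# Made-up edge table, for illustration only
edges <- rbind(c("TP53", "MDM2"), c("YY1", "TP53"), c("SP1", "MDM2"))
gen_names <- c("TP53", "MDM2", "YY1")
tfs <- c(1, 3)   # indexes of gen_names assumed to be TFs
prior <- prior_sketch(gen_names, tfs, edges, pos_weight = 0.6, neg_weight = 0.4)
```

Edges whose TF is absent from `gen_names` (here the made-up `SP1` row) are ignored.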
Function to automatically assign Human TFs given a list of Symbols.
get_tfs_human(gen_names)
gen_names |
a vector or matrix with the Symbol Gene Names of the system. |
indexes of gen_names that are TFs.
gen_names = c("TP53","MDM2","FOXM1","ESR1","CTCF")
tfs = get_tfs_human(gen_names)
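Conceptually, the lookup amounts to matching the supplied symbols against a reference list of TF symbols. The sketch below uses a made-up `tf_symbols` vector for illustration; it does not reproduce the package's reference table or its exact matching rules.

```r
# Hypothetical reference list of TF symbols (illustration only)
tf_symbols <- c("TP53", "FOXM1", "ESR1", "CTCF")
gen_names <- c("TP53", "MDM2", "FOXM1", "ESR1", "CTCF")

# Indexes of gen_names that appear in the reference list
tfs <- which(gen_names %in% tf_symbols)
```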
Function to perform a grid search and find the hyperparameters.
grid_search_kboost(dataset, vs, gs, ite)
dataset |
1 for IRMA or 2 for DREAM4 multifactorial. |
vs |
The range of values of v. All values need to be between 0 and 1. |
gs |
The range of values of g. All values need to be larger than 0. |
ite |
An integer that is the number of iterations, fixed in this case. |
list with auprs and aurocs of different values of vs and gs and ite.
res = grid_search_kboost(1,c(0.1,0.5,1),c(1,10,60,100),3)
Table with three columns corresponding to Symbol
Human_TFs
matrix
Lambert, S.A., et al. The Human Transcription Factors. Cell 2018;172(4):650-665.
data(Human_TFs)
Function to produce the AUPR and AUROC Results on the IRMA datasets.
irma_check(v, g, ite)
v |
a number between 0 and 1 that is the shrinkage parameter |
g |
a number larger than 0 that is the width parameter for the RBF Kernel |
ite |
an integer with number of iterations. |
list with aurocs and auprs for IRMA datasets
res = irma_check()
Matrix where the rows are genes and the columns are transcription factors.
IRMA_Gold
matrix
https://www.sciencedirect.com/science/article/pii/S0092867409001561
Cantone, I., et al. A Yeast Synthetic Network for In Vivo Assessment of Reverse-Engineering and Modeling Approaches. Cell 2009;137(1):172-181.
data(IRMA_Gold)
Matrix where the rows are experiments and columns are genes for the IRMA Off dataset.
irma_off
matrix
https://www.sciencedirect.com/science/article/pii/S0092867409001561
Cantone, I., et al. A Yeast Synthetic Network for In Vivo Assessment of Reverse-Engineering and Modeling Approaches. Cell 2009;137(1):172-181.
data(irma_off)
Matrix where the rows are experiments and columns are genes for the IRMA On dataset.
irma_on
matrix
https://www.sciencedirect.com/science/article/pii/S0092867409001561
Cantone, I., et al. A Yeast Synthetic Network for In Vivo Assessment of Reverse-Engineering and Modeling Approaches. Cell 2009;137(1):172-181.
data(irma_on)
A function to run KBoost.
kboost(X, TFs, g, v, prior_weights, ite)
X |
an NxG matrix with the expression values of G genes and N observations. |
TFs |
a Kx1 numeric matrix with integers of columns of X that are TFs. |
g |
a positive number, the width parameter for the RBF kernel. (default g = 40) |
v |
a number between 0 and 1, the shrinkage parameter. (default v = 0.1) |
prior_weights |
the prior weights; either a scalar or a GxK matrix. (default 0.5) |
ite |
an integer for the maximum number of iterations (default 3) |
a list with the results for kboost, with fields:
GRN: a matrix with the posterior edge probability after network refinement.
GRN_UP: a matrix with the posterior edge probability before refinement.
model: a matrix with logical values for the TFs selected for each gene.
g: the width parameter for the RBF kernel.
v: the shrinkage parameter.
prior: the prior of each model.
TFs: a matrix with integers of each gene that is a TF.
prior_weights: the prior_weights with which KBoost was run.
run_time: a scalar with the running time.
data(D4_multi_1)
Net <- kboost(D4_multi_1)
Function for KBoost on data from a human sample annotated with Symbol names.
KBoost_human_symbol(X, gen_names, g, v, ite, pos_weight, neg_weight)
X |
an NxG matrix with the expression values of G genes and N samples. |
gen_names |
SYMBOL gene names corresponding to the columns of X. |
g |
a positive number, the width parameter for the RBF kernel. (default g = 40) |
v |
a double between 0 and 1, the shrinkage parameter. (default v =0.1) |
ite |
an integer with the number of iterations (default ite = 3) |
pos_weight |
a number between 0 and 1, the prior that a TF regulates a gene. |
neg_weight |
a number between 0 and 1, the prior for TF-gene pairs not seen before. |
list with results of KBoost on a dataset with Symbol gene names.
X = rnorm(50,0,1)
X = matrix(X,10,5)
gen_names = c("TP53","YY1","CTCF","MDM2","ESR1")
grn = KBoost_human_symbol(X,gen_names,pos_weight = 0.6, neg_weight =0.4)
A function to perform feature normalization in kernel space.
kernel_normal(K)
K |
an NxN numeric matrix with the kernel function with N observations. |
feature centred kernel.
x = rnorm(100,0,1)
k = RBF_K(x,40)
k_ = kernel_normal(k)
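Centring in feature space is commonly written as K - 1K - K1 + 1K1, where 1 is the NxN matrix with every entry equal to 1/N. The base R sketch below implements that textbook formula; `center_kernel` is a hypothetical name, and whether `kernel_normal` applies any further scaling is not stated here.

```r
# Hypothetical helper (not part of KBoost): textbook feature-space centring
center_kernel <- function(K) {
  n <- nrow(K)
  one_n <- matrix(1 / n, n, n)
  K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n
}

# With a linear kernel the result can be checked directly: it must equal
# the kernel of the column-centred data.
set.seed(1)
x <- matrix(rnorm(20), 10, 2)
K <- tcrossprod(x)
Kc <- center_kernel(K)
Kc_direct <- tcrossprod(scale(x, center = TRUE, scale = FALSE))
```

After centring, every row and column of the kernel matrix sums to zero up to floating-point error.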
Function to perform Kernel Principal Component Boosting
kernel_pc_boosting(X, Y, g, v, ite, thr)
X |
A matrix with the explanatory variables. |
Y |
a matrix with the variable to predict. |
g |
a positive number with the width parameter for the RBF Kernel. |
v |
a number between 0 and 1 that corresponds to the shrinkage parameter. |
ite |
an integer with the number of iterations. |
thr |
a threshold to discard Kernel principal components whose eigenvalue |
function and sum of squared errors.
data(D4_multi_1)
Y = scale(matrix(D4_multi_1[,91],100,1))
X = scale(D4_multi_1[,-91])
res = kernel_pc_boosting(X,Y, g= 40, v = 0.5, ite = 3, thr = 1e-10)
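The role of the shrinkage parameter v and the iteration count ite can be illustrated with generic component-wise L2 boosting. The sketch below is not KBoost's algorithm: `boost_sketch` is a hypothetical helper that, at each iteration, fits the current residual on every candidate column, picks the best one, and adds only a fraction v of that fit.

```r
# Hypothetical helper (not part of KBoost): component-wise L2 boosting
boost_sketch <- function(P, y, v, ite) {
  fit <- rep(0, length(y))
  selected <- integer(0)
  for (i in seq_len(ite)) {
    r <- y - fit
    # Least-squares coefficient of the residual on each column separately
    b <- as.vector(crossprod(P, r)) / colSums(P^2)
    sse <- vapply(seq_len(ncol(P)),
                  function(j) sum((r - P[, j] * b[j])^2), numeric(1))
    j <- which.min(sse)                  # best single column this round
    fit <- fit + v * P[, j] * b[j]       # shrunken update
    selected <- c(selected, j)
  }
  list(fit = fit, selected = selected, sse = sum((y - fit)^2))
}

# Toy data: y depends mostly on the first of four orthonormal columns
set.seed(1)
P <- qr.Q(qr(matrix(rnorm(200), 50, 4)))
y <- 2 * P[, 1] + rnorm(50, sd = 0.1)
res_boost <- boost_sketch(P, y, v = 0.5, ite = 3)
```

Smaller values of v make each update more conservative, so more iterations are needed to reach the same fit.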
Function to calculate the principal components of a kernel.
KPC(K, thr)
K |
an NxN numeric matrix with the Kernel matrix. |
thr |
a positive scalar which is a threshold to discard eigen-vectors based on eigen-values. |
the kernel principal components
x = rnorm(100,0,1)
k = RBF_K(x,1)
k_ = kernel_normal(k)
kpca = KPC(k,1e-8)
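Kernel principal components are usually obtained from the eigendecomposition of the centred kernel matrix, discarding directions whose eigenvalue falls below a threshold. The base R sketch below follows that standard recipe; `kpc_sketch` is a hypothetical name, and scaling the eigenvectors by the square root of their eigenvalues is an assumption about the convention, not a statement about `KPC`.

```r
# Hypothetical helper (not part of KBoost): principal components of a kernel
kpc_sketch <- function(K, thr) {
  e <- eigen(K, symmetric = TRUE)
  keep <- e$values > thr                 # discard near-zero eigenvalues
  # Component scores: eigenvectors scaled by sqrt(eigenvalue)
  sweep(e$vectors[, keep, drop = FALSE], 2, sqrt(e$values[keep]), "*")
}

# A centred linear kernel of two-dimensional data has rank 2
set.seed(2)
x <- scale(matrix(rnorm(40), 20, 2), center = TRUE, scale = FALSE)
K <- tcrossprod(x)
pcs <- kpc_sketch(K, thr = 1e-8)
```

With this scaling, the retained components reproduce the kernel matrix: `tcrossprod(pcs)` matches `K` up to numerical error.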
Function to calculate the distance between nodes.
net_dist_bin(GRN, TFs, thr)
GRN |
An inferred network with the predictive probabilities that a transcription factor regulates a gene. |
TFs |
A vector with indexes of the rows of GRN which correspond to TFs. |
thr |
A scalar between 0 and 1 that is used to select the edges with large posterior probabilities. |
a matrix with the distances between edges.
data(D4_multi_1)
Net = kboost(D4_multi_1)
dist = net_dist_bin(Net$GRN,Net$TFs,0.1)
Function to do a heuristic post-processing that improves accuracy. Each column is multiplied by its variance.
net_refine(Net)
Net |
a GRN with TFs in the columns. |
the network refined with the Slawek and Arodz heuristic.
Net = rbeta(10000,1,2)
Net = matrix(Net,100,100)
net_ref = net_refine(Net)
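The description above, multiplying each column by its variance, can be written in one line of base R. The sketch below is only an illustration of that sentence; `refine_sketch` is a hypothetical name and the package function may include additional steps.

```r
# Hypothetical helper (not part of KBoost): scale each column by its variance
refine_sketch <- function(Net) {
  sweep(Net, 2, apply(Net, 2, var), "*")
}

set.seed(4)
Net_toy <- matrix(runif(20), 5, 4)
ref_toy <- refine_sketch(Net_toy)
```

Columns (TFs) whose scores vary strongly across genes are up-weighted relative to columns with nearly uniform scores.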
Function to summarize the GRN filtered with a threshold.
net_summary_bin(GRN, TFs, thr, a, b)
GRN |
An inferred network |
TFs |
A vector with indexes of the rows of GRN which correspond to TFs. |
thr |
a scalar between 0 and 1, a threshold for posterior probabilities. |
a |
parameter for Katz and PageRank centrality (default: the inverse of the largest eigenvalue of GRN). |
b |
parameter for Katz and PageRank centrality (default b = 1). |
list with table version of the GRN, outdegree and indegree, and closeness centrality.
data(D4_multi_1)
Net = kboost(D4_multi_1)
Net_Summary = net_summary_bin(Net$GRN)
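The simplest parts of such a summary, the edge table and the degree counts, follow directly from thresholding the probability matrix. The base R sketch below assumes genes in rows and TFs in columns, which is an assumption about orientation; the toy matrix and variable names are hypothetical, and the centrality measures are not covered.

```r
# Toy posterior matrix: rows are target genes, columns are TFs (assumed)
GRN_toy <- matrix(c(0,    0.8, 0.05,
                    0.6,  0,   0.3,
                    0.02, 0.9, 0), 3, 3)
thr <- 0.1

adj <- GRN_toy > thr                       # binary network after filtering
edge_table <- which(adj, arr.ind = TRUE)   # (gene, TF) index pairs
outdegree <- colSums(adj)                  # targets regulated by each TF
indegree <- rowSums(adj)                   # regulators of each gene
```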
Function to calculate the RBF Kernel of a matrix X with width g.
RBF_K(x, g)
x |
an Nx1 numeric matrix with N observations. |
g |
a positive scalar with the width parameter. |
the matrix with the RBF kernel
x = rnorm(100,0,1)
k = RBF_K(x,40)
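An RBF kernel maps the squared distance between two observations to a similarity between 0 and 1. The sketch below uses the parameterization exp(-d^2 / (2 g^2)), which is an assumption: this manual does not state how `RBF_K` uses the width g, and `rbf_sketch` is a hypothetical name.

```r
# Hypothetical helper (not part of KBoost): RBF kernel of a numeric vector.
# Assumed parameterization: exp(-d^2 / (2 * g^2)).
rbf_sketch <- function(x, g) {
  d2 <- as.matrix(dist(as.matrix(x)))^2    # pairwise squared distances
  unname(exp(-d2 / (2 * g^2)))
}

set.seed(5)
x_toy <- rnorm(10, 0, 1)
K_toy <- rbf_sketch(x_toy, g = 40)
```

Larger values of g flatten the kernel, so that all observations look similar; smaller values make it more local.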
Function to produce the gold standard of the DREAM4 Multifactorial Challenge in matrix format.
tab_2_matrix_D4(g_table, G)
g_table |
the network in table format. The first column is the TF, the second column is the gene, and the third indicates whether there is an interaction. |
G |
the number of genes. |
a network in table format transformed into a matrix.
g_table = KBoost::G_D4_multi_1
g_mat = tab_2_matrix_D4(g_table,100)
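The conversion from a three-column edge table to a matrix can be sketched in base R as follows. The sketch assumes that the first two columns hold integer gene indexes and that the result has genes in rows and TFs in columns; the packaged tables may encode genes differently, and `tab_to_matrix_sketch` is a hypothetical name.

```r
# Hypothetical helper (not part of KBoost): edge table -> G x G matrix
tab_to_matrix_sketch <- function(g_table, G) {
  g_mat <- matrix(0, G, G)
  # Matrix indexing with (gene, TF) pairs; the third column is the edge flag
  g_mat[cbind(g_table[, 2], g_table[, 1])] <- g_table[, 3]
  g_mat
}

# Made-up table: TF index, gene index, interaction (1) or not (0)
toy_table <- rbind(c(1, 2, 1),
                   c(1, 3, 0),
                   c(3, 1, 1))
toy_mat <- tab_to_matrix_sketch(toy_table, 3)
```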
Function to write output in DREAM4 Challenge Format.
write_GRN_D4(GRN, TFs, filename)
GRN |
a GxK gene regulatory network. |
TFs |
a set of K indexes of the G genes that are TFs. |
filename |
a string with the filename. |
a file with the network written in DREAM4 Challenge format.
data(D4_multi_1)
Net = kboost(D4_multi_1)
write_GRN_D4(Net$GRN, seq_len(100), "D4_multi_1_network.txt")
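A ranked edge list of this kind can be produced with base R alone. The sketch below assumes a layout of three tab-separated columns (regulator, target, confidence) sorted by decreasing confidence, with genes labelled G1, G2, and so on; both the layout and the labels are assumptions for illustration, and the exact file written by `write_GRN_D4` may differ.

```r
# Toy network: rows are target genes, columns are TFs (assumed orientation)
grn_toy <- matrix(c(0, 0.9, 0.2,
                    0.4, 0, 0.7,
                    0.1, 0.3, 0), 3, 3)
tfs_toy <- seq_len(3)                      # column k corresponds to gene tfs_toy[k]

# One row per (gene, TF) pair, with its confidence score
edges <- expand.grid(gene = seq_len(nrow(grn_toy)),
                     tf_col = seq_len(ncol(grn_toy)))
edges$score <- grn_toy[cbind(edges$gene, edges$tf_col)]
edges$tf <- tfs_toy[edges$tf_col]
edges <- edges[edges$gene != edges$tf, ]   # drop self-regulation
edges <- edges[order(edges$score, decreasing = TRUE), ]

out <- data.frame(tf = paste0("G", edges$tf),
                  gene = paste0("G", edges$gene),
                  score = edges$score)
out_file <- tempfile(fileext = ".txt")
write.table(out, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
back <- read.table(out_file, sep = "\t")
```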