There are various ways to detect modules from weighted networks. Conventional approaches such as clustering or graph partitioning purely use the network topology to define modules (Fortunato 2010). But we may need additional information such as increasing high-throughput omics data. On the one hand, the construction of reliable networks, especially for specific tissues, is relatively slow. On the other hand, integrating omics data with network topology has become a paradigm in system biology community in the past decade (Mitra et al. 2013). Weighted gene co-expression network (WGCN) is a pure data-driven gene network, which only relies on gene expression profiles. There is no rigorous definition of active modules in WGCN, but the module itself should be more compact and informative compared with random subnetworks. AMOUNTAIN (D. Li et al. 2016) provides a convex optimization based approach to identify such modules. Here we embed parts of the examples from the corresponding package AMOUNTAIN help pages into a single document.
We follow (W. Li et al. 2011) to construct gene co-expression networks for simulation study. Let n be the number of genes, and edge weights W as well as node score z follow the uniform distribution in range [0, 1]. A module contains k genes inside which the edge weights as well as node score follow the uniform distribution in range [θ, 1], where θ = {0.5, 0.6, 0.7, 0.8, 0.9}.
library(AMOUNTAIN)
n = 100
k = 20
theta = 0.5
pp <- networkSimulation(n, k, theta)
moduleid <- pp[[3]]
netid <- 1:100
restp <- netid[-moduleid]
groupdesign <- list(moduleid,restp)
names(groupdesign) <- c('module','background')
The following figure shows the weighted co-expression network when n = 100, k = 20 and red nodes indicate module members and wider edges mean larger similarities. Visualization is based on qgraph.
## Loading required package: qgraph
When simulating a two-layer network, the basic method is to connect two independent networks with an inter-layer weight matrix A, which is designed to have larger weights between two modules.
n1 = 100
k1 = 20
theta1 = 0.5
n2 = 80
k2 = 10
theta2 = 0.5
ppresult <- twolayernetworkSimulation(n1,k1,theta1,n2,k2,theta2)
A <- ppresult[[3]]
pp <- ppresult[[1]]
moduleid <- pp[[3]]
netid <- 1:n1
restp <- netid[-moduleid]
pp2 <- ppresult[[2]]
moduleid2 <- pp2[[3]]
netid2 <- 1:n2
restp2 <- netid2[-moduleid2]
library(qgraph)
## labelling the groups
groupdesign <- list(moduleid,restp,(moduleid2+n1),(restp2+n1))
names(groupdesign) <- c('module1','background1','module2',
'background2')
twolayernet <- matrix(0,nrow=(n1+n2),ncol=(n1+n2))
twolayernet[1:n1,1:n1] <- pp[[1]]
twolayernet[(n1+1):(n1+n2),(n1+1):(n1+n2)] <- pp2[[1]]
twolayernet[1:n1,(n1+1):(n1+n2)] <- A
twolayernet[(n1+1):(n1+n2),1:n1] <- t(A)
The following figure shows the the two-layer weighted co-expression network based on above simulation.
Given the network G with edges weight matrix W ∈ ℝn × n and nodes weight vector ${\bf z}\in\mathbb{R}^{n}$, where n is the number of nodes, we formulate the active modules identification on WGCN as a elastic net contrained optmization problem:
$$\min_{{\bf x}\in \mathbb{R}_+^n}\ F({\bf x})=-{\bf x}^TW{\bf x}-\lambda{\bf z}^T{\bf x}\quad s.t.\quad \alpha\|{\bf x}\|_1+(1-\alpha)\|{\bf x}\|_2^2=1$$
where the module membership vector ${\bf x}\in\mathbb{R}_+^n$ is relaxed from a 0 − 1 vector in which xi ≠ 0 means node i is in the module. And α is the parameter to balance ℓ1-norm and ℓ2-norm which actually controls the module size. Larger α means a more sparse vector, corresponding smaller module in this case. We adopt the euclidean projection based technique (Gong, Gai, and Zhang 2011) to solve the problem.
Here we show how to use the following function in the package to find an active module for above simulated single layer network. With groundtruth in hand, we can evaluate the quality of identified modules by F-score. In order to get higher quality, we need to tune parameter α in the elastic net penalty and λ = 1 in the objective function. The common way to select two optimal parameters is grid search.
n = 100
k = 20
theta = 0.5
pp <- networkSimulation(n,k,theta)
moduleid <- pp[[3]]
alphaset <- seq(0.1,0.9,by=0.1)
lambdaset <- 2^seq(-5,5)
## using a grid search to select lambda and alpha
Fscores <- matrix(0,nrow = length(alphaset),ncol = length(lambdaset))
for (j in 1:length(alphaset)) {
for (k in 1:length(lambdaset)) {
x <- moduleIdentificationGPFixSS(pp[[1]],pp[[2]],rep(1/n,n),maxiter = 500,
a=alphaset[j],lambda = lambdaset[k])
predictedid<-which(x[[2]]!=0)
recall <- length(intersect(predictedid,moduleid))/length(moduleid)
precise <- length(intersect(predictedid,moduleid))/length(predictedid)
Fscores[j,k] <- 2*precise*recall/(precise+recall)
}
}
We can show gridFscore by 3-D plot to see how these parameters affect the performance. By certain combination of these two parameters, we can almost exactly find the target model nodes with F − score = 1.
The basic idea to identification modules on a two-layer network is to
find two active modules on each layer, at the same time with maximal
inter-later links. we have function moduleIdentificationGPFixSSTwolayer
in the package. Following the two-layer network simulation in section 1,
we call the method.
## network simulation is the same as before
modres <- moduleIdentificationGPFixSSTwolayer(pp[[1]],pp[[2]],rep(1/n1,n1),pp2[[1]],pp2[[2]],rep(1/n2,n2),A)
predictedid <- which(modres[[1]]!=0)
recall <- length(intersect(predictedid,moduleid))/length(moduleid)
precise <- length(intersect(predictedid,moduleid))/length(predictedid)
F1 <- 2*precise*recall/(precise+recall)
predictedid2 <- which(modres[[2]]!=0)
recall2 <- length(intersect(predictedid2,moduleid2))/length(moduleid2)
precise2 <- length(intersect(predictedid2,moduleid2))/length(predictedid2)
F2 <- 2*precise2*recall2/(precise2+recall2)
And we can also select optimal parameters combination in a more sophisticated way based on the example in section 2.
A general multi-layer network is a natural extension of two-layer
networks. Here we consider a specific form of multi-layer network that
we could conduct simple operations. The basic idea to identification
modules on a two-layer network is to find two active modules on each
layer, at the same time with maximal inter-later links. we have function
moduleIdentificationGPFixSSMultilayer
in the package. Following the multi-layer network simulation, we call
the method as:
## network simulation
n = 100
k = 20
L = 5
theta = 0.5
cpl <- multilayernetworkSimulation(n,k,theta,L)
listz <- list()
for (i in 1:L){
listz[[i]] <- cpl[[i+2]]
}
moduleid <- cpl[[2]]
## use default parameters here
x <- moduleIdentificationGPFixSSMultilayer(cpl[[1]],listz,rep(1/n,n))
predictedid <- which(x[[2]]!=0)
recall <- length(intersect(predictedid,moduleid))/length(moduleid)
precise <- length(intersect(predictedid,moduleid))/length(predictedid)
Fscore <- (2*precise*recall/(precise+recall))
And we can also select optimal parameters combination in a more sophisticated way based on the example in section 2.
The usage of the package functions is the same for real-world data, but we need to be aware of two aspects. First of all the way to calculate edges score and nodes score in a weighted network can make an impact on the performance. Different input W and $\bf z$ in the objective function may lead to different modules.
Secondly, we do not have groundtruth about module membership for real world data. In this case, we may need to select the proper parameter so that the desired module size can be archived. When fixing λ = 0.01, we use a binary search method to select α for the elastic net penalty which controls the sparsity of the module.
## binary search parameter to fix module size to 100~200
abegin = 0.01
aend = 0.9
maxsize = 200
minsize = 100
for (i in 1:100) {
x <- moduleIdentificationGPFixSS(W,z,rep(1/n,n),a=(abegin+aend)/2,lambda = 0.001,maxiter = 500)
predictedid <- which(x[[2]]!=0)
if(length(predictedid) > maxsize){
abegin <- (abegin+aend)/2
}else if (length(predictedid) < minsize){
aend <- (abegin+aend)/2
}else
break
}
When dealing with large scale networks, pure R is proven to be slow. In the developing version we reimplement the core functions of AMOUNTAIN by C, in which the matrix operations are based on open source GSL. Currently we tested it on Linux platform. Here is a table of C-version functions and pure R functions:
C-version | Pure R | Brief description |
---|---|---|
CGPFixSS |
moduleIdentificationGPFixSS |
Module identification on single network |
CGPFixSSTwolayer |
moduleIdentificationGPFixSSTwolayer |
Module identification on two-layer network |
CGPFixSSMultiLayer |
moduleIdentificationGPFixSSMultilayer |
Module identification on multi-layer network |
We found that the most efficient way to call C functions is to
compile .c file into shared libraries (For Linux .so and Windows .dll)
and to use .C(xxx)
in R.
Although we could follow the standard way to use Rcpp and even RcppGSL to
make use of GSL, the data format transformation makes it slower. For
instance, you have to transform an array double *
into gsl_vector
or even RcppGSL::Vector
object by
filling each entry. It would cause additional overhead especially for
large scale data.
Finally, we can do gene annotation enrichment analysis with interactive tools like DAVID or Enrichr, to see whether a module gene list can be explained by existing biological process, pathways or even diseases.
Please visit AMOUNTAIN for new features.
Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] qgraph_1.9.8 AMOUNTAIN_1.33.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
## [4] ggplot2_3.5.1 htmlwidgets_1.6.4 psych_2.4.6.26
## [7] lattice_0.22-6 quadprog_1.5-8 vctrs_0.6.5
## [10] tools_4.4.1 stats4_4.4.1 parallel_4.4.1
## [13] tibble_3.2.1 fansi_1.0.6 highr_0.11
## [16] cluster_2.1.6 pkgconfig_2.0.3 Matrix_1.7-1
## [19] data.table_1.16.2 checkmate_2.3.2 lifecycle_1.0.4
## [22] compiler_4.4.1 stringr_1.5.1 munsell_0.5.1
## [25] mnormt_2.1.1 htmltools_0.5.8.1 sys_3.4.3
## [28] glasso_1.11 buildtools_1.0.0 sass_0.4.9
## [31] fdrtool_1.2.18 yaml_2.3.10 htmlTable_2.4.3
## [34] Formula_1.2-5 pillar_1.9.0 jquerylib_0.1.4
## [37] cachem_1.1.0 Hmisc_5.2-0 abind_1.4-8
## [40] rpart_4.1.23 nlme_3.1-166 lavaan_0.6-19
## [43] gtools_3.9.5 digest_0.6.37 stringi_1.8.4
## [46] reshape2_1.4.4 maketools_1.3.1 fastmap_1.2.0
## [49] grid_4.4.1 colorspace_2.1-1 cli_3.6.3
## [52] magrittr_2.0.3 base64enc_0.1-3 utf8_1.2.4
## [55] pbivnorm_0.6.0 foreign_0.8-87 corpcor_1.6.10
## [58] scales_1.3.0 backports_1.5.0 rmarkdown_2.28
## [61] jpeg_0.1-10 igraph_2.1.1 nnet_7.3-19
## [64] gridExtra_2.3 png_0.1-8 pbapply_1.7-2
## [67] evaluate_1.0.1 knitr_1.48 rlang_1.1.4
## [70] Rcpp_1.0.13 glue_1.8.0 BiocManager_1.30.25
## [73] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1
## [76] plyr_1.8.9