Co-expressed gene-set enrichment analysis, cogena, is a workflow for gene set enrichment analysis of co-expressed genes. The cogena worklow (Figure 1) proceeds from co-expression analysis using a range of clustering methods, through to gene set enrichment analysis based on a range of pre-defined gene sets. Cogena can be applied to a number of different analytical scenarios dependent on the gene set used. Currently cogena is pre-built with gene sets from Msigdb and Connectivity Map, allowing pathway analysis and drug repositioning analysis respectively, the user can also add custom genes sets to expand analytical functionality further.
The following sections outline a typical example of the cogena workflow, describing the input data and typical analysis steps required for pathway analysis and drug repositioning analysis.
See examples using the ?cogena
command in R.
Note: all the gene names should be gene SYMBOLs since they are used in
the gene set files. Other kinds of gene identifiers can be used
according to the identifiers used in the user-defined gene-set files.
The cogena
package has an example dataset, from the NCBI
GEO database GSE13355.
The samples are derived from lesional and non-lesional skin of psoriasis
patients. There are two objects in the Psoriasis dataset. See
?Psoriasis
for more details.
## Warning: replacing previous import 'class::somgrid' by 'kohonen::somgrid' when
## loading 'cogena'
data(Psoriasis)
# objects in the Psoriasis dataset.
# Note: label of interest should follow the control label as this
# will affect the direction of gene regulation.
# For instance use factor (c("Normal", "Cancer", "Normal"),
# levels=c("Normal", "Cancer")), instead of factor(c("Normal",
# "Cancer","Normal")) since "Cancer" is the label of interest.
## [1] "DEexprs" "sampleLabel"
The gene set used determines the type of analysis.
There are a variety of gene sets in the cogena
package,
partly collected from MSigDB
and CMap. Gene sets are
summarized in Table 2.
Table 2. Cogena Gene Sets
Gene Set | Description |
---|---|
c2.cp.kegg.v7.1.symbols.gmt.xz | KEGG gene sets |
c2.cp.reactome.v7.1.symbols.gmt.xz | Reactome gene sets |
c2.cp.v7.0.symbols.gmt.xz | all canonical pathways |
c5.bp.v7.0.symbols.gmt.xz | GO biological processes |
c5.mf.v7.0.symbols.gmt.xz | GO molecular functions |
CmapDn100.gmt.xz | Connectivity map gene sets: top 100 down regulated per drug |
CmapUp100.gmt.xz | Connectivity map gene sets: top 100 up regulated per drug |
User-defined gene-sets must be formatted gmt and/or compressed by xz, (such as c2.cp.kegg.v7.01.symbols.gmt or c2.cp.kegg.v7.01.symbols.gmt.xz). Gene sets should be copied to the extdata directory in the installation directory of cogena, such as ~/R/x86_64-pc-linux-gnu- library/3.2/cogena/extdata in Linux), or kindly send to the author of cogena to share with others.
Firstly, KEGG Pathway Analysis, will be demonstrated to show the utility of cogena. The other analyses based on cogena are similar to the process of pathway analysis. Here we used the KEGG pathway gene set (c2.cp.kegg.v7.01.symbols.gmt.xz), hierarchical and Pam clustering methods, 10 clusters, 2 CPU cores and “correlation” distance metric to set up the pathway analysis.
# KEGG Pathway gene set
annoGMT <- "c2.cp.kegg.v7.01.symbols.gmt.xz"
# GO biological process gene set
# annoGMT <- "c5.bp.v7.0.symbols.gmt.xz"
annofile <- system.file("extdata", annoGMT, package="cogena")
# the number of clusters. It can be a vector.
# nClust <- 2:20
nClust <- 10
# Making factor "Psoriasis" behind factor "ct" means Psoriasis Vs Control
# is up-regualted
sampleLabel <- factor(sampleLabel, levels=c("ct", "Psoriasis"))
# the number of cores.
# ncore <- 8
ncore <- 2
# the clustering methods
# clMethods <- c("hierarchical","kmeans","diana","fanny","som","model",
# "sota","pam","clara","agnes") # All the methods can be used together.
clMethods <- c("hierarchical","pam")
# the distance metric
metric <- "correlation"
# the agglomeration method used for hierarchical clustering
# (hierarchical and agnes)
method <- "complete"
There are two steps for cogena analysis, co-expression analysis and then gene set enrichment analysis (here is pathway anlysis).
# Co-expression Analysis
genecl_result <- coExp(DEexprs, nClust=nClust, clMethods=clMethods,
metric=metric, method=method, ncore=ncore)
# Enrichment (Pathway) analysis for the co-expressed genes
clen_res <- clEnrich(genecl_result, annofile=annofile, sampleLabel=sampleLabel)
## Note: 671 out of 706 exist in the genes population.
##
## Clustering Methods:
## hierarchical pam
##
## The Number of Clusters:
## 10
##
## Metric of Distance Matrix:
## correlation
##
## Agglomeration method for hierarchical clustering (hclust and agnes):
## complete
##
## Gene set:
## c2.cp.kegg.v7.01.symbols.gmt.xz
After completing the cogena analysis, the user can use
summary
to see the summary of the result of cogena. And
enrichment
caculates the enrichment score of certain
clustering methods and certain numbers of cluster.
Cogena does not automatically set the clustering method or the number of clusters. Here we show some principles to guide the user towards optimal selection of method and number of clusters:
##
## Clustering Methods:
## hierarchical pam
##
## The Number of Clusters:
## 10
##
## Metric of Distance Matrix:
## correlation
##
## Agglomeration method for hierarchical clustering (hclust and agnes):
## complete
##
## Gene set:
## c2.cp.kegg.v7.01.symbols.gmt.xz
heatmapCluster
is developed to show the co-expression of
differentially expressed genes. Figure 2 produced by
heatmapCluster
is an enhanced heatmap with co-expression
information. Moreover, it is obvious to know which cluster contains
up-regulated or down-regulated genes based on the colour.
# Always make the number as character, please!
heatmapCluster(clen_res, "pam", "10", maintitle="Psoriasis")
## The number of genes in each cluster:
## upDownGene
## 1 2
## 468 238
## cluster_size
## 1 2 3 4 5 6 7 8 9 10
## 158 65 38 92 50 67 63 94 61 18
heatmapPEI
can be used to show the enrichment graph. See
Figure 3. See ?heatmapPEI
for more details. Many parameters
are configurable, while generally the default will be fine.
# The enrichment score for 10 clusters, together with Down-regulated,
# Up-regulated and All DE genes. The values shown in Figure 2 is the -log2(FDR).
#
# Always make the number as character, please!
heatmapPEI(clen_res, "pam", "10", printGS=FALSE, maintitle="Pathway analysis for Psoriasis")
## Joining with `by = join_by(rowname)`
## Joining with `by = join_by(clusterNumGene, GS)`
## Warning: Vectorized input to `element_text()` is not officially supported.
## ℹ Results may be unexpected or may change in future versions of ggplot2.
Figure 3A shows the pathway enrichment for each cluster as well as up-regulated, down-regulated and all the differentially expressed genes. The enrichment scores can be ranked by a certain cluster or the max or average scores of all the scores for each pathway.
heatmapPEI(clen_res, "pam", "10", printGS=FALSE, maintitle="Pathway analysis for Psoriasis", geom="circle")
## Joining with `by = join_by(rowname)`
## Joining with `by = join_by(clusterNumGene, GS)`
## Warning: Vectorized input to `element_text()` is not officially supported.
## ℹ Results may be unexpected or may change in future versions of ggplot2.
Figure 3B shows the pathway enrichment using circle plotting.
Pathway analysis demonstrates that specific disease pathways are often represented by a single cluster. Accordingly, we recommend that drug repositioning is performed based on co-expressed gene clusters instead of all the differentially expressed genes. If the input of cogena is disease related data, the drugs enriched should recover the gene expression changed by the disease (the drug should induce an opposite direction in expression to the disease), while if the input is drug related, the enriched drugs should show similar gene expression changes caused by the drug studied. Here we show drugs for treating psoriasis, an autoimmune disease.
The drug repositioning gene set choice of CmapDn100.gmt.xz or CmapUp100.gmt.xz should be made based on the regulation direction of clusters. For example, as the 7th cluster contains up-regulated genes for psoriasis, the CmapDn100.gmt.xz is chosen for drug repositioning of psoriasis to recover the gene expression changes caused by the disease.
# A comprehensive way
# cmapDn100_cogena_result <- clEnrich(genecl_result,
# annofile=system.file("extdata", "CmapDn100.gmt.xz", package="cogena"),
# sampleLabel=sampleLabel)
# A quick way
# Based on the pathway analysis results
cmapDn100_cogena_result <- clEnrich_one(genecl_result, "pam", "10",
annofile=system.file("extdata", "CmapDn100.gmt.xz", package="cogena"),
sampleLabel=sampleLabel)
Showing the results ordered by the 7th cluster in Figure 5. The
parameter orderMethod
is used to order the results.
heatmapPEI(cmapDn100_cogena_result, "pam", "10", printGS=FALSE,
orderMethod = "7", maintitle="Drug repositioning for Psoriasis")
## Joining with `by = join_by(rowname)`
## Joining with `by = join_by(clusterNumGene, GS)`
## Warning: Vectorized input to `element_text()` is not officially supported.
## ℹ Results may be unexpected or may change in future versions of ggplot2.
# Results based on cluster 5.
# heatmapPEI(cmapDn100_cogena_result, "pam", "10", printGS=FALSE,
# orderMethod = "5", maintitle="Drug repositioning for Psoriasis")
# Results based on cluster 9, containing down-regulated genes.
# heatmapPEI(cmapUp100_cogena_result, "pam", "10", printGS=FALSE,
# orderMethod = "9", maintitle="Drug repositioning for Psoriasis")
Usually there is more than one instance for a drug with different
doses or time-points in the Cmap gene set. heatmapCmap
can
merge the multi-instance results based on parameter mergeMethod
(“mean” or “max”). Figure 6 shows the multi-instance merged results
ordered by the 7th cluster.
heatmapCmap(cmapDn100_cogena_result, "pam", "10", printGS=FALSE,
orderMethod = "7", maintitle="Drug repositioning for Psoriasis")
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## ℹ The deprecated feature was likely used in the cogena package.
## Please report the issue at <https://github.com/zhilongjia/cogena/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Vectorized input to `element_text()` is not officially supported.
## ℹ Results may be unexpected or may change in future versions of ggplot2.
The user can obtain the genes in a certain cluster via
geneInCluster
, enabling other analyses, such as drug
target identification.
# Always make the number as character, please!
geneC <- geneInCluster(clen_res, "pam", "10", "4")
head(geneC)
## [1] "CD47" "SERPINB13" "PNP" "MPZL2" "KCNJ15" "SOX7"
It can be obtained by geneExpInCluster
. There are two
items, clusterGeneExp
and label
, in the
returned object of geneExpInCluster
. It can be used for
other application.
# Always make the number as character, please!
gec <- geneExpInCluster(clen_res, "pam", "10")
gec$clusterGeneExp[1:3, 1:4]
## cluster_id GSM337261 GSM337262 GSM337263
## PI3 1 6.556556 6.040479 7.033708
## S100A7A 1 4.989918 4.971686 5.677227
## S100A12 1 4.873823 5.168421 5.255036
## GSM337261 GSM337262 GSM337263 GSM337264
## ct ct ct ct
## Levels: ct Psoriasis
System info
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cogena_1.41.0 kohonen_3.0.12 ggplot2_3.5.1 cluster_2.1.6 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] amap_0.8-20 tidyselect_1.2.1 farver_2.1.2
## [4] dplyr_1.1.4 bitops_1.0-9 fastmap_1.2.0
## [7] promises_1.3.0 digest_0.6.37 mime_0.12
## [10] lifecycle_1.0.4 ellipsis_0.3.2 magrittr_2.0.3
## [13] compiler_4.4.2 rlang_1.1.4 sass_0.4.9
## [16] tools_4.4.2 utf8_1.2.4 yaml_2.3.10
## [19] corrplot_0.95 knitr_1.49 labeling_0.4.3
## [22] htmlwidgets_1.6.4 pkgbuild_1.4.5 mclust_6.1.1
## [25] plyr_1.8.9 pkgload_1.4.0 biwt_1.0.1
## [28] KernSmooth_2.23-24 miniUI_0.1.1.1 withr_3.0.2
## [31] purrr_1.0.2 BiocGenerics_0.53.3 sys_3.4.3
## [34] desc_1.4.3 grid_4.4.2 fansi_1.0.6
## [37] urlchecker_1.0.1 profvis_0.4.0 caTools_1.18.3
## [40] xtable_1.8-4 colorspace_2.1-1 fastcluster_1.2.6
## [43] scales_1.3.0 gtools_3.9.5 iterators_1.0.14
## [46] MASS_7.3-61 cli_3.6.3 crayon_1.5.3
## [49] generics_0.1.3 remotes_2.5.0 rstudioapi_0.17.1
## [52] robustbase_0.99-4-1 reshape2_1.4.4 sessioninfo_1.2.2
## [55] cachem_1.1.0 stringr_1.5.1 parallel_4.4.2
## [58] vctrs_0.6.5 devtools_2.4.5 Matrix_1.7-1
## [61] jsonlite_1.8.9 maketools_1.3.1 foreach_1.5.2
## [64] tidyr_1.3.1 jquerylib_0.1.4 glue_1.8.0
## [67] DEoptimR_1.1-3-1 codetools_0.2-20 stringi_1.8.4
## [70] gtable_0.3.6 later_1.3.2 munsell_0.5.1
## [73] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
## [76] gplots_3.2.0 R6_2.5.1 doParallel_1.0.17
## [79] rprojroot_2.0.4 evaluate_1.0.1 shiny_1.9.1
## [82] lattice_0.22-6 Biobase_2.67.0 apcluster_1.4.13
## [85] memoise_2.0.1 httpuv_1.6.15 bslib_0.8.0
## [88] class_7.3-22 Rcpp_1.0.13-1 xfun_0.49
## [91] fs_1.6.5 buildtools_1.0.0 usethis_3.0.0
## [94] pkgconfig_2.0.3