Gene Set Enrichment Analysis (GSEA) is a very powerful and interesting computational method that allows an easy correlation between differential expressed genes and biological processes. Unfortunately, although it was designed to help researchers to interpret gene expression data it can generate huge amounts of results whose biological meaning can be difficult to interpret.
Many available tools rely on the hierarchically structured Gene Ontology (GO) classification to reduce reundandcy in the results. However, due to the popularity of GSEA many more gene set collections, such as those in the Molecular Signatures Database (MSigDB), are emerging. Since these collections are not organized as those in GO, their usage for GSEA do not always give a straightforward answer or, in other words, getting all the meaninful information can be challenging with the currently available tools. For these reasons, GSEAmining was born to be an easy tool to create reproducible reports to help researchers make biological sense of GSEA outputs.
Given the results of GSEA, GSEAmining clusters the different gene sets collections based on the presence of the same genes in the leadind edge (core) subset. Leading edge subsets are those genes that contribute most to the enrichment score of each collection of genes or gene sets. For this reason, gene sets that participate in similar biological processes should share genes in common and in turn cluster together. After that, GSEAmining is able to identify and represent for each cluster:
In each case, positive and negative enrichments are shown in different colors so it is easy to distinguish biological processes or genes that may be of interest in that particular study.
You can install GSEAmining
from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GSEAmining")
Or directly from GitHub:
By default GSEAmining is designed to accept the resuls of the GSEA function from the clusterProfiler package. For more information about this function click here. However, it is not mandatory to use this package.
In this example the data corresponds to GSEA analysis of differential expressed genes from treated versus control samples in HGPalmer-PDX-P30 experiment. Differential gene expression (tableTop) was obtained using the oligo and limma R packages.
# A geneList contains three features:
# 1.numeric vector: fold change or other type of numerical variable
# 2.named vector: every number has a name, the corresponding gene ID
# 3.sorted vector: number should be sorted in decreasing order
tableTop_p30 <- as.data.frame(tableTop_p30)
geneList = tableTop_p30[,2]
names(geneList) = as.character(tableTop_p30[,1])
library(clusterProfiler)
# Read the .gmt file from MSigDB
gmtC2<- read.gmt("gmt files/c2.all.v7.1.symbols.gmt")
gmtC5<- read.gmt('gmt files/c5.all.v7.1.symbols.gmt')
gmtHALL <- read.gmt('gmt files/h.all.v7.1.symbols.gmt')
# Merge all the gene sets
gmt_all <- rbind(gmtC2, gmtC5, gmtHALL)
GSEA_p30<-GSEA(geneList, TERM2GENE = gmt_all, nPerm = 1000, pvalueCutoff = 0.5)
# Selection of gene sets with a specific thershold in terms of NES and p.adjust
genesets_sel <- GSEA_p30@result
Data should be in a data.frame with, at least three columns labelled as follows:
ID: The name of the gene set.
NES: Normalized Enrichment Score of the gene set.
core_enrichment: Genes that appear in the leading edge subset of the gene set.
# Structure of the data included in the package
data('genesets_sel', package = 'GSEAmining')
tibble::glimpse(genesets_sel)
#> Rows: 52
#> Columns: 4
#> $ ID <chr> "WATANABE_COLON_CANCER_MSI_VS_MSS_DN", "GO_RNA_BINDING…
#> $ NES <dbl> 2.511796, 2.234696, 2.200913, 2.167304, 2.127964, 2.11…
#> $ p.adjust <dbl> 0.02463671, 0.02716912, 0.03017176, 0.02612857, 0.0222…
#> $ core_enrichment <chr> "DACH1/DUOX2/GRM8/CDHR1/SPINK1/CEL/CYP2B6/C10orf99/SLC…
gm_filter: Filter the input data
GSEA outputs normally presents tens or hundreds of genesets but many of them may not meet the thresholds for considering them significantly enriched. For that it is better to filter the data for a better visualization.
The gm_filter()
function of GSEAmining allows this
process to be very easy, especially if the format of your data meets the
aforementioned criteria.
Using the gm_clust()
function, we can create an object
that will contain the hierarchical clustering of the gene sets according
to their genes in the core_enrichment column. This function accepts the
data frame created with gm_filter. In the process, a distance matrix is
calculated using the binary method (from the
dist()
function in stats
) and then a cluster
with the complete method (from hclust()
function
in stats
) is created.
The gm_dendplot()
function uses (i) the filtered data
frame and (ii) the gm_clust object to plot the cluster dendrogram. It
shows which gene sets are positively or negatively enriched coloring
them in red or blue respectively. Additionally by default the function
will highlight every other cluster with a rectangle to facilitate the
differentiation of all the clusters.
It is possible to tune the height of the dendrogram, the colors for the enrichment and the use of rectangles (and their sizes) for highlighting clusters.
To start getting an idea of what kind of biological processes are
represented in each cluster we can use gm_enrichterms()
.
This function plots word clouds for the most enriched terms in the names
of the different gene sets in each cluster. Again it needs (i) the
filtered data frame and (ii) the gm_clust object created before. By
default Positive or negative enrichment are colored in red or in blue
respectively and two ploting options are available:
gm_enrichterms(gs.filt, gs.cl)
#> Warning in ggwordcloud::geom_text_wordcloud_area(eccentricity = 1,
#> area_corr_power = 1, : Ignoring unknown parameters: `area_corr_power`
- Without clustering:
As for the gene set terms, it is also possible to evaluate which are
the most enriched genes in the leading edge (or core enrichment) subsets
in the different clusters with the function
gm_enrichcores()
. In this case the representation is shown
as bar plots following the same options as in
gm_enrichterms()
regarding colors and clustering the
results.
Additionally, this function allows the possibility to decide the
number of the most top genes that will be plotted using the parameter
top
. In case more than the indicated number of top genes
appear the same number of times in the same cluster they all will be
shown.
Finally, the gm_enrichreport()
is a tool that generates
a pdf file where each page contains both the terms word cloud and the
leading edge bar plot for each of the different clusters in the gm_clust
object.
The output
parameter can be used to name the file that
will be generated (by default it will be called gm_report.pdf). As in
the previous functions the colors and the number of top genes are also
editable parameters of the function.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] GSEAmining_1.17.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] janeaustenr_1.0.0 viridis_0.6.5 sass_0.4.9
#> [4] utf8_1.2.4 generics_0.1.3 slam_0.1-55
#> [7] xml2_1.3.6 lattice_0.22-6 stringi_1.8.4
#> [10] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
#> [13] grid_4.4.2 fastmap_1.2.0 jsonlite_1.8.9
#> [16] Matrix_1.7-1 tm_0.7-15 tidytext_0.4.2
#> [19] ggwordcloud_0.6.2 gridExtra_2.3 BiocManager_1.30.25
#> [22] purrr_1.0.2 fansi_1.0.6 viridisLite_0.4.2
#> [25] scales_1.3.0 jquerylib_0.1.4 cli_3.6.3
#> [28] rlang_1.1.4 tokenizers_0.3.0 commonmark_1.9.2
#> [31] munsell_0.5.1 withr_3.0.2 cachem_1.1.0
#> [34] yaml_2.3.10 parallel_4.4.2 NLP_0.3-2
#> [37] tools_4.4.2 dplyr_1.1.4 colorspace_2.1-1
#> [40] ggplot2_3.5.1 buildtools_1.0.0 vctrs_0.6.5
#> [43] R6_2.5.1 png_0.1-8 lifecycle_1.0.4
#> [46] stringr_1.5.1 pkgconfig_2.0.3 dendextend_1.19.0
#> [49] pillar_1.9.0 bslib_0.8.0 gtable_0.3.6
#> [52] glue_1.8.0 Rcpp_1.0.13-1 xfun_0.49
#> [55] tibble_3.2.1 tidyselect_1.2.1 sys_3.4.3
#> [58] knitr_1.49 farver_2.1.2 SnowballC_0.7.1
#> [61] htmltools_0.5.8.1 labeling_0.4.3 rmarkdown_2.29
#> [64] maketools_1.3.1 compiler_4.4.2 markdown_1.13
#> [67] gridtext_0.1.5