The multiGSEA
package was designed to run a robust
GSEA-based pathway enrichment for multiple omics layers. The enrichment
is calculated for each omics layer separately and aggregated p-values
are calculated afterwards to derive a composite multi-omics pathway
enrichment.
Pathway definitions can be downloaded from up to eight different
pathway databases by means of the graphite
Bioconductor
package (Sales, Calura, and Romualdi
2018). Feature mapping for transcripts and proteins is supported
towards Entrez Gene IDs, Uniprot, Gene Symbol, RefSeq, and Ensembl IDs.
The mapping is accomplished through the AnnotationDbi
package (Pagès et al. 2019) and currently
supported for 11 different model organisms including human, mouse, and
rat. ID conversion of metabolite features to Comptox Dashboard IDs
(DTXCID, DTXSID), CAS-numbers, Pubchem IDs (CID), HMDB, KEGG, ChEBI,
Drugbank IDs, or common metabolite names is accomplished through the
AnnotationHub package metabliteIDmapping
. This package
provides a comprehensive ID mapping for more than 1.1 million
entries.
This vignette covers a simple example workflow illustrating how the
multiGSEA
package works. The omics data sets that will be
used throughout the example were originally provided by Quiros et
al. (Quirós et al. 2017). In their
publication the authors analyzed the mitochondrial response to four
different toxicants, including Actinonin, Diclofenac, FCCB, and
Mito-Block (MB), within the transcriptome, proteome, and metabolome
layer. The transcriptome data can be downloaded from NCBI
GEO, the proteome data from the ProteomeXchange
Consortium, and the non-targeted metabolome raw data can be found in
the online supplement.
There are two different ways to install the multiGSEA
package.
First, the multiGSEA
package is part of Bioconductor.
Hence, the best way to install the package is via
BiocManager
. Start R (>=4.0.0) and run the following
commands:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
# The following initializes usage of Bioc devel
BiocManager::install(version = "devel")
BiocManager::install("multiGSEA")
Alternatively, the multiGSEA
package is hosted on our
github page https://github.com/yigbt/multiGSEA and can be installed
via the devtools
package:
Once installed, just load the multiGSEA
package
with:
A common workflow which is exemplified in this vignette is typically separated in the following steps:
multiGSEA
package, and omics data sets.At first, we need to load the necessary packages to run the
multi-omics enrichment analysis. In our example data we have to deal
with omics data that has been measured in human cell lines. We therefore
need the org.Hs.eg.db
package (Carlson 2019a) for transcript and protein
mapping. In case the omics data was measured in mouse or rat, we would
need the packages org.Mm.eg.db
(Carlson 2019b) and org.Rn.eg.db
(Carlson 2019c), respectively.
In principle, multiGSEA
is able to deal with 11
different model organisms. A summary of supported organisms, their
naming format within multiGSEA
and their respective
AnnotationDbi
package is shown in Table
@ref(tab:organismsTable).
Organisms | Abbreviations | Mapping |
---|---|---|
Human | hsapiens | org.Hs.eg.db |
Mouse | mmusculus | org.Mm.eg.db |
Rat | rnorvegicus | org.Rn.eg.db |
Dog | cfamiliaris | org.Cf.eg.db |
Cow | btaurus | org.Bt.eg.db |
Pig | sscrofa | org.Ss.eg.db |
Chicken | ggallus | org.Gg.eg.db |
Zebrafish | drerio | org.Xl.eg.db |
Frog | xlaevis | org.Dr.eg.db |
Fruit Fly | dmelanogaster | org.Dm.eg.db |
C.elegans | celegans | org.Ce.eg.db |
To run the analysis of this vignette, load the installed version of
multiGSEA
.
Load the omics data for each layer where an enrichment should be calculated. The example data is provided within the package and already preprocessed such that we have log2 transformed fold changes and their associated p-values.
# load transcriptomic features
data(transcriptome)
# load proteomic features
data(proteome)
# load metabolomic features
data(metabolome)
This example involves preprocessed omics data from public
repositories, which means that the data might look different when you
download and pre-process it with your own workflow. Therefore, we put
our processed data as an example data in the R package. We here sketch
out the pipeline described in the multiGSEA
paper. We will
not focus on the pre-processing steps and how to derive the necessary
input format for the multi-omics pathway enrichment in terms of
differentially expression analysis, since this is highly dependent on
your experiment and analysis workflow.
However, the required input format is quite simple and exactly the same for each input omics layer: A data frame with 3 mandatory columns, including feature IDs, the log2-transformed fold change (logFC), and the associated p-value.
The header of the data frame can be seen in Table @ref(tab:omicsTable):
Symbol | logFC | pValue | adj.pValue |
---|---|---|---|
A2M | -0.1615651 | 0.0000060 | 0.0002525 |
A2M-AS1 | 0.2352903 | 0.2622606 | 0.4594253 |
A4GALT | -0.0384392 | 0.3093539 | 0.5077487 |
AAAS | 0.0170947 | 0.5407324 | 0.7126172 |
AACS | -0.0260510 | 0.4034970 | 0.5954772 |
AADAT | 0.0819910 | 0.2355455 | 0.4285059 |
multiGSEA
works with nested lists where each sublist
represents an omics layer. The function rankFeatures
calculates the pre-ranked features, that are needed for the subsequent
calculation of the enrichment score. rankFeatures
calculates the a local statistic ls
based on the direction
of the fold change and the magnitude of its significance:
Please note, that any other rank metric will work as well and the choice on which one to chose is up to the user. However, as it was shown by Zyla et al. (Joanna Zyla and Polanska 2017), the choice of the applied metric might have a big impact on the outcome of your analysis.
# create data structure
omics_data <- initOmicsDataStructure(layer = c(
"transcriptome",
"proteome",
"metabolome"
))
## add transcriptome layer
omics_data$transcriptome <- rankFeatures(
transcriptome$logFC,
transcriptome$pValue
)
names(omics_data$transcriptome) <- transcriptome$Symbol
## add proteome layer
omics_data$proteome <- rankFeatures(proteome$logFC, proteome$pValue)
names(omics_data$proteome) <- proteome$Symbol
## add metabolome layer
## HMDB features have to be updated to the new HMDB format
omics_data$metabolome <- rankFeatures(metabolome$logFC, metabolome$pValue)
names(omics_data$metabolome) <- metabolome$HMDB
names(omics_data$metabolome) <- gsub(
"HMDB", "HMDB00",
names(omics_data$metabolome)
)
The first elements of each omics layer are shown below:
omics_short <- lapply(names(omics_data), function(name) {
head(omics_data[[name]])
})
names(omics_short) <- names(omics_data)
omics_short
## $transcriptome
## STC2 ASNS PCK2 FAM129A NUPR1 ASS1
## 13.43052 12.69346 12.10931 11.97085 11.81069 11.61673
##
## $proteome
## IFRD1 FAM129A FDFT1 ASNS CTH PCK2
## 10.818222 10.108260 -9.603185 9.327082 8.914447 8.908938
##
## $metabolome
## HMDB0000042 HMDB0003344 HMDB0000820 HMDB0000863 HMDB0006853 HMDB0013785
## -1.404669 -1.404669 -3.886828 -3.886828 -3.130641 -3.130641
Now we have to select the databases we want to query and the omics layer we are interested in before pathway definitions are downloaded and features are mapped to the desired format.
databases <- c("kegg", "reactome")
layers <- names(omics_data)
pathways <- getMultiOmicsFeatures(
dbs = databases, layer = layers,
returnTranscriptome = "SYMBOL",
returnProteome = "SYMBOL",
returnMetabolome = "HMDB",
useLocal = FALSE
)
The first two pathway definitions of each omics layer are shown below:
pathways_short <- lapply(names(pathways), function(name) {
head(pathways[[name]], 2)
})
names(pathways_short) <- names(pathways)
pathways_short
## $transcriptome
## $transcriptome$`(KEGG) Glycolysis / Gluconeogenesis`
## [1] "AKR1A1" "ADH1A" "ADH1B" "ADH1C" "ADH4" "ADH5" "ADH6"
## [8] "GALM" "ADH7" "LDHAL6A" "DLAT" "DLD" "ENO1" "ENO2"
## [15] "ENO3" "ALDH2" "ALDH3A1" "ALDH1B1" "FBP1" "ALDH3B1" "ALDH3B2"
## [22] "ALDH9A1" "ALDH3A2" "ALDOA" "ALDOB" "ALDOC" "G6PC1" "GAPDH"
## [29] "GAPDHS" "GCK" "GPI" "HK1" "HK2" "HK3" "ENO4"
## [36] "LDHA" "LDHB" "LDHC" "PGAM4" "ALDH7A1" "PCK1" "PCK2"
## [43] "PDHA1" "PDHA2" "PDHB" "PFKL" "PFKM" "PFKP" "PGAM1"
## [50] "PGAM2" "PGK1" "PGK2" "PGM1" "PKLR" "PKM" "PGM2"
## [57] "ACSS2" "G6PC2" "BPGM" "TPI1" "HKDC1" "ADPGK" "ACSS1"
## [64] "FBP2" "LDHAL6B" "G6PC3" "MINPP1"
##
## $transcriptome$`(KEGG) Citrate cycle (TCA cycle)`
## [1] "CS" "DLAT" "DLD" "DLST" "FH" "IDH1" "IDH2" "IDH3A"
## [9] "IDH3B" "IDH3G" "MDH1" "MDH2" "ACLY" "ACO1" "OGDH" "ACO2"
## [17] "PC" "PDHA1" "PDHA2" "PDHB" "OGDHL" "SDHA" "SDHB" "SDHC"
## [25] "SDHD" "SUCLG2" "SUCLG1" "SUCLA2" "PCK1" "PCK2"
##
##
## $proteome
## $proteome$`(KEGG) Glycolysis / Gluconeogenesis`
## [1] "AKR1A1" "ADH1A" "ADH1B" "ADH1C" "ADH4" "ADH5" "ADH6"
## [8] "GALM" "ADH7" "LDHAL6A" "DLAT" "DLD" "ENO1" "ENO2"
## [15] "ENO3" "ALDH2" "ALDH3A1" "ALDH1B1" "FBP1" "ALDH3B1" "ALDH3B2"
## [22] "ALDH9A1" "ALDH3A2" "ALDOA" "ALDOB" "ALDOC" "G6PC1" "GAPDH"
## [29] "GAPDHS" "GCK" "GPI" "HK1" "HK2" "HK3" "ENO4"
## [36] "LDHA" "LDHB" "LDHC" "PGAM4" "ALDH7A1" "PCK1" "PCK2"
## [43] "PDHA1" "PDHA2" "PDHB" "PFKL" "PFKM" "PFKP" "PGAM1"
## [50] "PGAM2" "PGK1" "PGK2" "PGM1" "PKLR" "PKM" "PGM2"
## [57] "ACSS2" "G6PC2" "BPGM" "TPI1" "HKDC1" "ADPGK" "ACSS1"
## [64] "FBP2" "LDHAL6B" "G6PC3" "MINPP1"
##
## $proteome$`(KEGG) Citrate cycle (TCA cycle)`
## [1] "CS" "DLAT" "DLD" "DLST" "FH" "IDH1" "IDH2" "IDH3A"
## [9] "IDH3B" "IDH3G" "MDH1" "MDH2" "ACLY" "ACO1" "OGDH" "ACO2"
## [17] "PC" "PDHA1" "PDHA2" "PDHB" "OGDHL" "SDHA" "SDHB" "SDHC"
## [25] "SDHD" "SUCLG2" "SUCLG1" "SUCLA2" "PCK1" "PCK2"
##
##
## $metabolome
## $metabolome$`(KEGG) Glycolysis / Gluconeogenesis`
## [1] "HMDB0001586" "HMDB0001206" "HMDB0001294" "HMDB0001270" "HMDB0000122"
## [6] "HMDB0003498" "HMDB0003391" "HMDB0060180" "HMDB0003904" "HMDB0000108"
## [11] "HMDB0000223" "HMDB0000243" "HMDB0000042" "HMDB0000190" "HMDB0000990"
## [16] "HMDB0001473" "HMDB0003345" "HMDB0000263" "HMDB0001112" "HMDB0000124"
##
## $metabolome$`(KEGG) Citrate cycle (TCA cycle)`
## [1] "HMDB0001206" "HMDB0003974" "HMDB0001022" "HMDB0006744" "HMDB0003904"
## [6] "HMDB0000094" "HMDB0000134" "HMDB0000223" "HMDB0000243" "HMDB0000254"
## [11] "HMDB0000208" "HMDB0000263" "HMDB0000156" "HMDB0000072" "HMDB0000193"
At this stage, we have the ranked features for each omics layer and
the extracted and mapped features from external pathway databases. In
the following step we can use the multiGSEA
function to
calculate the enrichment for all omics layer separately.
# use the multiGSEA function to calculate the enrichment scores
# for all omics layer at once.
enrichment_scores <- multiGSEA(pathways, omics_data)
The enrichment score data structure is a list containing sublists
named transcriptome
, proteome
, and
metabolome
. Each sublist represents the complete pathway
enrichment for this omics layer.
Making use of the Stouffers Z-method to combine multiple p-values
that have been derived from independent tests that are based on the same
null hypothesis. The function extractPvalues
creates a
simple data frame where each row represents a pathway and columns
represent omics related p-values and adjusted p-values. This data
structure can then be used to calculate the aggregated p-value. The
subsequent calculation of adjusted p-values can be achieved by the
function p.adjust
.
multiGSEA
provided three different methods to aggregate
p-values. These methods differ in their way how they weight either small
or large p-values. By default, combinePvalues
will apply
the Z-method or Stouffer’s method (Stouffer et
al. 1949) which has no bias towards small or large p-values. The
widely used Fisher’s combined probability test (Fisher 1932) can also be applied but is known
for its bias towards small p-values. Edgington’s method goes the
opposite direction by favoring large p-values (Edgington 1972). Those methods can be applied
by setting the parameter method
to “fisher” or
“edgington”.
df <- extractPvalues(
enrichmentScores = enrichment_scores,
pathwayNames = names(pathways[[1]])
)
df$combined_pval <- combinePvalues(df)
df$combined_padj <- p.adjust(df$combined_pval, method = "BH")
df <- cbind(data.frame(pathway = names(pathways[[1]])), df)
Please note: In former versions of
multiGSEA
, adjusted p-values of individual omics-layers
have been combined. The correction for multiple testing prior to the
p-value combination will, however, violate the assumptions about
p-values of certain combination methods (e.g., Fisher’s combined
probability test).
Therefore, multiGSEA uses p-values for combination from
RELEASE 3.18. To keep previous results reproducible, we
introduced the col_pattern
parameter which can be set to
padj
to use adjusted p-values.
Finally, print the pathways sorted based on their combined adjusted p-values. For displaying reasons, only the adjusted p-values are shown in Table @ref(tab:resultTable).
pathway | transcriptome_padj | proteome_padj | metabolome_padj | combined_pval |
---|---|---|---|---|
(KEGG) Metabolic pathways | 0.2334681 | 0.0000000 | 0.0000001 | 0 |
(REACTOME) Cell Cycle | 0.0000000 | 0.0000000 | 0.7285786 | 0 |
(REACTOME) Amino acid and derivative metabolism | 0.0001129 | 0.0000000 | 0.2929289 | 0 |
(REACTOME) Cell Cycle, Mitotic | 0.0000000 | 0.0000061 | 0.7285786 | 0 |
(REACTOME) Metabolism | 0.1379817 | 0.0000000 | 0.0019073 | 0 |
(KEGG) Carbon metabolism | 0.0695093 | 0.0000074 | 0.0004568 | 0 |
(KEGG) Biosynthesis of amino acids | 0.0009925 | 0.0000000 | 0.3390800 | 0 |
(REACTOME) Selenocysteine synthesis | 0.0062946 | 0.0000002 | 0.1926771 | 0 |
(REACTOME) Cholesterol biosynthesis | 0.0000000 | 0.0005119 | 0.6061027 | 0 |
(REACTOME) Signaling Pathways | 0.0000000 | 0.0013548 | 0.3390800 | 0 |
(REACTOME) Selenoamino acid metabolism | 0.0036377 | 0.0000000 | 0.5188184 | 0 |
(REACTOME) RHO GTPase Effectors | 0.0000000 | 0.0000943 | 0.8194159 | 0 |
(REACTOME) M Phase | 0.0000000 | 0.0105863 | 0.7285786 | 0 |
(KEGG) Steroid biosynthesis | 0.0000002 | 0.0001654 | 0.5532487 | 0 |
(REACTOME) Glutathione synthesis and recycling | 0.0560028 | 0.0000229 | 0.0751749 | 0 |
In principle, multiGSEA
can also be run as single/multi
omics analysis on custom gene sets.
The pathways
object storing the pathway features across
multiple omics layers is a nested list, and hence can be designed
manually to fit ones purposes.
In the following example, HALLMARK gene sets are retrieved from
MSigDB
and used to create a transcriptomics input list:
library(msigdbr)
library(dplyr)
hallmark <- msigdbr(species = "Rattus norvegicus", category = "H") %>%
dplyr::select(gs_name, ensembl_gene) %>%
dplyr::filter(!is.na(ensembl_gene)) %>%
unstack(ensembl_gene ~ gs_name)
pathways <- list("transcriptome" = hallmark)
Please note, feature sets across multiple omics layers have to be in the same order and their names have to be identical, see the example presented above.
Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] magrittr_2.0.3 multiGSEA_1.17.2 org.Hs.eg.db_3.20.0
## [4] AnnotationDbi_1.69.0 IRanges_2.41.1 S4Vectors_0.45.2
## [7] Biobase_2.67.0 BiocGenerics_0.53.3 generics_0.1.3
## [10] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.2 DBI_1.2.3
## [3] mnormt_2.1.1 sandwich_3.1-1
## [5] rlang_1.1.4 multcomp_1.4-26
## [7] qqconf_1.3.2 compiler_4.4.2
## [9] RSQLite_2.3.8 png_0.1-8
## [11] vctrs_0.6.5 pkgconfig_2.0.3
## [13] crayon_1.5.3 fastmap_1.2.0
## [15] dbplyr_2.5.0 XVector_0.47.0
## [17] utf8_1.2.4 rmarkdown_2.29
## [19] graph_1.85.0 UCSC.utils_1.3.0
## [21] purrr_1.0.2 bit_4.5.0
## [23] xfun_0.49 zlibbioc_1.52.0
## [25] cachem_1.1.0 graphite_1.53.0
## [27] GenomeInfoDb_1.43.1 jsonlite_1.8.9
## [29] blob_1.2.4 BiocParallel_1.41.0
## [31] parallel_4.4.2 R6_2.5.1
## [33] bslib_0.8.0 mutoss_0.1-13
## [35] jquerylib_0.1.4 numDeriv_2016.8-1.1
## [37] Rcpp_1.0.13-1 knitr_1.49
## [39] zoo_1.8-12 splines_4.4.2
## [41] Matrix_1.7-1 tidyselect_1.2.1
## [43] yaml_2.3.10 codetools_0.2-20
## [45] curl_6.0.1 lattice_0.22-6
## [47] tibble_3.2.1 withr_3.0.2
## [49] KEGGREST_1.47.0 evaluate_1.0.1
## [51] metaboliteIDmapping_1.0.0 survival_3.7-0
## [53] BiocFileCache_2.15.0 Biostrings_2.75.1
## [55] pillar_1.9.0 BiocManager_1.30.25
## [57] filelock_1.0.3 sn_2.1.1
## [59] mathjaxr_1.6-0 BiocVersion_3.21.1
## [61] ggplot2_3.5.1 munsell_0.5.1
## [63] scales_1.3.0 TFisher_0.2.0
## [65] glue_1.8.0 maketools_1.3.1
## [67] tools_4.4.2 metap_1.11
## [69] AnnotationHub_3.15.0 sys_3.4.3
## [71] data.table_1.16.2 fgsea_1.33.0
## [73] buildtools_1.0.0 mvtnorm_1.3-2
## [75] fastmatch_1.1-4 cowplot_1.1.3
## [77] grid_4.4.2 plotrix_3.8-4
## [79] rbibutils_2.3 colorspace_2.1-1
## [81] GenomeInfoDbData_1.2.13 cli_3.6.3
## [83] rappdirs_0.3.3 fansi_1.0.6
## [85] dplyr_1.1.4 gtable_0.3.6
## [87] sass_0.4.9 digest_0.6.37
## [89] TH.data_1.1-2 multtest_2.63.0
## [91] memoise_2.0.1 htmltools_0.5.8.1
## [93] lifecycle_1.0.4 httr_1.4.7
## [95] mime_0.12 MASS_7.3-61
## [97] bit64_4.5.2