For the demonstration of escape, we will use the example
“pbmc_small” data from Seurat and also
generate a SingleCellExperiment
object from it.
suppressPackageStartupMessages(library(escape))
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(scran))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(SeuratObject))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(ggplot2))
pbmc_small <- get("pbmc_small")
sce.pbmc <- as.SingleCellExperiment(pbmc_small, assay = "RNA")
The first step in the process of performing gene set enrichment
analysis is identifying the gene sets we would like to use. The function
getGeneSets()
allows users to isolate a whole or multiple
libraries from a list of GSEABase GeneSetCollection objects.
We can do this for gene set collections from the built-in Molecular Signature Database by setting the parameter library equal to library/libraries of interest. For multiple libraries, just set library = c(“Library1”, “Library2”, etc).
Additional parameters include:
If the sequencing of the single-cell data is performed on a species
other than “Homo sapiens”, make sure to use the species
parameter in getGeneSets()
in order to get the correct gene
nomenclature.
Several popular methods exist for Gene Set Enrichment Analysis (GSEA). These methods can vary in the underlying assumptions. escape incorporates several methods that are particularly advantageous for single-cell RNA values:
This method calculates the enrichment score using a rank-normalized approach and generating an empirical cumulative distribution function for each individual cell. The enrichment score is defined for a gene set (G) using the number of genes in the gene set (NG) and total number of genes (N).
$$ ES(G,S) \sum_{i = 1}^{n} [P_W^G(G,S,i)-P_{NG}(G,S,i)] $$ Please see the following citation for more information.
GSVA varies slightly by estimating a Poisson-based kernel cumulative density function. But like ssGSEA, the ultimate enrichment score reported is based on the maximum value of the random walk statistic. GSVA appears to have better overall consistency and runs faster than ssGSEA.
ESjkmax = Vjk[maxl = 1, ..., p|vjk(l)] Please see the following citation for more information.
In contrast to ssGSEA and GSVA, AUCell takes the gene rankings for each cell and step-wise plots the position of each gene in the gene set along the y-axis. The output score is the area under the curve for this approach.
Please see the following citation for more information.
UCell calculates a Mann-Whitney U statistic based on the gene rank list. Importantly, UCell has a cut-off for ranked genes (rmax) at 1500 - this is per design as drop-out in single-cell can alter enrichment results. This also substantially speeds the calculations up.
The enrichment score output is then calculated using the complement of the U statistic scaled by the gene set size and cut-off.
$$ U_j^` = 1-\frac{U_j}{n \bullet r_{max}} $$
Please see the following citation for more information.
escape has 2 major functions - the first being
escape.matrix()
, which serves as the backbone of enrichment
calculations. Using count-level data supplied from a single-cell object
or matrix, escape.matrix()
will produce an enrichment score
for the individual cells with the gene sets selected and output the
values as a matrix.
method
groups
min.size
normalize
make.positive
Cautionary note: make.positive was added to allow for differential analysis downstream of enrichment as some methods may produce negative values. It preserves log-fold change, but ultimately modifies the enrichment values and should be used with caution.
enrichment.scores <- escape.matrix(pbmc_small,
gene.sets = GS.hallmark,
groups = 1000,
min.size = 5)
## [1] "Using sets of 1000 cells. Running 1 times."
ggplot(data = as.data.frame(enrichment.scores),
mapping = aes(enrichment.scores[,1], enrichment.scores[,2])) +
geom_point() +
theme_classic() +
theme(axis.title = element_blank())
Multi-core support is for all methods is available through BiocParallel.
To add more cores, use the argument BPPARAM to
escape.matrix()
. Here we will use the
SnowParam()
for it’s support across platforms and
explicitly call 2 workers (or cores).
Alternatively, we can use runEscape()
to calculate the
enrichment score and directly attach the output to a single-cell object.
The additional parameter for ``runEscape
is
new.assay.name, in order to save the enrichment scores
as a custom assay in the single-cell object.
pbmc_small <- runEscape(pbmc_small,
method = "ssGSEA",
gene.sets = GS.hallmark,
groups = 1000,
min.size = 5,
new.assay.name = "escape.ssGSEA")
## [1] "Using sets of 1000 cells. Running 1 times."
sce.pbmc <- runEscape(sce.pbmc,
method = "UCell",
gene.sets = GS.hallmark,
groups = 1000,
min.size = 5,
new.assay.name = "escape.UCell")
## [1] "Using sets of 1000 cells. Running 1 times."
We can quickly examine the attached enrichment scores using the
visualization/workflow we prefer - here we will use just
FeaturePlot()
from the Seurat R package.
Although we glossed over the normalization that can be used in
escape.matrix()
and runEscape()
, it is worth
mentioning here as normalization can affect all downstream analyses.
There can be inherent bias in enrichment values due to drop out in
single-cell expression data. Cells with larger numbers of features and
counts will likely have higher enrichment values.
performNormalization()
will normalize the enrichment values
by calculating the number of genes expressed in each gene set and cell.
This is similar to the normalization in classic GSEA and it will be
stored in a new assay.
pbmc_small <- performNormalization(sc.data = pbmc_small,
assay = "escape.ssGSEA",
gene.sets = GS.hallmark)
## [1] "Calculating features per cell..."
## [1] "Normalizing enrichment scores per cell..."
An alternative for scaling by expressed gene sets would be to use a scaling factor previously calculated during normal single-cell data processing and quality control. This can be done using the scale.factor argument and providing a vector.
pbmc_small <- performNormalization(sc.data = pbmc_small,
assay = "escape.ssGSEA",
gene.sets = GS.hallmark,
scale.factor = pbmc_small$nFeature_RNA)
## [1] "Normalizing enrichment scores per cell..."
performNormalization()
has an additional parameter
make.positive. Across the individual gene sets, if
negative normalized enrichment scores are seen, the minimum value is
added to all values. For example if the normalized enrichment scores
(after the above accounting for drop out) ranges from -50 to 50,
make.positive will adjust the range to 0 to 100 (by
adding 50). This allows for compatible log2-fold change downstream, but
can alter the enrichment score interpretation.
There are a number of ways to look at the enrichment values
downstream of runEscape()
with the myriad plotting and
visualizations functions/packages for single-cell data. escape
include several additional plotting functions to assist in the
analysis.
We can examine the enrichment values across our gene sets by using
heatmapEnrichment()
. This visualization will return the
mean of the group.by variable. As a default - all
visualizations of single-cell objects will use the cluster assignment or
active identity as a default for visualizations.
Most of the visualizations in escape have a defined set of parameters.
group.by
facet.by
scale
In addition, heatmapEnrichment()
allows for the
reclustering of rows and columns using Euclidean distance of the
enrichment scores and the Ward2 methods for clustering using
cluster.rows and cluster.columns.
heatmapEnrichment(sce.pbmc,
group.by = "ident",
assay = "escape.UCell",
scale = TRUE,
cluster.rows = TRUE,
cluster.columns = TRUE)
Each visualization has an additional argument called **palette that
supplies the coloring scheme to be used - available color palettes can
be viewed with hcl.pals()
.
## [1] "Pastel 1" "Dark 2" "Dark 3" "Set 2"
## [5] "Set 3" "Warm" "Cold" "Harmonic"
## [9] "Dynamic" "Grays" "Light Grays" "Blues 2"
## [13] "Blues 3" "Purples 2" "Purples 3" "Reds 2"
## [17] "Reds 3" "Greens 2" "Greens 3" "Oslo"
## [21] "Purple-Blue" "Red-Purple" "Red-Blue" "Purple-Orange"
## [25] "Purple-Yellow" "Blue-Yellow" "Green-Yellow" "Red-Yellow"
## [29] "Heat" "Heat 2" "Terrain" "Terrain 2"
## [33] "Viridis" "Plasma" "Inferno" "Rocket"
## [37] "Mako" "Dark Mint" "Mint" "BluGrn"
## [41] "Teal" "TealGrn" "Emrld" "BluYl"
## [45] "ag_GrnYl" "Peach" "PinkYl" "Burg"
## [49] "BurgYl" "RedOr" "OrYel" "Purp"
## [53] "PurpOr" "Sunset" "Magenta" "SunsetDark"
## [57] "ag_Sunset" "BrwnYl" "YlOrRd" "YlOrBr"
## [61] "OrRd" "Oranges" "YlGn" "YlGnBu"
## [65] "Reds" "RdPu" "PuRd" "Purples"
## [69] "PuBuGn" "PuBu" "Greens" "BuGn"
## [73] "GnBu" "BuPu" "Blues" "Lajolla"
## [77] "Turku" "Hawaii" "Batlow" "Blue-Red"
## [81] "Blue-Red 2" "Blue-Red 3" "Red-Green" "Purple-Green"
## [85] "Purple-Brown" "Green-Brown" "Blue-Yellow 2" "Blue-Yellow 3"
## [89] "Green-Orange" "Cyan-Magenta" "Tropic" "Broc"
## [93] "Cork" "Vik" "Berlin" "Lisbon"
## [97] "Tofino" "ArmyRose" "Earth" "Fall"
## [101] "Geyser" "TealRose" "Temps" "PuOr"
## [105] "RdBu" "RdGy" "PiYG" "PRGn"
## [109] "BrBG" "RdYlBu" "RdYlGn" "Spectral"
## [113] "Zissou 1" "Cividis" "Roma"
Alternatively, we can add an additional layer to the ggplot object
that is returned by the visualizations using something like
scale_fill_gradientn()
for continuous values or
scale_fill_manual()
for the categorical variables.
We can also focus on individual gene sets - one approach is to use
geyserEnrichment()
. Here individual cells are plotted along
the Y-axis with graphical summary where the central dot refers to the
median enrichment value and the thicker/thinner lines demonstrate the
interval summaries referring to the 66% and 95%.
geyserEnrichment(pbmc_small,
assay = "escape.ssGSEA",
gene.set = "HALLMARK-INTERFERON-GAMMA-RESPONSE")
To show the additional parameters that appear in visualizations of individual enrichment gene sets - we can reorder the groups by the mean of the gene set using order.by = “mean”.
geyserEnrichment(pbmc_small,
assay = "escape.ssGSEA",
gene.set = "HALLMARK-INTERFERON-GAMMA-RESPONSE",
order.by = "mean")
What if we had 2 separate samples or groups within the data? Another parameter we can use is facet.by to allow for direct visualization of an additional variable.
geyserEnrichment(pbmc_small,
assay = "escape.ssGSEA",
gene.set = "HALLMARK-INTERFERON-GAMMA-RESPONSE",
facet.by = "groups")
Lastly, we can select the way the color is applied to the plot using the color.by parameter. Here we can set it to the gene set of interest “HALLMARK-INTERFERON-GAMMA-RESPONSE”.
Similar to the geyserEnrichment()
the
ridgeEnrichment()
can display the distribution of
enrichment values across the selected gene set. The central line is at
the median value for the respective grouping.
We can get the relative position of individual cells along the x-axis using the add.rug parameter.
Another distribution visualization is a violin plot, which we
separate and directly compare using a binary classification. Like
ridgeEnrichment()
, this allows for greater use of
categorical variables. For splitEnrichment()
, the output
will be two halves of a violin plot based on the
split.by parameter with a central boxplot with the
relative distribution across all samples.
densityEnrichment()
is a method to visualize the mean
rank position of the gene set features along the total feature space by
group. This is similar to traditional GSEA analysis, but is not
calculating the walk-based enrichment score.
gene.set.use
gene.sets
It may be advantageous to look at the distribution of multiple gene
sets - here we can use scatterEnrichment()
for a 2 gene set
comparison. The color values are based on the density of points
determined by the number of neighbors, similar to the Nebulosa
R package. We just need to define which gene set to plot on the
x.axis and which to plot on the
y.axis.
scatterEnrichment(pbmc_small,
assay = "escape.ssGSEA",
x.axis = "HALLMARK-INTERFERON-GAMMA-RESPONSE",
y.axis = "HALLMARK-IL6-JAK-STAT3-SIGNALING")
The scatter plot can also be converted into a hexbin, another method for summarizing the individual cell distributions along the x and y axis, by setting style = “hex”.
scatterEnrichment(sce.pbmc,
assay = "escape.UCell",
x.axis = "HALLMARK-INTERFERON-GAMMA-RESPONSE",
y.axis = "HALLMARK-IL6-JAK-STAT3-SIGNALING",
style = "hex")
escape has its own PCA function performPCA()
which will
work on a single-cell object or a matrix of enrichment values. This is
specifically useful for downstream visualizations as it stores the
eigenvalues and rotations. If we want to look at the relative
contribution to overall variance of each component or a Biplot-like
overlay of the individual features, use performPCA()
.
Alternatively, other PCA-based functions like Seurat’s
RunPCA()
or scater’s ``runPCA()
can be used.
These functions are likely faster and would be ideal if we have a larger
number of cells and/or gene sets.
escape has a built in method for plotting PCA
pcaEnrichment()
that functions similarly to the
scatterEnrichment()
function where x.axis
and y.axis are the components to plot.
pcaEnrichment()
can plot additional information on the
principal component analysis.
add.percent.contribution will add the relative percent contribution of the x and y.axis to total variability observed in the PCA.
display.factors will overlay the magnitude and direction that the features/gene sets contribute to the selected components. The number of gene sets is determined by number.of.factors. This can assist in understanding the underlying differences in enrichment across different cells.
Differential enrichment analysis can be performed similar to differential gene expression analysis. For the purposes of finding the differential enrichment values, we can first normalize the enrichment values for the ssGSEA calculations. Notice here, we are using make.positive = TRUE in order to adjust any negative values. This is a particular issue when it comes to ssGSEA and GSVA enrichment scores.
pbmc_small <- performNormalization(pbmc_small,
assay = "escape.ssGSEA",
gene.sets = GS.hallmark,
make.positive = TRUE)
## [1] "Calculating features per cell..."
## [1] "Normalizing enrichment scores per cell..."
all.markers <- FindAllMarkers(pbmc_small,
assay = "escape.ssGSEA_normalized",
min.pct = 0,
logfc.threshold = 0)
head(all.markers)
## p_val avg_log2FC pct.1 pct.2
## HALLMARK-INTERFERON-GAMMA-RESPONSE 4.061261e-06 -1.5124819 0.972 1.000
## HALLMARK-ALLOGRAFT-REJECTION 6.519652e-04 2.3948709 1.000 0.977
## HALLMARK-IL2-STAT5-SIGNALING 5.934221e-03 3.2347238 1.000 0.977
## HALLMARK-ALLOGRAFT-REJECTION.1 2.932830e-07 -4.4124278 0.960 1.000
## HALLMARK-APOPTOSIS 2.932830e-07 2.0643196 1.000 0.982
## HALLMARK-P53-PATHWAY 9.505692e-05 0.8890429 1.000 0.982
## p_val_adj cluster
## HALLMARK-INTERFERON-GAMMA-RESPONSE 6.904145e-05 0
## HALLMARK-ALLOGRAFT-REJECTION 1.108341e-02 0
## HALLMARK-IL2-STAT5-SIGNALING 1.008818e-01 0
## HALLMARK-ALLOGRAFT-REJECTION.1 4.985811e-06 1
## HALLMARK-APOPTOSIS 4.985811e-06 1
## HALLMARK-P53-PATHWAY 1.615968e-03 1
## gene
## HALLMARK-INTERFERON-GAMMA-RESPONSE HALLMARK-INTERFERON-GAMMA-RESPONSE
## HALLMARK-ALLOGRAFT-REJECTION HALLMARK-ALLOGRAFT-REJECTION
## HALLMARK-IL2-STAT5-SIGNALING HALLMARK-IL2-STAT5-SIGNALING
## HALLMARK-ALLOGRAFT-REJECTION.1 HALLMARK-ALLOGRAFT-REJECTION
## HALLMARK-APOPTOSIS HALLMARK-APOPTOSIS
## HALLMARK-P53-PATHWAY HALLMARK-P53-PATHWAY
If you have any questions, comments or suggestions, feel free to visit the github repository or email me.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.5.1 RColorBrewer_1.1-3
## [3] Seurat_5.1.0 SeuratObject_5.0.2
## [5] sp_2.1-4 scran_1.33.2
## [7] scuttle_1.15.5 SingleCellExperiment_1.27.2
## [9] SummarizedExperiment_1.35.5 Biobase_2.67.0
## [11] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
## [13] IRanges_2.39.2 S4Vectors_0.43.2
## [15] BiocGenerics_0.53.0 MatrixGenerics_1.17.1
## [17] matrixStats_1.4.1 escape_2.3.0
## [19] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] GSVA_1.99.0 spatstat.sparse_3.1-0
## [3] httr_1.4.7 tools_4.4.1
## [5] sctransform_0.4.1 utf8_1.2.4
## [7] R6_2.5.1 HDF5Array_1.33.8
## [9] lazyeval_0.2.2 uwot_0.2.2
## [11] ggdist_3.3.2 rhdf5filters_1.17.0
## [13] withr_3.0.2 gridExtra_2.3
## [15] progressr_0.15.0 cli_3.6.3
## [17] spatstat.explore_3.3-3 fastDummies_1.7.4
## [19] labeling_0.4.3 sass_0.4.9
## [21] spatstat.data_3.1-2 ggridges_0.5.6
## [23] pbapply_1.7-2 R.utils_2.12.3
## [25] parallelly_1.38.0 limma_3.61.12
## [27] RSQLite_2.3.7 generics_0.1.3
## [29] ica_1.0-3 spatstat.random_3.3-2
## [31] dplyr_1.1.4 distributional_0.5.0
## [33] Matrix_1.7-1 fansi_1.0.6
## [35] abind_1.4-8 R.methodsS3_1.8.2
## [37] lifecycle_1.0.4 yaml_2.3.10
## [39] edgeR_4.3.21 rhdf5_2.49.0
## [41] SparseArray_1.5.45 Rtsne_0.17
## [43] grid_4.4.1 blob_1.2.4
## [45] promises_1.3.0 dqrng_0.4.1
## [47] crayon_1.5.3 miniUI_0.1.1.1
## [49] lattice_0.22-6 beachmat_2.23.0
## [51] msigdbr_7.5.1 cowplot_1.1.3
## [53] annotate_1.85.0 KEGGREST_1.45.1
## [55] magick_2.8.5 sys_3.4.3
## [57] maketools_1.3.1 pillar_1.9.0
## [59] knitr_1.48 metapod_1.13.0
## [61] rjson_0.2.23 future.apply_1.11.3
## [63] codetools_0.2-20 leiden_0.4.3.1
## [65] glue_1.8.0 spatstat.univar_3.0-1
## [67] data.table_1.16.2 vctrs_0.6.5
## [69] png_0.1-8 spam_2.11-0
## [71] gtable_0.3.6 cachem_1.1.0
## [73] xfun_0.48 S4Arrays_1.5.11
## [75] mime_0.12 survival_3.7-0
## [77] statmod_1.5.0 bluster_1.17.0
## [79] fitdistrplus_1.2-1 ROCR_1.0-11
## [81] nlme_3.1-166 bit64_4.5.2
## [83] RcppAnnoy_0.0.22 bslib_0.8.0
## [85] irlba_2.3.5.1 KernSmooth_2.23-24
## [87] colorspace_2.1-1 DBI_1.2.3
## [89] UCell_2.9.0 tidyselect_1.2.1
## [91] bit_4.5.0 compiler_4.4.1
## [93] AUCell_1.29.0 graph_1.83.0
## [95] BiocNeighbors_2.1.0 DelayedArray_0.33.1
## [97] plotly_4.10.4 scales_1.3.0
## [99] hexbin_1.28.4 lmtest_0.9-40
## [101] stringr_1.5.1 SpatialExperiment_1.15.1
## [103] digest_0.6.37 goftest_1.2-3
## [105] spatstat.utils_3.1-0 rmarkdown_2.28
## [107] XVector_0.45.0 htmltools_0.5.8.1
## [109] pkgconfig_2.0.3 sparseMatrixStats_1.17.2
## [111] highr_0.11 fastmap_1.2.0
## [113] rlang_1.1.4 htmlwidgets_1.6.4
## [115] UCSC.utils_1.1.0 shiny_1.9.1
## [117] DelayedMatrixStats_1.29.0 farver_2.1.2
## [119] jquerylib_0.1.4 zoo_1.8-12
## [121] jsonlite_1.8.9 BiocParallel_1.41.0
## [123] R.oo_1.26.0 BiocSingular_1.23.0
## [125] magrittr_2.0.3 GenomeInfoDbData_1.2.13
## [127] dotCall64_1.2 patchwork_1.3.0
## [129] Rhdf5lib_1.27.0 munsell_0.5.1
## [131] Rcpp_1.0.13 babelgene_22.9
## [133] reticulate_1.39.0 stringi_1.8.4
## [135] zlibbioc_1.51.2 MASS_7.3-61
## [137] plyr_1.8.9 parallel_4.4.1
## [139] listenv_0.9.1 ggrepel_0.9.6
## [141] deldir_2.0-4 Biostrings_2.75.0
## [143] splines_4.4.1 tensor_1.5
## [145] locfit_1.5-9.10 igraph_2.1.1
## [147] spatstat.geom_3.3-3 RcppHNSW_0.6.0
## [149] buildtools_1.0.0 reshape2_1.4.4
## [151] ScaledMatrix_1.13.0 XML_3.99-0.17
## [153] evaluate_1.0.1 BiocManager_1.30.25
## [155] httpuv_1.6.15 RANN_2.6.2
## [157] tidyr_1.3.1 purrr_1.0.2
## [159] polyclip_1.10-7 future_1.34.0
## [161] scattermore_1.2 rsvd_1.0.5
## [163] xtable_1.8-4 RSpectra_0.16-2
## [165] later_1.3.2 viridisLite_0.4.2
## [167] ggpointdensity_0.1.0 tibble_3.2.1
## [169] memoise_2.0.1 AnnotationDbi_1.69.0
## [171] cluster_2.1.6 globals_0.16.3
## [173] GSEABase_1.67.1