The ReactomeGSA package is a client to the web-based Reactome Analysis System. Essentially, it performs a gene set analysis using the latest version of the Reactome pathway database as a backend.
This vignette shows how the ReactomeGSA package can be used to perform a pathway analysis of cell clusters in single-cell RNA-sequencing data.
To cite this package, use
Griss J. ReactomeGSA, https://github.com/reactome/ReactomeGSA (2019)
The ReactomeGSA
package can be directly installed from
Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require(ReactomeGSA))
BiocManager::install("ReactomeGSA")
# install the ReactomeGSA.data package for the example data
if (!require(ReactomeGSA.data))
BiocManager::install("ReactomeGSA.data")
For more information, see https://bioconductor.org/install/.
As an example we load single-cell RNA-sequencing data of B cells extracted from the dataset published by Jerby-Arnon et al. (Cell, 2018).
Note: This is not a complete Seurat object. To decrease the size, the object only contains gene expression values and cluster annotations.
library(ReactomeGSA.data)
#> Loading required package: limma
#> Loading required package: edgeR
#> Loading required package: ReactomeGSA
#> Loading required package: Seurat
#> Loading required package: SeuratObject
#> Loading required package: sp
#> 'SeuratObject' was built under R 4.4.0 but the current version is
#> 4.4.2; it is recomended that you reinstall 'SeuratObject' as the ABI
#> for R may have changed
#>
#> Attaching package: 'SeuratObject'
#> The following objects are masked from 'package:base':
#>
#> intersect, t
data(jerby_b_cells)
jerby_b_cells
#> An object of class Seurat
#> 23686 features across 920 samples within 1 assay
#> Active assay: RNA (23686 features, 0 variable features)
#> 2 layers present: counts, data
There are two methods to analyse scRNA-seq data using ReactomeGSA:
ReactomeGSA can generate pseudo-bulk data from scRNA-seq data and then analyse this data using the classical quantitative pathway analysis algorithms. Thereby, it is possible to directly compare, f.e. two cell types with each other or two treatment approaches. The result is a classical pathway analysis result with p-values and fold-changes attributed to each pathway.
The analyse_sc_clusters
function offers a second
approach using the gene set variation algorithm ssGSEA
to
derive pathway-level quantitative values for each cluster or cell type
in the dataset. This is helpful to visualize the “expression level” of
pathways accross the dataset. Statistical analyses have to be performed
separately.
The pathway analysis is at the very end of a scRNA-seq workflow. This means, that any Q/C was already performed, the data was normalized and cells were already clustered.
In this example we are going to compare Cluster 1
against Cluster 2
.
# store the Idents as a meta.data field - this was
# removed while compressing the object as an example
jerby_b_cells$clusters <- Idents(jerby_b_cells)
table(jerby_b_cells$clusters)
#>
#> Cluster 4 Cluster 8 Cluster 1 Cluster 11 Cluster 3 Cluster 6 Cluster 2
#> 106 54 140 31 109 85 114
#> Cluster 7 Cluster 9 Cluster 13 Cluster 5 Cluster 12 Cluster 10
#> 55 47 24 90 25 40
As a next step, we need to create the pseudo-bulk data for the
analysis. This is achieved through the
generate_pseudo_bulk_data
function.
library(ReactomeGSA)
# This creates a pseudo-bulk object by splitting each cluster in 4
# random bulks of data. This approach can be changed through the
# split_by and k_variable parameter.
pseudo_bulk_data <- generate_pseudo_bulk_data(jerby_b_cells, group_by = "clusters")
#> Centering and scaling data matrix
# we can immediately create the metadata data.frame for this data
pseudo_metadata <- generate_metadata(pseudo_bulk_data)
This pseudo-bulk data is directly compatible with the existing algorithms for quantitative pathway analysis and can be processed using the respective ReactomeGSA methods.
# Create a new request object using 'Camera' for the gene set analysis
my_request <- ReactomeAnalysisRequest(method = "Camera")
# set the maximum number of allowed missing values to 50%
my_request <- set_parameters(request = my_request, max_missing_values = 0.5)
# add the pseudo-bulk data as a dataset
my_request <- add_dataset(request = my_request,
expression_values = pseudo_bulk_data,
sample_data = pseudo_metadata,
name = "Pseudo-bulk",
type = "rnaseq_counts",
comparison_factor = "Group",
comparison_group_1 = "Cluster 1",
comparison_group_2 = "Cluster 2")
#> Converting expression data to string... (This may take a moment)
#> Conversion complete
my_request
#> ReactomeAnalysisRequestObject
#> Method = Camera
#> Parameters:
#> - max_missing_values: 0.5
#> Datasets:
#> - Pseudo-bulk (rnaseq_counts)
#> No parameters set.
#> ReactomeAnalysisRequest
This request object can be directly submitted to the ReactomeGSA analysis.
quant_result <- perform_reactome_analysis(my_request, compress = FALSE)
#> Submitting request to Reactome API...
#> Reactome Analysis submitted succesfully
#> Converting dataset Pseudo-bulk...
#> Mapping identifiers...
#> Performing gene set analysis using Camera
#> Analysing dataset 'Pseudo-bulk' using Camera
#> Creating REACTOME visualization
#> Retrieving result...
quant_result
#> ReactomeAnalysisResult object
#> Reactome Release: 90
#> Results:
#> - Pseudo-bulk:
#> 2573 pathways
#> 10419 fold changes for genes
#> Reactome visualizations:
#> - Gene Set Analysis Summary
#> ReactomeAnalysisResult
This object can be used in the same fashion as any ReactomeGSA result object.
# get the pathway-level results
quant_pathways <- pathways(quant_result)
head(quant_pathways)
#> Name
#> R-HSA-156842 Eukaryotic Translation Elongation
#> R-HSA-192823 Viral mRNA Translation
#> R-HSA-156902 Peptide chain elongation
#> R-HSA-72764 Eukaryotic Translation Termination
#> R-HSA-975956 Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC)
#> R-HSA-2408557 Selenocysteine synthesis
#> Direction.Pseudo-bulk FDR.Pseudo-bulk PValue.Pseudo-bulk
#> R-HSA-156842 Up 2.777595e-23 1.403481e-26
#> R-HSA-192823 Up 2.777595e-23 2.159032e-26
#> R-HSA-156902 Up 9.060867e-23 1.056456e-25
#> R-HSA-72764 Up 2.062321e-21 3.206096e-24
#> R-HSA-975956 Up 5.617992e-21 1.091720e-23
#> R-HSA-2408557 Up 5.686902e-21 1.326133e-23
#> NGenes.Pseudo-bulk av_foldchange.Pseudo-bulk sig.Pseudo-bulk
#> R-HSA-156842 81 0.3500341 TRUE
#> R-HSA-192823 78 0.3725893 TRUE
#> R-HSA-156902 78 0.3481714 TRUE
#> R-HSA-72764 82 0.3155141 TRUE
#> R-HSA-975956 84 0.3166279 TRUE
#> R-HSA-2408557 81 0.2437127 TRUE
# get the top pathways to label them
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
top_pathways <- quant_pathways %>%
tibble::rownames_to_column(var = "Id") %>%
filter(`FDR.Pseudo-bulk` < 0.001) %>%
arrange(desc(`av_foldchange.Pseudo-bulk`))
# limit to a few pathway for aesthetic reasons
top_pathways <- top_pathways[c(1,5,6), ]
# create a simple volcano plot of the pathway results
library(ggplot2)
ggplot(quant_pathways,
aes(x = `av_foldchange.Pseudo-bulk`,
y = -log10(`FDR.Pseudo-bulk`))) +
geom_vline(xintercept = c(log2(0.5), log2(2)), colour="grey", linetype = "longdash") +
geom_hline(yintercept = -log10(0.05), colour="grey", linetype = "longdash") +
geom_point() +
geom_label(data = top_pathways, aes(label = Name), nudge_y = 1, check_overlap = TRUE)
#> Warning in geom_label(data = top_pathways, aes(label = Name), nudge_y = 1, :
#> Ignoring unknown parameters: `check_overlap`
The ReactomeGSA package can now be used to get pathway-level expression values for every cell cluster. This is achieved by calculating the mean gene expression for every cluster and then submitting this data to a gene set variation analysis.
All of this is wrapped in the single analyse_sc_clusters
function.
gsva_result <- analyse_sc_clusters(jerby_b_cells, verbose = TRUE)
#> Calculating average cluster expression...
#> Converting expression data to string... (This may take a moment)
#> Conversion complete
#> Submitting request to Reactome API...
#> Compressing request data...
#> Reactome Analysis submitted succesfully
#> Converting dataset Seurat...
#> Mapping identifiers...
#> Performing gene set analysis using ssGSEA
#> Analysing dataset 'Seurat' using ssGSEA
#> Retrieving result...
The resulting object is a standard
ReactomeAnalysisResult
object.
gsva_result
#> ReactomeAnalysisResult object
#> Reactome Release: 90
#> Results:
#> - Seurat:
#> 1741 pathways
#> 11759 fold changes for genes
#> No Reactome visualizations available
#> ReactomeAnalysisResult
pathways
returns the pathway-level expression values per
cell cluster:
pathway_expression <- pathways(gsva_result)
# simplify the column names by removing the default dataset identifier
colnames(pathway_expression) <- gsub("\\.Seurat", "", colnames(pathway_expression))
pathway_expression[1:3,]
#> Name Cluster_1 Cluster_10 Cluster_11
#> R-HSA-1059683 Interleukin-6 signaling -0.03556065 -0.04521415 0.14931444
#> R-HSA-109703 PKB-mediated events 0.32848879 -0.19771237 0.05087568
#> R-HSA-109704 PI3K Cascade -0.12080150 -0.13450596 0.16088061
#> Cluster_12 Cluster_13 Cluster_2 Cluster_3 Cluster_4
#> R-HSA-1059683 0.06075330 0.0004459001 0.14896349 -0.09445591 -0.11884117
#> R-HSA-109703 0.09556102 0.0274677841 -0.05142555 0.15462891 -0.16856699
#> R-HSA-109704 -0.04114424 0.0680222582 0.05924518 0.05295837 -0.02208554
#> Cluster_5 Cluster_6 Cluster_7 Cluster_8 Cluster_9
#> R-HSA-1059683 -0.13882420 -0.07577576 -0.06560393 0.174814091 -0.04388071
#> R-HSA-109703 0.10554980 -0.06251424 -0.30506281 -0.007670405 -0.09929333
#> R-HSA-109704 0.02008053 -0.12603775 -0.01371145 0.101810119 -0.04142070
A simple approach to find the most relevant pathways is to assess the maximum difference in expression for every pathway:
# find the maximum differently expressed pathway
max_difference <- do.call(rbind, apply(pathway_expression, 1, function(row) {
values <- as.numeric(row[2:length(row)])
return(data.frame(name = row[1], min = min(values), max = max(values)))
}))
max_difference$diff <- max_difference$max - max_difference$min
# sort based on the difference
max_difference <- max_difference[order(max_difference$diff, decreasing = T), ]
head(max_difference)
#> name
#> R-HSA-140180 COX reactions
#> R-HSA-1296067 Potassium transport channels
#> R-HSA-392023 Adrenaline signalling through Alpha-2 adrenergic receptor
#> R-HSA-8964540 Alanine metabolism
#> R-HSA-3248023 Regulation by TREX1
#> R-HSA-350864 Regulation of thyroid hormone activity
#> min max diff
#> R-HSA-140180 -0.9647899 0.9841810 1.948971
#> R-HSA-1296067 -1.0000000 0.8858649 1.885865
#> R-HSA-392023 -0.9008335 0.9738051 1.874639
#> R-HSA-8964540 -0.8731077 0.9938765 1.866984
#> R-HSA-3248023 -0.9185236 0.9421670 1.860691
#> R-HSA-350864 -0.9134206 0.9394455 1.852866
The ReactomeGSA package contains two functions to visualize these pathway results. The first simply plots the expression for a selected pathway:
For a better overview, the expression of multiple pathways can be
shown as a heatmap using gplots
heatmap.2
function:
# Additional parameters are directly passed to gplots heatmap.2 function
plot_gsva_heatmap(gsva_result, max_pathways = 15, margins = c(6,20))
The plot_gsva_heatmap
function can also be used to only
display specific pahtways:
# limit to selected B cell related pathways
relevant_pathways <- c("R-HSA-983170", "R-HSA-388841", "R-HSA-2132295", "R-HSA-983705", "R-HSA-5690714")
plot_gsva_heatmap(gsva_result,
pathway_ids = relevant_pathways, # limit to these pathways
margins = c(6,30), # adapt the figure margins in heatmap.2
dendrogram = "col", # only plot column dendrogram
scale = "row", # scale for each pathway
key = FALSE, # don't display the color key
lwid=c(0.1,4)) # remove the white space on the left
#> Warning in plot_gsva_heatmap(gsva_result, pathway_ids = relevant_pathways, :
#> Warning: No results for the following pathways: R-HSA-983705
This analysis shows us that cluster 8 has a marked up-regulation of B Cell receptor signalling, which is linked to a co-stimulation of the CD28 family. Additionally, there is a gradient among the cluster with respect to genes releated to antigen presentation.
Therefore, we are able to further classify the observed B cell subtypes based on their pathway activity.
The pathway-level expression analysis can also be used to run a
Principal Component Analysis on the samples. This is simplified through
the function plot_gsva_pca
:
In this analysis, cluster 11 is a clear outlier from the other B cell subtypes and therefore might be prioritised for further evaluation.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 dplyr_1.1.4 tidyr_1.3.1
#> [4] ReactomeGSA.data_1.20.0 Seurat_5.1.0 SeuratObject_5.0.2
#> [7] sp_2.1-4 ReactomeGSA_1.21.2 edgeR_4.5.0
#> [10] limma_3.63.2 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_1.8.9
#> [4] magrittr_2.0.3 spatstat.utils_3.1-1 farver_2.1.2
#> [7] vctrs_0.6.5 ROCR_1.0-11 spatstat.explore_3.3-3
#> [10] progress_1.2.3 htmltools_0.5.8.1 curl_6.0.1
#> [13] sass_0.4.9 sctransform_0.4.1 parallelly_1.39.0
#> [16] KernSmooth_2.23-24 bslib_0.8.0 htmlwidgets_1.6.4
#> [19] ica_1.0-3 plyr_1.8.9 plotly_4.10.4
#> [22] zoo_1.8-12 cachem_1.1.0 buildtools_1.0.0
#> [25] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
#> [28] pkgconfig_2.0.3 Matrix_1.7-1 R6_2.5.1
#> [31] fastmap_1.2.0 fitdistrplus_1.2-1 future_1.34.0
#> [34] shiny_1.9.1 digest_0.6.37 colorspace_2.1-1
#> [37] patchwork_1.3.0 tensor_1.5 RSpectra_0.16-2
#> [40] irlba_2.3.5.1 labeling_0.4.3 progressr_0.15.1
#> [43] fansi_1.0.6 spatstat.sparse_3.1-0 httr_1.4.7
#> [46] polyclip_1.10-7 abind_1.4-8 compiler_4.4.2
#> [49] withr_3.0.2 fastDummies_1.7.4 gplots_3.2.0
#> [52] MASS_7.3-61 gtools_3.9.5 caTools_1.18.3
#> [55] tools_4.4.2 lmtest_0.9-40 httpuv_1.6.15
#> [58] future.apply_1.11.3 goftest_1.2-3 glue_1.8.0
#> [61] nlme_3.1-166 promises_1.3.2 grid_4.4.2
#> [64] Rtsne_0.17 cluster_2.1.6 reshape2_1.4.4
#> [67] generics_0.1.3 gtable_0.3.6 spatstat.data_3.1-4
#> [70] hms_1.1.3 data.table_1.16.2 utf8_1.2.4
#> [73] BiocGenerics_0.53.3 spatstat.geom_3.3-4 RcppAnnoy_0.0.22
#> [76] ggrepel_0.9.6 RANN_2.6.2 pillar_1.9.0
#> [79] stringr_1.5.1 spam_2.11-0 RcppHNSW_0.6.0
#> [82] later_1.4.1 splines_4.4.2 lattice_0.22-6
#> [85] survival_3.7-0 deldir_2.0-4 tidyselect_1.2.1
#> [88] locfit_1.5-9.10 maketools_1.3.1 miniUI_0.1.1.1
#> [91] pbapply_1.7-2 knitr_1.49 gridExtra_2.3
#> [94] scattermore_1.2 xfun_0.49 Biobase_2.67.0
#> [97] statmod_1.5.0 matrixStats_1.4.1 stringi_1.8.4
#> [100] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.1
#> [103] codetools_0.2-20 tibble_3.2.1 cli_3.6.3
#> [106] uwot_0.2.2 xtable_1.8-4 reticulate_1.40.0
#> [109] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.13-1
#> [112] globals_0.16.3 spatstat.random_3.3-2 png_0.1-8
#> [115] spatstat.univar_3.1-1 parallel_4.4.2 prettyunits_1.2.0
#> [118] dotCall64_1.2 bitops_1.0-9 listenv_0.9.1
#> [121] viridisLite_0.4.2 scales_1.3.0 ggridges_0.5.6
#> [124] crayon_1.5.3 leiden_0.4.3.1 purrr_1.0.2
#> [127] rlang_1.1.4 cowplot_1.1.3