pariedGSEA
is a user-friendly framework for paired
differential gene expression and splicing analyses. Providing bulk
RNA-seq count data, pariedGSEA
combines the results of
DESeq2
(Love, Huber, and Anders 2014) and DEXSeq
(Anders, Reyes, and Huber 2012),
aggregates the p-values to gene level and allows you to run a subsequent
gene set over-representation analysis using fgsea’s
fora
function (Korotkevich, Sukhov,
and Sergushichev 2019). Since version 0.99.2, you can also do the
differential analyses using limma .
pairedGSEA
was developed to highlight the importance of
differential splicing analysis. It was build in a way that yields
comparable results between splicing and expression-related events. It,
by default, accounts for surrogate variables in the data, and
facilitates exploratory data analysis either through storing
intermediate results or through plotting functions for the
over-representation analysis.
This vignette will guide you through how to use the main functions of
pairedGSEA
.
pariedGSEA
assumes you have already preprocessed and
aligned your sequencing reads to transcript level. Before starting, you
should therefore have a counts matrix and a metadata file. This data may
also be in the format of a SummarizedExperiment
(Morgan et al. 2022) or
DESeqDataSet
. Importantly, please ensure that the
rownames have the format: gene:transcript
.
The metadata should, as a minimum, contain the
sample IDs
corresponding to the column names of the count
matrix, a group
column containing information on which
samples corresponds to the baseline (controls) and the case (condition).
Bear in mind, the column names can be as you wish, but the names must be
provided in the sample_col
and group_col
parameters, respectively.
To see an example of what such data could look like, please see
?pairedGSEA::example_se
.
To install this package, start R (version “4.3”) and enter:
Bioconductor dependencies
# Install Bioconductor dependencies
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"SummarizedExperiment", "S4Vectors", "DESeq2", "DEXSeq",
"fgsea", "sva", "BiocParallel"))
Install pairedGSEA
from Bioconductor
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("pairedGSEA")
Install development version from GitHub
Running a paired Differential Gene Expression (DGE) and Differential
Gene Splicing (DGS) analysis is the first step in the
pairedGSEA
workflow.
But before running the paired_diff
function, we
recommend storing the experiment parameters in a set of variables at the
top of your script for future reference and easy access:
library("pairedGSEA")
# Defining plotting theme
ggplot2::theme_set(ggplot2::theme_classic(base_size = 20))
## Load data
# In this vignette we will use the included example Summarized Experiment.
# See ?example_se for more information about the data.
data("example_se")
if(FALSE){ # Examples of other data imports
# Example of count matrix
tx_count <- read.csv("path/to/counts.csv") # Or other load function
md_file <- "path/to/metadata.xlsx" # Can also be a .csv file or a data.frame
# Example of locally stored DESeqDataSet
dds <- readRDS("path/to/dds.RDS")
# Example of locally stored SummarizedExperiment
se <- readRDS("path/to/se.RDS")
}
## Experiment parameters
group_col <- "group_nr" # Column with the groups you would like to compare
sample_col <- "id" # Name of column that specifies the sample id of each sample.
# This is used to ensure the metadata and count data contains the same samples
# and to arrange the data according to the metadata
# (important for underlying tools)
baseline <- 1 # Name of baseline group
case <- 2 # Name of case group
experiment_title <- "TGFb treatment for 12h" # Name of your experiment. This is
# used in the file names that are stored if store_results is TRUE (recommended)
# Check if parameters above fit with metadaata
SummarizedExperiment::colData(example_se)
#> DataFrame with 6 rows and 5 columns
#> study id source final_description
#> <character> <character> <character> <character>
#> GSM1499784 GSE61220 GSM1499784 small airway epithel.. Epi,Control
#> GSM1499785 GSE61220 GSM1499785 small airway epithel.. Epi,Control
#> GSM1499786 GSE61220 GSM1499786 small airway epithel.. Epi,Control
#> GSM1499790 GSE61220 GSM1499790 small airway epithel.. Epi,TNF,12hr
#> GSM1499791 GSE61220 GSM1499791 small airway epithel.. Epi,TNF,12hr
#> GSM1499792 GSE61220 GSM1499792 small airway epithel.. Epi,TNF,12hr
#> group_nr
#> <factor>
#> GSM1499784 1
#> GSM1499785 1
#> GSM1499786 1
#> GSM1499790 2
#> GSM1499791 2
#> GSM1499792 2
The paired DGE/DGS analysis is run with paired_diff()
.
paired_diff()
is essentially a wrapper function around
DESeq2::DESeq
(Love, Huber, and
Anders 2014) and DEXSeq::DEXSeq
(Anders, Reyes, and Huber 2012), the latter
takes in the ballpark of 20-30 minutes to run depending on the size of
the data and computational resources. Please visit their individual
vignettes for further information.
set.seed(500) # For reproducible results
diff_results <- paired_diff(
object = example_se,
metadata = NULL, # Use with count matrix or if you want to change it in
# the input object
group_col = group_col,
sample_col = sample_col,
baseline = baseline,
case = case,
experiment_title = experiment_title,
store_results = FALSE # Set to TRUE (recommended) if you want
# to store intermediate results, such as the results on transcript level
)
#> Running TGFb treatment for 12h
#> Preparing metadata
#>
#> Removing 0 rows with a summed count lower than 10
#> Removing 108 rows with counts in less than 2 samples.
#> Running SVA
#> No significant surrogate variables
#>
#> Found 0 surrogate variables
#> Running DESeq2
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Extracting results
#> Initiating DEXSeq
#> Creating DEXSeqDataSet
#> converting counts to integer mode
#> Warning in DESeqDataSet(rse, design, ignoreRank = TRUE): some variables in
#> design formula are characters, converting to factors
#>
#> Running DEXSeq -- This might take a while
#> TGFb treatment for 12h is analysed.
#> Aggregating p values
#> Done.
After running the analyses, paired_diff
will aggregate
the p-values to gene level using lancaster aggregation
(Lancaster 1961) and calculate the
FDR-adjusted p-values (see ?pairedGSEA:::aggregate_pvalue
for more information). For the DGE transcripts, the log2 fold changes
will be aggregated using a weighted mean, whereas the DGS log2 fold
changes will be assigned to the log2 fold change of the transcript with
the lowest p-value. Use the latter with a grain of salt.
From here, feel free to analyse the gene-level results using your
preferred method. If you set store_results = TRUE
, you
could extract the transcript level results found in the
results
folder under the names “*_expression_results.RDS”
and “*_splicing_results.RDS” for the DGE and DGS analysis, respectively
(The * corresponds to the provided experiment title).
There are a range of other parameters you can play with to tailor the
experience. Here, the default settings are showed. See
?pairedGSEA::paired_diff
for further details.
covariates = NULL,
run_sva = TRUE,
use_limma = FALSE,
prefilter = 10,
fit_type = "local",
test = "LRT",
quiet = FALSE,
parallel = TRUE,
BPPARAM = BiocParallel::bpparam(),
...
To highlight some examples of use:
limma::eBayes
and
limma::diffSplice
for the analyses with
use_limma = TRUE
.covariates
to a character vector of
the specific names. This will be used in SVA, DGE, and DGS....
parameters will be fed to
DESeq2::DESeq
, see their manual for options.pairedGSEA
comes with a wrapper function for
fgsea::fora
(Korotkevich, Sukhov,
and Sergushichev 2019). If you wish, feel free to use that
directly or any other gene set analysis method - investigate the
diff_results
object before use to see what information it
contains.
The inbuilt wrapper also computes the relative risk for each gene set
and an enrichment score (log2(relative_risk + 0.06)
, the
pseudo count is for plotting purposes).
Before you get going, you will need a list of gene sets (aka.
pathways) according to the species you are working with and the category
of gene sets of interest. For this purpose, feel free to use the msigdbr (Dolgalev 2022) wrapper function in
pairedGSEA
: pairedGSEA::prepare_msigdb
. If you
do, see ?pairedGSEA::prepare_msigdb
for further
details.
The inbuilt ORA function is called paired_ora
and is run
as follows
# Define gene sets in your preferred way
gene_sets <- pairedGSEA::prepare_msigdb(
species = "Homo sapiens",
category = "C5",
gene_id_type = "ensembl_gene"
)
ora <- paired_ora(
paired_diff_result = diff_results,
gene_sets = gene_sets,
experiment_title = NULL # experiment_title - if not NULL,
# the results will be stored in an RDS object in the 'results' folder
)
#> Running over-representation analyses
#> Joining result
There are many options for investigating your ORA results.
pairedGSEA
comes with an inbuilt scatter plot function that
plots the enrichment score of DGE against those of DGS.
The function allows you to interactively look at the placement of the significant pathways using plotly (Sievert 2020). You can color specific points based on a regular expression for gene sets of interest.
# Scatter plot function with default settings
plot_ora(
ora,
plotly = FALSE,
pattern = "Telomer", # Identify all gene sets about telomeres
cutoff = 0.05, # Only include significant gene sets
lines = TRUE # Guide lines
)
As mentioned, this function can be utilized in a few different ways.
The default settings will plot the enrichment scores of each analysis
and draw dashed lines for the y = x
line,
y = 0
, and x = 0
. Remove those by setting
lines = FALSE
.
Make the plot interactive with plotly = TRUE
.
Highlight gene sets containing a specific regex pattern by setting
pattern
to the regex pattern of interest.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pairedGSEA_1.7.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] httr2_1.0.7 biomaRt_2.63.0
#> [5] rlang_1.1.4 magrittr_2.0.3
#> [7] msigdbr_7.5.1 aggregation_1.0.1
#> [9] matrixStats_1.4.1 compiler_4.4.2
#> [11] RSQLite_2.3.9 mgcv_1.9-1
#> [13] png_0.1-8 vctrs_0.6.5
#> [15] sva_3.55.0 stringr_1.5.1
#> [17] pkgconfig_2.0.3 crayon_1.5.3
#> [19] fastmap_1.2.0 dbplyr_2.5.0
#> [21] XVector_0.47.0 labeling_0.4.3
#> [23] Rsamtools_2.23.1 rmarkdown_2.29
#> [25] UCSC.utils_1.3.0 bit_4.5.0.1
#> [27] xfun_0.49 zlibbioc_1.52.0
#> [29] cachem_1.1.0 GenomeInfoDb_1.43.2
#> [31] jsonlite_1.8.9 progress_1.2.3
#> [33] blob_1.2.4 DelayedArray_0.33.3
#> [35] BiocParallel_1.41.0 parallel_4.4.2
#> [37] prettyunits_1.2.0 R6_2.5.1
#> [39] bslib_0.8.0 stringi_1.8.4
#> [41] RColorBrewer_1.1-3 limma_3.63.2
#> [43] genefilter_1.89.0 GenomicRanges_1.59.1
#> [45] jquerylib_0.1.4 Rcpp_1.0.13-1
#> [47] SummarizedExperiment_1.37.0 knitr_1.49
#> [49] IRanges_2.41.2 Matrix_1.7-1
#> [51] splines_4.4.2 tidyselect_1.2.1
#> [53] abind_1.4-8 yaml_2.3.10
#> [55] codetools_0.2-20 hwriter_1.3.2.1
#> [57] curl_6.0.1 lattice_0.22-6
#> [59] tibble_3.2.1 withr_3.0.2
#> [61] Biobase_2.67.0 KEGGREST_1.47.0
#> [63] evaluate_1.0.1 survival_3.8-3
#> [65] BiocFileCache_2.15.0 xml2_1.3.6
#> [67] Biostrings_2.75.3 pillar_1.10.0
#> [69] BiocManager_1.30.25 filelock_1.0.3
#> [71] MatrixGenerics_1.19.0 stats4_4.4.2
#> [73] generics_0.1.3 S4Vectors_0.45.2
#> [75] hms_1.1.3 ggplot2_3.5.1
#> [77] munsell_0.5.1 scales_1.3.0
#> [79] xtable_1.8-4 glue_1.8.0
#> [81] maketools_1.3.1 tools_4.4.2
#> [83] data.table_1.16.4 sys_3.4.3
#> [85] fgsea_1.33.0 annotate_1.85.0
#> [87] locfit_1.5-9.10 babelgene_22.9
#> [89] buildtools_1.0.0 XML_3.99-0.17
#> [91] fastmatch_1.1-4 cowplot_1.1.3
#> [93] grid_4.4.2 DEXSeq_1.53.0
#> [95] edgeR_4.5.1 AnnotationDbi_1.69.0
#> [97] colorspace_2.1-1 nlme_3.1-166
#> [99] GenomeInfoDbData_1.2.13 cli_3.6.3
#> [101] rappdirs_0.3.3 S4Arrays_1.7.1
#> [103] dplyr_1.1.4 gtable_0.3.6
#> [105] DESeq2_1.47.1 sass_0.4.9
#> [107] digest_0.6.37 BiocGenerics_0.53.3
#> [109] SparseArray_1.7.2 farver_2.1.2
#> [111] geneplotter_1.85.0 memoise_2.0.1
#> [113] htmltools_0.5.8.1 lifecycle_1.0.4
#> [115] httr_1.4.7 statmod_1.5.0
#> [117] bit64_4.5.2