| Title: | Automatic Generation of Gene Set Enrichment Analyses |
|---|---|
| Description: | Implements pipelines for generating gene set enrichment analysis reports in the augere framework. This includes various competitive and self-contained gene sets from a variety of Bioconductor packages. Each pipeline function generates a self-contained Rmarkdown report with all of the steps required to reproduce the gene set enrichment analysis. |
| Authors: | Aaron Lun [cre, aut] (ORCID: <https://orcid.org/0000-0002-3564-4813>) |
| Maintainer: | Aaron Lun <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.99.1 |
| Built: | 2026-05-15 06:37:14 UTC |
| Source: | https://github.com/bioc/augere.gsea |
Implements pipelines for generating gene set enrichment analysis reports in the augere framework. This includes various competitive and self-contained gene sets from a variety of Bioconductor packages. Each pipeline function generates a self-contained Rmarkdown report with all of the steps required to reproduce the gene set enrichment analysis.
Maintainer: Aaron Lun [email protected] (ORCID)
Useful links:
Report bugs at https://github.com/augere-bioinfo/augere.gsea/issues
Analyze a RNA-seq dataset for differential expression across a collection of gene sets.
This requires access to the original values, unlike runPrecomputed.
runContrast( x, sets, groups, comparison, covariates = NULL, block = NULL, subset.factor = NULL, subset.levels = NULL, subset.groups = TRUE, design = NULL, contrast = NULL, dc.block = NULL, robust = TRUE, quality = TRUE, trend = FALSE, methods = c("fry", "camera"), assay = 1, annotation = NULL, metadata = NULL, output.dir = "contrast", author = NULL, dry.run = FALSE, save.results = TRUE, suppress.plots = FALSE )runContrast( x, sets, groups, comparison, covariates = NULL, block = NULL, subset.factor = NULL, subset.levels = NULL, subset.groups = TRUE, design = NULL, contrast = NULL, dc.block = NULL, robust = TRUE, quality = TRUE, trend = FALSE, methods = c("fry", "camera"), assay = 1, annotation = NULL, metadata = NULL, output.dir = "contrast", author = NULL, dry.run = FALSE, save.results = TRUE, suppress.plots = FALSE )
x |
A SummarizedExperiment object containing a count matrix where genes and samples are in rows and columns, respectively.
Alternatively, the output of |
sets |
A list or CharacterList of character vectors.
Each vector represents a gene set and contains the identifiers for genes in that set.
Identifiers should be consistent with the row names of |
groups |
String specifying the |
comparison |
Character vector of length no greater than 2.
For length 2, this specifies the groups to be compared, whereas for length 1, this specifies the covariate to test.
See Alternatively, a named character vector with no more than 2 unique names,
see Unlike |
covariates |
Character vector specifying the |
block |
Character vector specifying the |
subset.factor |
String specifying the |
subset.levels |
Vector containing the levels of the |
subset.groups |
Boolean indicating whether to automatically subset the dataset to only those samples assigned to groups in |
design |
Matrix, function, or formula specifying the experimental design, see |
contrast |
String, function or vector specifying a custom contrast,
see Unlike |
dc.block |
String specifying the blocking factor to use in |
robust |
Boolean indicating whether robust empirical Bayes shrinkage should be used in |
quality |
Boolean indicating whether quality weighting should be performed. This reduces the influence of low-quality samples at the cost of more computational work. |
trend |
Boolean indicating whether variances should be shrunk towards a trend in |
methods |
Character vector specifying the methods to run. This should contain at least one of the following:
|
assay |
String or integer specifying the assay of |
annotation |
Character vector specifying the columns of |
metadata |
Named list of additional metadata to store alongside each result. |
output.dir |
String containing the path to an output directory in which to write the Rmarkdown file and save results. |
author |
Character vector containg the names of the authors.
If |
dry.run |
Boolean indicating whether to perform a dry run.
This generates the Rmarkdown report in |
save.results |
Boolean indicating whether the results should be saved to file. |
suppress.plots |
Boolean indicating whether plots should be suppressed.
This can be set to |
Some of the methods involve randomization, so for full reproducibility, users should call set.seed before running runContrast.
Note that, even if the user does not call set.seed,
runContrast will automatically insert set.seed statements into the Rmarkdown report
prior to any GSEA functions that involve randomization.
Each set.seed call has a hard-coded seed to ensure that future compilation of the generated report will give the same result.
However, different calls to runContrast will use different (randomly selected) seeds to avoid systematic biases.
Thus, if full reproducibility of runContrast is required, users should set the seed themselves before calling runContrast.
A Rmarkdown report named report.Rmd is written inside output.dir.
This contains all commmands used to reproduce the analysis.
If dry.run=FALSE, a list of DataFrames is returned where each DataFrame contains the enrichment results for a method.
Each row corresponds to a gene set in sets and is named accordingly.
All DataFrames are guaranteed to have (at least) the following fields:
NumGenes, the number of genes in the set (after removing genes that were not tested).
Direction, the net direction of the change in expression within the set (either "up" or "down").
PValue, the p-value for enrichment in each gene set.
FDR, the Benjamini-Hochberg-adjusted p-value.
Additional fields may be present for specific methods.
If save.results=TRUE, the results are saved in a results subdirectory of output.dir.
If dry.run=FALSE, only the report is created, and NULL is returned.
Aaron Lun
x <- augere.de::loadExampleDataset() all.genes <- rownames(x) sets <- list( A = sample(all.genes, 92), B = sample(all.genes, 212), C = sample(all.genes, 12), D = sample(all.genes, 38), E = sample(all.genes, 55) ) output.dir <- tempfile() res <- runContrast( x, sets, group="dex", comparison=c("trt", "untrt"), output=output.dir ) res list.files(output.dir, recursive=TRUE)x <- augere.de::loadExampleDataset() all.genes <- rownames(x) sets <- list( A = sample(all.genes, 92), B = sample(all.genes, 212), C = sample(all.genes, 12), D = sample(all.genes, 38), E = sample(all.genes, 55) ) output.dir <- tempfile() res <- runContrast( x, sets, group="dex", comparison=c("trt", "untrt"), output=output.dir ) res list.files(output.dir, recursive=TRUE)
Run gene set enrichment analyses on precomputed statistics, usually from differential expression analyses. This uses competitive gene set tests where the enrichment of “interesting” genes within the set must be greater than that outside of the set.
runPrecomputed( x, sets, signif.field, signif.threshold, rank.field, methods = c("hypergeometric", "goseq", "fgsea", "cameraPR"), alternative = c("mixed", "up", "down", "either"), rank.sqrt = FALSE, sign.field = NULL, goseq.bias = "AveExpr", goseq.args = list(), fgsea.leading.edge = FALSE, fgsea.args = list(), geneSetTest.args = list(), cameraPR.args = list(), metadata = NULL, annotation = NULL, author = NULL, output.dir = "precomputed", dry.run = FALSE, save.results = TRUE, suppress.plots = FALSE )runPrecomputed( x, sets, signif.field, signif.threshold, rank.field, methods = c("hypergeometric", "goseq", "fgsea", "cameraPR"), alternative = c("mixed", "up", "down", "either"), rank.sqrt = FALSE, sign.field = NULL, goseq.bias = "AveExpr", goseq.args = list(), fgsea.leading.edge = FALSE, fgsea.args = list(), geneSetTest.args = list(), cameraPR.args = list(), metadata = NULL, annotation = NULL, author = NULL, output.dir = "precomputed", dry.run = FALSE, save.results = TRUE, suppress.plots = FALSE )
x |
A data frame or DataFrame object containing various test statistics (columns) for genes (rows).
Rows should be named with the same gene identifiers used in Rows should not be filtered to only the significant genes, this will be handled by Alternatively, an object returned by wrapInput that refers to a data frame or DataFrame. |
sets |
A list or CharacterList of character vectors.
Each vector represents a gene set and contains the identifiers for genes in that set.
Identifiers should be consistent with the row names of |
signif.field |
String specifying the column of |
signif.threshold |
Number specifying the upper threshold for significance to apply to the statistics from |
rank.field |
String specifying the column of |
methods |
Character vector specifying the gene set testing methods to use. This can be any number of the following:
|
alternative |
String specifying the alternative hypothesis. This should be one of:
|
rank.sqrt |
Boolean indicating whether to compute the square root of the |
sign.field |
String specifying the column of |
goseq.bias |
String specifying the column of |
goseq.args |
Named list of additional arguments to pass to |
fgsea.leading.edge |
Boolean indicating whether the “leading edge” should be stored from |
fgsea.args |
Named list of additional arguments to pass to |
geneSetTest.args |
Named list of additional arguments to pass to |
cameraPR.args |
Named list of additional arguments to pass to |
metadata |
Named list of additional metadata to store with each result. |
annotation |
Character vector specifying the columns of |
author |
Character vector containg the names of the authors.
If |
output.dir |
String containing the path to an output directory in which to write the Rmarkdown file and save results. |
dry.run |
Logical scalar indicating whether a dry run should be performed,
This generates the Rmarkdown report in |
save.results |
Boolean indicating whether the results should be saved to file. |
suppress.plots |
Boolean indicating whether to suppress the generation of plots.
This can be set to |
Some of the methods involve randomization, so for full reproducibility, users should call set.seed before running runPrecomputed.
Note that, even if the user does not call set.seed,
runPrecomputed will automatically insert set.seed statements into the Rmarkdown report
prior to any GSEA functions that involve randomization.
Each set.seed call has a hard-coded seed to ensure that future compilation of the generated report will give the same result.
However, different calls to runPrecomputed will use different (randomly selected) seeds to avoid systematic biases.
Thus, if full reproducibility of runPrecomputed is required, users should set the seed themselves before calling runPrecomputed.
A Rmarkdown report named report.Rmd is written inside output.dir.
This contains all commmands used to reproduce the analysis.
If dry.run=FALSE, a list of DataFrames is returned where each DataFrame contains the enrichment results for a method.
Each row corresponds to a gene set in sets and is named accordingly.
All DataFrames are guaranteed to have (at least) the following fields:
NumGenes, the number of genes in the set (after removing genes that were not tested).
PValue, the p-value for enrichment in each gene set.
FDR, the Benjamini-Hochberg-adjusted p-value.
(if alternative="either") Direction, whether the set is more enriched for "up"- or "down"-regulated genes.
Specific methods will have additional fields:
For "fgsea", ES contains the enrichment score and NES contains the normalized enrichment score.
If fgsea.leading.edge=TRUE, LeadingEdge will contain a CharacterList with the names of genes in the “leading edge” for each gene set.
For "goseq" and "hypergeometric", NumSig contains the number of significant genes in each set.
(If alternative is "up" or "down", this is filtered for significant genes of the desired sign.)
If save.results=TRUE, the results are saved in a results subdirectory of output.dir.
If dry.run=FALSE, only the report is created, and NULL is returned.
Aaron Lun
all.genes <- sprintf("gene-%i", seq_len(1000)) library(S4Vectors) tab <- DataFrame( row.names=all.genes, AveExpr=rnorm(length(all.genes)), t=rnorm(length(all.genes)), PValue=runif(length(all.genes)), FDR=runif(length(all.genes)) ) sets <- list( A=sample(all.genes, 20), B=sample(all.genes, 300), C=sample(all.genes, 10), D=sample(all.genes, 500) ) output.dir <- tempfile() results <- runPrecomputed( tab, sets, signif.field="FDR", signif.threshold=0.05, rank.field="t", output=output.dir ) results$hypergeometric results$fgsea list.files(output.dir, recursive=TRUE)all.genes <- sprintf("gene-%i", seq_len(1000)) library(S4Vectors) tab <- DataFrame( row.names=all.genes, AveExpr=rnorm(length(all.genes)), t=rnorm(length(all.genes)), PValue=runif(length(all.genes)), FDR=runif(length(all.genes)) ) sets <- list( A=sample(all.genes, 20), B=sample(all.genes, 300), C=sample(all.genes, 10), D=sample(all.genes, 500) ) output.dir <- tempfile() results <- runPrecomputed( tab, sets, signif.field="FDR", signif.threshold=0.05, rank.field="t", output=output.dir ) results$hypergeometric results$fgsea list.files(output.dir, recursive=TRUE)