Title: | Reproducible GSEA Benchmarking |
---|---|
Description: | The GSEABenchmarkeR package implements an extendable framework for reproducible evaluation of set- and network-based methods for enrichment analysis of gene expression data. This includes support for the efficient execution of these methods on comprehensive real data compendia (microarray and RNA-seq) using parallel computation on standard workstations and institutional computer grids. Methods can then be assessed with respect to runtime, statistical significance, and relevance of the results for the phenotypes investigated. |
Authors: | Ludwig Geistlinger [aut, cre], Gergely Csaba [aut], Mara Santarelli [ctb], Lucas Schiffer [ctb], Marcel Ramos [ctb], Ralf Zimmer [aut], Levi Waldron [aut] |
Maintainer: | Ludwig Geistlinger <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.27.0 |
Built: | 2024-10-30 07:23:58 UTC |
Source: | https://github.com/bioc/GSEABenchmarkeR |
This is a convenience function to create customized boxplots for specific benchmark criteria such as runtime, statistical significance and phenotype relevance.
bpPlot(data, what = c("runtime", "sig.sets", "rel.sets", "typeI"))
bpPlot(data, what = c("runtime", "sig.sets", "rel.sets", "typeI"))
data |
Numeric matrix or list of numeric vectors. In case of a matrix, column names are assumed to be method names and rownames are assumed to be dataset IDs. In case of a list, names are assumed to be method names and each element corresponds to a numeric vector with names assumed to be dataset IDs. |
what |
Character. Determines how the plot is customized. One of
|
None. Plots to a graphics device.
Ludwig Geistlinger <[email protected]>
evalNrSigSets
to evaluate fractions of significant
gene sets; evalRelevance
to evaluate phenotype relevance of
gene set rankings.
# simulated setup: # 3 methods & 5 datasets methods <- paste0("m", 1:3) data.ids <- paste0("d", 1:5) # runtime data rt <- vapply(1:3, function(m) runif(5, min = m, max = m+1), numeric(5)) rownames(rt) <- data.ids colnames(rt) <- methods # plot bpPlot(rt, what = "runtime")
# simulated setup: # 3 methods & 5 datasets methods <- paste0("m", 1:3) data.ids <- paste0("d", 1:5) # runtime data rt <- vapply(1:3, function(m) runif(5, min = m, max = m+1), numeric(5)) rownames(rt) <- data.ids colnames(rt) <- methods # plot bpPlot(rt, what = "runtime")
Convenience function to flexibly save and restore an already processed expression data compendium via caching.
cacheResource(res, rname, ucdir = "GSEABenchmarkeR")
cacheResource(res, rname, ucdir = "GSEABenchmarkeR")
res |
Resource. An arbitrary R object. |
rname |
Character. Resource name. |
ucdir |
Character. User cache directory. Defaults to 'GSEABenchmarkeR',
which will accordingly use |
None. Stores the object in the cache by side effect.
Ludwig Geistlinger <[email protected]>
loadEData
, R_user_dir
,
BiocFileCache
# load user-defined expression compendium data.dir <- system.file("extdata/myEData", package = "GSEABenchmarkeR") edat <- loadEData(data.dir) # do some processing of the compendium edat <- lapply(edat, function(d) d[1:50,]) # cache it ... cacheResource(edat, "myEData") # ... and restore it at a later time edat <- loadEData(data.dir, cache = TRUE)
# load user-defined expression compendium data.dir <- system.file("extdata/myEData", package = "GSEABenchmarkeR") edat <- loadEData(data.dir) # do some processing of the compendium edat <- lapply(edat, function(d) d[1:50,]) # cache it ... cacheResource(edat, "myEData") # ... and restore it at a later time edat <- loadEData(data.dir, cache = TRUE)
These functions evaluate gene set rankings obtained from applying enrichment methods to multiple datasets. This allows to assess resulting rankings for granularity (how many gene sets have a unique p-value?) and statistical significance (how many gene sets have a p-value below a significance threshold?).
evalNrSigSets(ea.ranks, alpha = 0.05, padj = "none", perc = TRUE) evalNrSets(ea.ranks, uniq.pval = TRUE, perc = TRUE)
evalNrSigSets(ea.ranks, alpha = 0.05, padj = "none", perc = TRUE) evalNrSets(ea.ranks, uniq.pval = TRUE, perc = TRUE)
ea.ranks |
Enrichment analysis rankings. A list with an entry for each enrichment method applied. Each entry is a list that stores for each dataset analyzed the resulting gene set ranking as obtained from applying the respective method to the respective dataset. |
alpha |
Statistical significance level. Defaults to 0.05. |
padj |
Character. Method for adjusting p-values to multiple testing.
For available methods see the man page of the stats function
|
perc |
Logical. Should the percentage or absolute number of gene sets
be returned? Percentage is typically more useful for comparison between
rankings with a potentially different total number of gene sets. Defaults
to |
uniq.pval |
Logical. Should the number of gene sets with a unique
p-value or the total number of gene sets per ranking be returned? Defaults
to |
A list of numeric vectors storing for each method the number of (significant) gene sets for each dataset analyzed. If each element of the resulting list is of equal length (corresponds to successful application of each enrichment method to each dataset), the list is automatically simplified to a numeric matrix (rows = datasets, columns = methods).
Ludwig Geistlinger <[email protected]>
runEA
to apply enrichment methods to multiple
datasets; readResults
to read saved rankings as an input for
the eval-functions.
# simulated setup: # 2 methods & 2 datasets methods <- paste0("m", 1:2) data.ids <- paste0("d", 1:2) # simulate gene set rankings getRankingForDataset <- function(d) { r <- EnrichmentBrowser::makeExampleData("ea.res") EnrichmentBrowser::gsRanking(r, signif.only=FALSE) } getRankingsForMethod <- function(m) { rs <- lapply(data.ids, getRankingForDataset) names(rs) <- data.ids rs } ea.ranks <- lapply(methods, getRankingsForMethod) names(ea.ranks) <- methods # evaluate evalNrSets(ea.ranks) evalNrSigSets(ea.ranks)
# simulated setup: # 2 methods & 2 datasets methods <- paste0("m", 1:2) data.ids <- paste0("d", 1:2) # simulate gene set rankings getRankingForDataset <- function(d) { r <- EnrichmentBrowser::makeExampleData("ea.res") EnrichmentBrowser::gsRanking(r, signif.only=FALSE) } getRankingsForMethod <- function(m) { rs <- lapply(data.ids, getRankingForDataset) names(rs) <- data.ids rs } ea.ranks <- lapply(methods, getRankingsForMethod) names(ea.ranks) <- methods # evaluate evalNrSets(ea.ranks) evalNrSigSets(ea.ranks)
This function evaluates the proportion of rejected null hypotheses (= the fraction of significant gene sets) of an enrichment method when applied to random gene sets of defined size.
evalRandomGS( method, se, nr.gs = 100, set.size = 5, alpha = 0.05, padj = "none", perc = TRUE, reps = 100, rep.block.size = -1, summarize = TRUE, save2file = FALSE, out.dir = NULL, ... )
evalRandomGS( method, se, nr.gs = 100, set.size = 5, alpha = 0.05, padj = "none", perc = TRUE, reps = 100, rep.block.size = -1, summarize = TRUE, save2file = FALSE, out.dir = NULL, ... )
method |
Enrichment analysis method. A character scalar chosen
from |
se |
An expression dataset of class |
nr.gs |
Integer. Number of random gene sets. Defaults to 100. |
set.size |
Integer. Gene set size, i.e. number of genes in each random gene set. |
alpha |
Numeric. Statistical significance level. Defaults to 0.05. |
padj |
Character. Method for adjusting p-values to multiple testing.
For available methods see the man page of the stats function
|
perc |
Logical. Should the percentage (between 0 and 100, default) or the proportion (between 0 and 1) of significant gene sets be returned? |
reps |
Integer. Number of replications. Defaults to 100. |
rep.block.size |
Integer. When running in parallel, splits |
summarize |
Logical. If |
save2file |
Logical. Should results be saved to file for subsequent
benchmarking? Defaults to |
out.dir |
Character. Determines the output directory where results are
saved to. Defaults to |
... |
Additional arguments passed to the selected enrichment method. |
A named numeric vector of length 2 storing mean and standard deviation
of the proportion of significant gene sets across reps
replications
(summarize=TRUE
); or a numeric vector of length reps
storing the
the proportion of significant gene sets for each replication itself
(summarize=FALSE
).
Ludwig Geistlinger <[email protected]>
sbea
and nbea
for carrying out set- and network-based enrichment analysis.
BiocParallelParam
and register
for
configuration of parallel computation.
# loading two datasets from the GEO2KEGG compendium geo2kegg <- loadEData("geo2kegg", nr.datasets = 2) # only considering the first 1000 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) # preprocessing and DE analysis for two of the datasets geo2kegg <- maPreproc(geo2kegg) geo2kegg <- runDE(geo2kegg) evalRandomGS("camera", geo2kegg[[1]], reps = 3)
# loading two datasets from the GEO2KEGG compendium geo2kegg <- loadEData("geo2kegg", nr.datasets = 2) # only considering the first 1000 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) # preprocessing and DE analysis for two of the datasets geo2kegg <- maPreproc(geo2kegg) geo2kegg <- runDE(geo2kegg) evalRandomGS("camera", geo2kegg[[1]], reps = 3)
This function evaluates gene set rankings obtained from the application of enrichment methods to multiple datasets - where each dataset investigates a certain phenotype such as a disease. Given pre-defined phenotype relevance scores for the gene sets, indicating how important a gene set is for the investigated phenotype (as e.g. judged by evidence from the literature), this allows to assess whether enrichment methods produce gene set rankings in which phenotype-relevant gene sets accumulate at the top.
evalRelevance( ea.ranks, rel.ranks, data2pheno, method = "wsum", top = 0, rel.thresh = 0, ... ) compOpt(rel.ranks, gs.ids, data2pheno = NULL, top = 0) compRand(rel.ranks, gs.ids, data2pheno = NULL, perm = 1000)
evalRelevance( ea.ranks, rel.ranks, data2pheno, method = "wsum", top = 0, rel.thresh = 0, ... ) compOpt(rel.ranks, gs.ids, data2pheno = NULL, top = 0) compRand(rel.ranks, gs.ids, data2pheno = NULL, perm = 1000)
ea.ranks |
Enrichment analysis rankings. A list with an entry for each
enrichment method applied. Each entry is a list that stores for each
dataset analyzed the resulting gene set ranking, obtained from applying the
respective method to the respective dataset. Resulting gene set rankings
are assumed to be of class |
rel.ranks |
Relevance score rankings. A list with an entry for each
phenotype investigated. Each entry should be a
|
data2pheno |
A named character vector where the names correspond to dataset IDs and the elements of the vector to the corresponding phenotypes investigated. |
method |
Character. Determines how the relevance score is summarized
across the enrichment analysis ranking. Choose |
top |
Integer. If |
rel.thresh |
Numeric. Relevance score threshold. Restricts relevance
score rankings (argument |
... |
Additional arguments for computation of the relevance measure
as defined by the
|
gs.ids |
Character vector of gene set IDs on which enrichment analysis has been carried out. |
perm |
Integer. Number of permutations if |
The function evalRelevance
evaluates the similarity of a gene set ranking
obtained from enrichment analysis and a gene set ranking based on phenotype
relevance scores. Therefore, the function first transforms the ranks 'r'
from the enrichment analysis to weights 'w' in [0,1] via w = 1 - r / N;
where 'N' denotes the total number of gene sets on which the enrichment
analysis has been carried out. These weights are then multiplied with the
corresponding relevance scores and summed up.
The function compOpt
applies evalRelevance
to the theoretically
optimal case in which the enrichment analysis ranking is identical to the
relevance score ranking. The ratio between observed and optimal score is
useful for comparing observed scores between datasets / phenotypes.
The function compRand
repeatedly applies evalRelevance
to random
rankings obtained from placing the gene sets randomly along the ranking, thereby
assessing how likely it is to observe a score equal or greater than the one
obtained.
It is also possible to inspect other measures for summarizing the phenotype
relevance, instead of calculating weighted relevance scores sums (argument
method="wsum"
, default).
One possibility is to treat the comparison of the EA ranking and the relevance
ranking as a classification problem, and to compute standard classification
performance measures such as the area under the ROC curve (method="auc"
).
However, this requires to divide the relevance rankings (argument rel.ranks
)
into relevant (true positives) and irrelevant (true negatives) gene sets using
the top
argument.
Instead of method="auc"
, this can also be any other
performance measure that the ROCR package (https://rocr.bioinf.mpi-sb.mpg.de)
implements. For example, method="tnr"
for calculation of the true
negative rate. Although such classification performance measures are easy to
interpret, the weighted sum has certain preferable properties such as avoiding
thresholding and accounting for varying degrees of relevance in the relevance
rankings.
It is also possible to compute a standard rank-based correlation measure
such as Spearman's correlation (method="cor"
) to compare the similarity
of the enrichment analysis rankings and the relevance rankings. However, this
might not be optimal for a comparison of an EA ranking going over the full
gene set vector against the typically much smaller vector of gene sets for
which a relevance score is annotated. For this scenario, using
rank correlation reduces the question to "does a subset of the EA ranking
preserve the order of the relevance ranking"; although our question of interest is
rather "is a subset of the relevant gene sets ranked highly in the EA ranking".
A numeric matrix (rows = datasets, columns = methods) storing in each cell the chosen relevance measure (score, AUC, cor) obtained from applying the respective enrichment method to the respective expression dataset.
Ludwig Geistlinger <[email protected]>
runEA
to apply enrichment methods to multiple datasets;
readResults
to read saved rankings as an input for the eval-functions;
# # (1) simulated setup: 1 enrichment method applied to 1 dataset # # simulate gene set ranking ea.ranks <- EnrichmentBrowser::makeExampleData("ea.res") ea.ranks <- EnrichmentBrowser::gsRanking(ea.ranks, signif.only=FALSE) # simulated relevance score ranking rel.ranks <- ea.ranks rel.ranks[,2] <- runif(nrow(ea.ranks), min=1, max=100) colnames(rel.ranks)[2] <- "REL.SCORE" rownames(rel.ranks) <- rel.ranks[,"GENE.SET"] ind <- order(rel.ranks[,"REL.SCORE"], decreasing=TRUE) rel.ranks <- rel.ranks[ind,] # evaluate evalRelevance(ea.ranks, rel.ranks) compOpt(rel.ranks, ea.ranks[,"GENE.SET"]) compRand(rel.ranks, ea.ranks[,"GENE.SET"], perm=3) # # (2) simulated setup: 2 methods & 2 datasets # methods <- paste0("m", 1:2) data.ids <- paste0("d", 1:2) # simulate gene set rankings ea.ranks <- sapply(methods, function(m) sapply(data.ids, function(d) { r <- EnrichmentBrowser::makeExampleData("ea.res") r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE) return(r) }, simplify=FALSE), simplify=FALSE) # simulate a mapping from datasets to disease codes d2d <- c("ALZ", "BRCA") names(d2d) <- data.ids # simulate relevance score rankings rel.ranks <- lapply(ea.ranks[[1]], function(rr) { rr[,2] <- runif(nrow(rr), min=1, max=100) colnames(rr)[2] <- "REL.SCORE" rownames(rr) <- rr[,"GENE.SET"] ind <- order(rr[,"REL.SCORE"], decreasing=TRUE) rr <- rr[ind,] return(rr) }) names(rel.ranks) <- unname(d2d) # evaluate evalRelevance(ea.ranks, rel.ranks, d2d)
# # (1) simulated setup: 1 enrichment method applied to 1 dataset # # simulate gene set ranking ea.ranks <- EnrichmentBrowser::makeExampleData("ea.res") ea.ranks <- EnrichmentBrowser::gsRanking(ea.ranks, signif.only=FALSE) # simulated relevance score ranking rel.ranks <- ea.ranks rel.ranks[,2] <- runif(nrow(ea.ranks), min=1, max=100) colnames(rel.ranks)[2] <- "REL.SCORE" rownames(rel.ranks) <- rel.ranks[,"GENE.SET"] ind <- order(rel.ranks[,"REL.SCORE"], decreasing=TRUE) rel.ranks <- rel.ranks[ind,] # evaluate evalRelevance(ea.ranks, rel.ranks) compOpt(rel.ranks, ea.ranks[,"GENE.SET"]) compRand(rel.ranks, ea.ranks[,"GENE.SET"], perm=3) # # (2) simulated setup: 2 methods & 2 datasets # methods <- paste0("m", 1:2) data.ids <- paste0("d", 1:2) # simulate gene set rankings ea.ranks <- sapply(methods, function(m) sapply(data.ids, function(d) { r <- EnrichmentBrowser::makeExampleData("ea.res") r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE) return(r) }, simplify=FALSE), simplify=FALSE) # simulate a mapping from datasets to disease codes d2d <- c("ALZ", "BRCA") names(d2d) <- data.ids # simulate relevance score rankings rel.ranks <- lapply(ea.ranks[[1]], function(rr) { rr[,2] <- runif(nrow(rr), min=1, max=100) colnames(rr)[2] <- "REL.SCORE" rownames(rr) <- rr[,"GENE.SET"] ind <- order(rr[,"REL.SCORE"], decreasing=TRUE) rr <- rr[ind,] return(rr) }) names(rel.ranks) <- unname(d2d) # evaluate evalRelevance(ea.ranks, rel.ranks, d2d)
This function evaluates the type I error rate of selected methods for enrichment analysis when applied to one or more expression datasets.
evalTypeIError( methods, exp.list, gs, alpha = 0.05, ea.perm = 1000, tI.perm = 1000, perm.block.size = -1, summarize = TRUE, save2file = FALSE, out.dir = NULL, verbose = TRUE, ... )
evalTypeIError( methods, exp.list, gs, alpha = 0.05, ea.perm = 1000, tI.perm = 1000, perm.block.size = -1, summarize = TRUE, save2file = FALSE, out.dir = NULL, verbose = TRUE, ... )
methods |
Methods for enrichment analysis. This can be either
|
exp.list |
Experiment list. A |
gs |
Gene sets, i.e. a list of character vectors of gene IDs. |
alpha |
Numeric. Statistical significance level. Defaults to 0.05. |
ea.perm |
Integer. Number of permutations of the sample group assignments during enrichment analysis. Defaults to 1000. Can also be an integer vector matching the length of 'methods' to assign different numbers of permutations for different methods. |
tI.perm |
Integer. Number of permutations of the sample group assignments
during type I error rate evaluation. Defaults to 1000. Can also be an integer
vector matching the length of |
perm.block.size |
Integer. When running in parallel, splits |
summarize |
Logical. If |
save2file |
Logical. Should results be saved to file for subsequent
benchmarking? Defaults to |
out.dir |
Character. Determines the output directory where results are
saved to. Defaults to |
verbose |
Logical. Should progress be reported? Defaults to |
... |
Additional arguments passed to the selected enrichment methods. |
A list with an entry for each method applied. Each method entry is
a list with an entry for each dataset analyzed. Each dataset entry is either
a summary (summarize=TRUE
) or the full vector of type I error rates
(summarize=FALSE
) across tI.perm
permutations of the sample labels.
Ludwig Geistlinger <[email protected]>
sbea
and nbea
for carrying out set- and network-based enrichment analysis.
BiocParallelParam
and register
for
configuration of parallel computation.
# loading three datasets from the GEO2KEGG compendium geo2kegg <- loadEData("geo2kegg", nr.datasets=3) # only considering the first 1000 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) # preprocessing and DE analysis for two of the datasets geo2kegg <- maPreproc(geo2kegg[2:3]) geo2kegg <- runDE(geo2kegg) # getting a subset of human KEGG gene sets gs.file <- system.file("extdata", package="EnrichmentBrowser") gs.file <- file.path(gs.file, "hsa_kegg_gs.gmt") kegg.gs <- EnrichmentBrowser::getGenesets(gs.file) # evaluating type I error rate of two methods on two datasets # NOTE: using a small number of permutations for demonstration; # for a meaningful evaluation tI.perm should be >= 1000 res <- evalTypeIError(geo2kegg, methods=c("ora", "camera"), gs=kegg.gs, ea.perm=0, tI.perm=3) # applying a user-defined enrichment method ... # ... or a mix of pre-defined and user-defined methods dummySBEA <- function(se, gs) { sig.ps <- sample(seq(0, 0.05, length=1000), 5) nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5) ps <- sample(c(sig.ps, nsig.ps), length(gs)) names(ps) <- names(gs) return(ps) } methods <- list(camera = "camera", dummySBEA = dummySBEA) res <- evalTypeIError(methods, geo2kegg, gs=kegg.gs, tI.perm=3)
# loading three datasets from the GEO2KEGG compendium geo2kegg <- loadEData("geo2kegg", nr.datasets=3) # only considering the first 1000 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) # preprocessing and DE analysis for two of the datasets geo2kegg <- maPreproc(geo2kegg[2:3]) geo2kegg <- runDE(geo2kegg) # getting a subset of human KEGG gene sets gs.file <- system.file("extdata", package="EnrichmentBrowser") gs.file <- file.path(gs.file, "hsa_kegg_gs.gmt") kegg.gs <- EnrichmentBrowser::getGenesets(gs.file) # evaluating type I error rate of two methods on two datasets # NOTE: using a small number of permutations for demonstration; # for a meaningful evaluation tI.perm should be >= 1000 res <- evalTypeIError(geo2kegg, methods=c("ora", "camera"), gs=kegg.gs, ea.perm=0, tI.perm=3) # applying a user-defined enrichment method ... # ... or a mix of pre-defined and user-defined methods dummySBEA <- function(se, gs) { sig.ps <- sample(seq(0, 0.05, length=1000), 5) nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5) ps <- sample(c(sig.ps, nsig.ps), length(gs)) names(ps) <- names(gs) return(ps) } methods <- list(camera = "camera", dummySBEA = dummySBEA) res <- evalTypeIError(methods, geo2kegg, gs=kegg.gs, tI.perm=3)
This function implements a general interface for loading the pre-defined GEO2KEGG microarray compendium and the TCGA RNA-seq compendium. It also allows loading of user-defined data from file.
loadEData(edata, nr.datasets = NULL, cache = TRUE, ...)
loadEData(edata, nr.datasets = NULL, cache = TRUE, ...)
edata |
Expression data compendium. A character vector of length 1 that must be either
See details. |
nr.datasets |
Integer. Number of datasets that should be loaded from the compendium. This is mainly for demonstration purposes. |
cache |
Logical. Should an already cached version used if available?
Defaults to |
... |
Additional arguments passed to the internal loading routines of the GEO2KEGG and TCGA compendia. This currently includes for loading of the GEO2KEGG compendium
And for loading of the TCGA compendium
|
The pre-defined GEO2KEGG microarray compendium consists of 42 datasets investigating a total of 19 different human diseases as collected by Tarca et al. (2012 and 2013).
The pre-defined TCGA RNA-seq compendium consists of datasets from The Cancer Genome Atlas (TCGA, 2013) investigating a total of 34 different cancer types.
User-defined data can also be loaded, given that datasets, preferably of
class SummarizedExperiment
, have been saved as
RDS
files.
A list
of datasets, typically of class
SummarizedExperiment
.
Note that loadEData("geo2kegg", preproc = FALSE)
(the default)
returns the original microarray probe level data as a list of
ExpressionSet
objects. Use preproc = TRUE
or
the maPreproc
function to summarize the probe level
data to gene level data and to obtain a list
of
SummarizedExperiment
objects.
Ludwig Geistlinger <[email protected]>
Tarca et al. (2012) Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13:136.
Tarca et al. (2013) A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11):e79217.
The Cancer Genome Atlas Research Network (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 45(10):1113-20.
Rahman et al. (2015) Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics, 31(22):3666-72.
SummarizedExperiment
,
ExpressionSet
, maPreproc
# (1) Loading the GEO2KEGG microarray compendium geo2kegg <- loadEData("geo2kegg", nr.datasets=2) # (2) Loading the TCGA RNA-seq compendium tcga <- loadEData("tcga", nr.datasets=2) # (3) reading user-defined expression data from file data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR") edat <- loadEData(data.dir)
# (1) Loading the GEO2KEGG microarray compendium geo2kegg <- loadEData("geo2kegg", nr.datasets=2) # (2) Loading the TCGA RNA-seq compendium tcga <- loadEData("tcga", nr.datasets=2) # (3) reading user-defined expression data from file data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR") edat <- loadEData(data.dir)
This function prepares datasets of the GEO2KEGG microarray compendium for further analysis. This includes summarization of probe level expression to gene level expression as well as annotation of required colData slots for sample grouping.
maPreproc(exp.list, parallel = NULL)
maPreproc(exp.list, parallel = NULL)
exp.list |
Experiment list. A |
parallel |
Parallel computation mode. An instance of class
|
A list
of datasets, each being of class
SummarizedExperiment
.
Ludwig Geistlinger <[email protected]>
loadEData
to load a specified expression data
compendium.
# reading user-defined expression data from file geo2kegg <- loadEData("geo2kegg", nr.datasets=3) # only considering the first 100 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:100,]) # preprocessing two datasets geo2kegg <- maPreproc(geo2kegg[2:3])
# reading user-defined expression data from file geo2kegg <- loadEData("geo2kegg", nr.datasets=3) # only considering the first 100 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:100,]) # preprocessing two datasets geo2kegg <- maPreproc(geo2kegg[2:3])
When assessing enrichment analysis results for phenotype relevance, it is assumed that each analyzed dataset investigates a certain phenotype such as a disease. This function reads a mapping between dataset IDs and assigned disease codes.
readDataId2diseaseCodeMap(map.file)
readDataId2diseaseCodeMap(map.file)
map.file |
Character. The path to the mapping file. |
A named character vector where each element of the vector is a disease code and the names are the dataset IDs.
Ludwig Geistlinger <[email protected]>
evalRelevance
for evaluating phenotype relevance of gene set
rankings.
data.dir <- system.file("extdata", package="GSEABenchmarkeR") d2d.file <- file.path(data.dir, "malacards", "GseId2Disease.txt") d2d.map <- readDataId2diseaseCodeMap(d2d.file)
data.dir <- system.file("extdata", package="GSEABenchmarkeR") d2d.file <- file.path(data.dir, "malacards", "GseId2Disease.txt") d2d.map <- readDataId2diseaseCodeMap(d2d.file)
These functions read results obtained from the application of enrichment methods to multiple datasets for subsequent assessment.
readResults( data.dir, data.ids, methods, type = c("runtime", "ranking", "typeI") )
readResults( data.dir, data.ids, methods, type = c("runtime", "ranking", "typeI") )
data.dir |
Character. The data directory where results have been saved to. |
data.ids |
A character vector of dataset IDs. |
methods |
Methods for enrichment analysis. A character vector with
method names typically chosen from |
type |
Character. Type of the result. Should be one out of 'runtime', 'ranking', or 'typeI'. |
A result list with an entry for each method applied. Each entry
stores corresponding runtimes (type="runtime"
), gene set
rankings (type="ranking"
), or type I error rates (type="typeI"
)
as obtained from applying the respective method to the given datasets.
Ludwig Geistlinger <[email protected]>
runEA
to apply enrichment methods to multiple datasets.
# simulated setup: # 1 methods & 1 datasets methods <- paste0("m", 1:2) data.ids <- paste0("d", 1:2) # result directory res.dir <- tempdir() sdirs <- file.path(res.dir, methods) for(d in sdirs) dir.create(d) # store runtime & rankings for(m in 1:2) { rt <- runif(5, min=m, max=m+1) for(d in 1:2) { # runtime out.file <- paste(data.ids[d], "txt", sep=".") out.file <- file.path(sdirs[m], out.file) cat(rt[d], file=out.file) # ranking out.file <- sub("txt$", "rds", out.file) r <- EnrichmentBrowser::makeExampleData("ea.res") r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE) saveRDS(r, file=out.file) } } # reading runtime & rankings rts <- readResults(res.dir, data.ids, methods, type="runtime") rkgs <- readResults(res.dir, data.ids, methods, type="ranking")
# simulated setup: # 1 methods & 1 datasets methods <- paste0("m", 1:2) data.ids <- paste0("d", 1:2) # result directory res.dir <- tempdir() sdirs <- file.path(res.dir, methods) for(d in sdirs) dir.create(d) # store runtime & rankings for(m in 1:2) { rt <- runif(5, min=m, max=m+1) for(d in 1:2) { # runtime out.file <- paste(data.ids[d], "txt", sep=".") out.file <- file.path(sdirs[m], out.file) cat(rt[d], file=out.file) # ranking out.file <- sub("txt$", "rds", out.file) r <- EnrichmentBrowser::makeExampleData("ea.res") r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE) saveRDS(r, file=out.file) } } # reading runtime & rankings rts <- readResults(res.dir, data.ids, methods, type="runtime") rkgs <- readResults(res.dir, data.ids, methods, type="ranking")
This function applies selected methods for differential expression (DE) analysis to selected datasets of an expression data compendium.
runDE( exp.list, de.method = c("limma", "edgeR", "DESeq2"), padj.method = "flexible", parallel = NULL, ... ) metaFC(exp.list, max.na = round(length(exp.list)/3)) writeDE(exp.list, out.dir = NULL) plotDEDistribution(exp.list, alpha = 0.05, beta = 1) plotNrSamples(exp.list)
runDE( exp.list, de.method = c("limma", "edgeR", "DESeq2"), padj.method = "flexible", parallel = NULL, ... ) metaFC(exp.list, max.na = round(length(exp.list)/3)) writeDE(exp.list, out.dir = NULL) plotDEDistribution(exp.list, alpha = 0.05, beta = 1) plotNrSamples(exp.list)
exp.list |
Experiment list. A |
de.method |
Differential expression method. See documentation of
|
padj.method |
Method for adjusting p-values to multiple testing. For
available methods see the man page of the stats function |
parallel |
Parallel computation mode. An instance of class
|
... |
Additional arguments passed to |
max.na |
Integer. Determines for which genes a meta fold change is
computed. Per default, excludes genes for which the fold change is not
annotated in >= 1/3 of the datasets in |
out.dir |
Character. Determines the output directory where DE results
for each dataset are written to. Defaults to |
alpha |
Statistical significance level. Defaults to 0.05. |
beta |
Absolute log2 fold change cut-off. Defaults to 1 (2-fold). |
DE studies typically report a gene as differentially expressed if the corresponding DE p-value, corrected for multiple testing, satisfies the chosen significance level. Enrichment methods that work directly on the list of DE genes are then substantially influenced by the multiple testing correction.
An example is the frequently used over-representation analysis (ORA), which
assesses the overlap between the DE genes and a gene set under study based
on the hypergeometric distribution (see Appendix A of the
EnrichmentBrowser
vignette for an introduction).
ORA is inapplicable if there are few genes satisfying the significance threshold, or if almost all genes are DE.
Using padj.method="flexible"
accounts for these cases by applying
multiple testing correction in dependence on the degree of differential
expression:
the correction method from Benjamini and Hochberg (BH) is applied if it renders >= 1% and <= 25% of all measured genes as DE,
the p-values are left unadjusted, if the BH correction results in < 1% DE genes, and
the more stringent Bonferroni correction is applied, if the BH correction results in > 25% DE genes.
Note that resulting p-values should not be used for assessing the statistical significance of DE genes within or between datasets. They are solely used to determine which genes are included in the analysis with ORA - where the flexible correction ensures that the fraction of included genes is roughly in the same order of magnitude across datasets.
Alternative stratgies could also be applied - such as taking a constant number of genes for each dataset or excluding ORA methods in general from the assessment.
runDE
returns exp.list
with DE measures annotated to
the rowData
slot of each dataset, writeDE
writes to file,
and plotDEDistribution
plots to a graphics device.
Ludwig Geistlinger <[email protected]>
loadEData
to load a specified expression data compendium.
# reading user-defined expression data from file data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR") edat <- loadEData(data.dir) # differential expression analysis edat <- runDE(edat) # visualization of per-dataset DE distribution plotDEDistribution(edat) # calculating meta fold changes across datasets mfcs <- metaFC(edat, max.na=0) # writing DE results to file out.dir <- tempdir() out.dir <- file.path(out.dir, "de") if(!file.exists(out.dir)) dir.create(out.dir) writeDE(edat, out.dir)
# reading user-defined expression data from file data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR") edat <- loadEData(data.dir) # differential expression analysis edat <- runDE(edat) # visualization of per-dataset DE distribution plotDEDistribution(edat) # calculating meta fold changes across datasets mfcs <- metaFC(edat, max.na=0) # writing DE results to file out.dir <- tempdir() out.dir <- file.path(out.dir, "de") if(!file.exists(out.dir)) dir.create(out.dir) writeDE(edat, out.dir)
This function applies selected methods for enrichment analysis to selected datasets of a compendium.
runEA( exp.list, methods, gs, perm = 1000, parallel = NULL, save2file = FALSE, out.dir = NULL, ... )
runEA( exp.list, methods, gs, perm = 1000, parallel = NULL, save2file = FALSE, out.dir = NULL, ... )
exp.list |
Experiment list. A |
methods |
Methods for enrichment analysis. This can be either
|
gs |
Gene sets, i.e. a list of character vectors of gene IDs. |
perm |
Number of permutations of the sample group assignments.
Defaults to 1000. Can also be an integer vector matching
the length of |
parallel |
Parallel computation mode. An instance of class
|
save2file |
Logical. Should results be saved to file for subsequent
benchmarking? Defaults to |
out.dir |
Character. Determines the output directory where results are
saved to. Defaults to |
... |
Additional arguments passed to the selected enrichment methods. |
A list with an entry for each method applied. Each method entry is a list with an entry for each dataset analyzed. Each dataset entry is a list of length 2, with the first element being the runtime and the second element being the gene set ranking, as obtained from applying the respective method to the respective dataset.
Ludwig Geistlinger <[email protected]>
sbea
and nbea
for carrying out set- and network-based enrichment analysis.
BiocParallelParam
and register
for
configuration of parallel computation.
# loading three datasets from the GEO2KEGG compendium geo2kegg <- loadEData("geo2kegg", nr.datasets=3) # only considering the first 1000 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) # preprocessing and DE analysis for two of the datasets geo2kegg <- maPreproc(geo2kegg[2:3]) geo2kegg <- runDE(geo2kegg) # getting a subset of human KEGG gene sets gs.file <- system.file("extdata/hsa_kegg_gs.gmt", package="EnrichmentBrowser") kegg.gs <- EnrichmentBrowser::getGenesets(gs.file) # applying two methods to two datasets res <- runEA(geo2kegg, methods=c("ora", "camera"), gs=kegg.gs, perm=0) # applying a user-defined enrichment method dummySBEA <- function(se, gs) { sig.ps <- sample(seq(0, 0.05, length=1000), 5) nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5) ps <- sample(c(sig.ps, nsig.ps), length(gs)) names(ps) <- names(gs) return(ps) } res <- runEA(geo2kegg, methods=dummySBEA, gs=kegg.gs) # applying a mix of pre-defined and user-defined methods methods <- list(camera = "camera", dummySBEA = dummySBEA) res <- runEA(geo2kegg, methods, gs=kegg.gs, perm=0)
# loading three datasets from the GEO2KEGG compendium geo2kegg <- loadEData("geo2kegg", nr.datasets=3) # only considering the first 1000 probes for demonstration geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) # preprocessing and DE analysis for two of the datasets geo2kegg <- maPreproc(geo2kegg[2:3]) geo2kegg <- runDE(geo2kegg) # getting a subset of human KEGG gene sets gs.file <- system.file("extdata/hsa_kegg_gs.gmt", package="EnrichmentBrowser") kegg.gs <- EnrichmentBrowser::getGenesets(gs.file) # applying two methods to two datasets res <- runEA(geo2kegg, methods=c("ora", "camera"), gs=kegg.gs, perm=0) # applying a user-defined enrichment method dummySBEA <- function(se, gs) { sig.ps <- sample(seq(0, 0.05, length=1000), 5) nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5) ps <- sample(c(sig.ps, nsig.ps), length(gs)) names(ps) <- names(gs) return(ps) } res <- runEA(geo2kegg, methods=dummySBEA, gs=kegg.gs) # applying a mix of pre-defined and user-defined methods methods <- list(camera = "camera", dummySBEA = dummySBEA) res <- runEA(geo2kegg, methods, gs=kegg.gs, perm=0)