Package 'GSEABenchmarkeR' reference manual

Title:	Reproducible GSEA Benchmarking
Description:	The GSEABenchmarkeR package implements an extendable framework for reproducible evaluation of set- and network-based methods for enrichment analysis of gene expression data. This includes support for the efficient execution of these methods on comprehensive real data compendia (microarray and RNA-seq) using parallel computation on standard workstations and institutional computer grids. Methods can then be assessed with respect to runtime, statistical significance, and relevance of the results for the phenotypes investigated.
Authors:	Ludwig Geistlinger [aut, cre], Gergely Csaba [aut], Mara Santarelli [ctb], Lucas Schiffer [ctb], Marcel Ramos [ctb], Ralf Zimmer [aut], Levi Waldron [aut]
Maintainer:	Ludwig Geistlinger <[email protected]>
License:	Artistic-2.0
Version:	1.27.0
Built:	2025-03-29 06:06:44 UTC
Source:	https://github.com/bioc/GSEABenchmarkeR

Customized boxplot visualization of benchmark results

Description

This is a convenience function to create customized boxplots for specific benchmark criteria such as runtime, statistical significance and phenotype relevance.

Usage

bpPlot(data, what = c("runtime", "sig.sets", "rel.sets", "typeI"))
bpPlot(data, what = c("runtime", "sig.sets", "rel.sets", "typeI"))

Arguments

data

Numeric matrix or list of numeric vectors. In case of a matrix, column names are assumed to be method names and rownames are assumed to be dataset IDs. In case of a list, names are assumed to be method names and each element corresponds to a numeric vector with names assumed to be dataset IDs.

what

Character. Determines how the plot is customized. One of

runtime: displays runtime of methods across datasets,
sig.sets: displays percentage of significant gene sets,
rel.sets: displays phenotype relevance scores,
typeI: displays type I error rates.

Value

None. Plots to a graphics device.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # simulated setup:
    # 3 methods & 5 datasets
    methods <- paste0("m", 1:3)
    data.ids <- paste0("d", 1:5)

    # runtime data
    rt <- vapply(1:3, function(m) runif(5, min = m, max = m+1), numeric(5))
    rownames(rt) <- data.ids 
    colnames(rt) <- methods

    # plot
    bpPlot(rt, what = "runtime")

# simulated setup:
    # 3 methods & 5 datasets
    methods <- paste0("m", 1:3)
    data.ids <- paste0("d", 1:5)

    # runtime data
    rt <- vapply(1:3, function(m) runif(5, min = m, max = m+1), numeric(5))
    rownames(rt) <- data.ids 
    colnames(rt) <- methods

    # plot
    bpPlot(rt, what = "runtime")

Caching of a resource

Description

Convenience function to flexibly save and restore an already processed expression data compendium via caching.

Usage

cacheResource(res, rname, ucdir = "GSEABenchmarkeR")
cacheResource(res, rname, ucdir = "GSEABenchmarkeR")

Arguments

`res`	Resource. An arbitrary R object.
`rname`	Character. Resource name.
`ucdir`	Character. User cache directory. Defaults to 'GSEABenchmarkeR', which will accordingly use `tools::R_user_dir("GSEABenchmarkeR", which = "cache")`.

Value

None. Stores the object in the cache by side effect.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


     # load user-defined expression compendium
     data.dir <- system.file("extdata/myEData", package = "GSEABenchmarkeR")
     edat <- loadEData(data.dir)

     # do some processing of the compendium
     edat <- lapply(edat, function(d) d[1:50,])

     # cache it ...
     cacheResource(edat, "myEData")

     # ... and restore it at a later time
     edat <- loadEData(data.dir, cache = TRUE)    

# load user-defined expression compendium
     data.dir <- system.file("extdata/myEData", package = "GSEABenchmarkeR")
     edat <- loadEData(data.dir)

     # do some processing of the compendium
     edat <- lapply(edat, function(d) d[1:50,])

     # cache it ...
     cacheResource(edat, "myEData")

     # ... and restore it at a later time
     edat <- loadEData(data.dir, cache = TRUE)

Evaluating gene set rankings for the number of (significant) sets

Description

These functions evaluate gene set rankings obtained from applying enrichment methods to multiple datasets. This allows to assess resulting rankings for granularity (how many gene sets have a unique p-value?) and statistical significance (how many gene sets have a p-value below a significance threshold?).

Usage

evalNrSigSets(ea.ranks, alpha = 0.05, padj = "none", perc = TRUE)

evalNrSets(ea.ranks, uniq.pval = TRUE, perc = TRUE)
evalNrSigSets(ea.ranks, alpha = 0.05, padj = "none", perc = TRUE)

evalNrSets(ea.ranks, uniq.pval = TRUE, perc = TRUE)

Arguments

`ea.ranks`	Enrichment analysis rankings. A list with an entry for each enrichment method applied. Each entry is a list that stores for each dataset analyzed the resulting gene set ranking as obtained from applying the respective method to the respective dataset.
`alpha`	Statistical significance level. Defaults to 0.05.
`padj`	Character. Method for adjusting p-values to multiple testing. For available methods see the man page of the stats function `p.adjust`. Defaults to `"none"`.
`perc`	Logical. Should the percentage or absolute number of gene sets be returned? Percentage is typically more useful for comparison between rankings with a potentially different total number of gene sets. Defaults to `TRUE`.
`uniq.pval`	Logical. Should the number of gene sets with a unique p-value or the total number of gene sets per ranking be returned? Defaults to `TRUE`.

Value

A list of numeric vectors storing for each method the number of (significant) gene sets for each dataset analyzed. If each element of the resulting list is of equal length (corresponds to successful application of each enrichment method to each dataset), the list is automatically simplified to a numeric matrix (rows = datasets, columns = methods).

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # simulated setup:
    # 2 methods & 2 datasets
    methods <- paste0("m", 1:2)
    data.ids <- paste0("d", 1:2)

    # simulate gene set rankings
    getRankingForDataset <- function(d)
    {
         r <- EnrichmentBrowser::makeExampleData("ea.res") 
         EnrichmentBrowser::gsRanking(r, signif.only=FALSE)
    }
  
    getRankingsForMethod <- function(m)
    {
         rs <- lapply(data.ids, getRankingForDataset)
         names(rs) <- data.ids
         rs
    }
        
    ea.ranks <- lapply(methods, getRankingsForMethod) 
    names(ea.ranks) <- methods 

    # evaluate
    evalNrSets(ea.ranks)
    evalNrSigSets(ea.ranks)

# simulated setup:
    # 2 methods & 2 datasets
    methods <- paste0("m", 1:2)
    data.ids <- paste0("d", 1:2)

    # simulate gene set rankings
    getRankingForDataset <- function(d)
    {
         r <- EnrichmentBrowser::makeExampleData("ea.res") 
         EnrichmentBrowser::gsRanking(r, signif.only=FALSE)
    }
  
    getRankingsForMethod <- function(m)
    {
         rs <- lapply(data.ids, getRankingForDataset)
         names(rs) <- data.ids
         rs
    }
        
    ea.ranks <- lapply(methods, getRankingsForMethod) 
    names(ea.ranks) <- methods 

    # evaluate
    evalNrSets(ea.ranks)
    evalNrSigSets(ea.ranks)

Evaluation of enrichment methods on random gene sets

Description

This function evaluates the proportion of rejected null hypotheses (= the fraction of significant gene sets) of an enrichment method when applied to random gene sets of defined size.

Usage

evalRandomGS(
  method,
  se,
  nr.gs = 100,
  set.size = 5,
  alpha = 0.05,
  padj = "none",
  perc = TRUE,
  reps = 100,
  rep.block.size = -1,
  summarize = TRUE,
  save2file = FALSE,
  out.dir = NULL,
  ...
)
evalRandomGS(
  method,
  se,
  nr.gs = 100,
  set.size = 5,
  alpha = 0.05,
  padj = "none",
  perc = TRUE,
  reps = 100,
  rep.block.size = -1,
  summarize = TRUE,
  save2file = FALSE,
  out.dir = NULL,
  ...
)

Arguments

`method`	Enrichment analysis method. A character scalar chosen from `sbeaMethods` and `nbeaMethods`, or a user-defined function implementing a method for enrichment analysis.
`se`	An expression dataset of class `SummarizedExperiment`.
`nr.gs`	Integer. Number of random gene sets. Defaults to 100.
`set.size`	Integer. Gene set size, i.e. number of genes in each random gene set.
`alpha`	Numeric. Statistical significance level. Defaults to 0.05.
`padj`	Character. Method for adjusting p-values to multiple testing. For available methods see the man page of the stats function `p.adjust`. Defaults to `"none"`.
`perc`	Logical. Should the percentage (between 0 and 100, default) or the proportion (between 0 and 1) of significant gene sets be returned?
`reps`	Integer. Number of replications. Defaults to 100.
`rep.block.size`	Integer. When running in parallel, splits `reps` into blocks of the indicated size. Defaults to -1, which indicates to not partition `reps`.
`summarize`	Logical. If `TRUE` (default) returns the mean (`mean`) and the standard deviation (`sd`) of the proportion of significant gene sets across `reps` replications. Use `FALSE` to return the full vector storing the proportion of significant gene sets for each replication.
`save2file`	Logical. Should results be saved to file for subsequent benchmarking? Defaults to `FALSE`.
`out.dir`	Character. Determines the output directory where results are saved to. Defaults to `NULL`, which then writes to `tools::R_user_dir("GSEABenchmarkeR")` in case `save2file` is set to `TRUE`.
`...`	Additional arguments passed to the selected enrichment method.

Value

A named numeric vector of length 2 storing mean and standard deviation of the proportion of significant gene sets across reps replications (summarize=TRUE); or a numeric vector of length reps storing the the proportion of significant gene sets for each replication itself (summarize=FALSE).

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # loading two datasets from the GEO2KEGG compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets = 2)

    # only considering the first 1000 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) 

    # preprocessing and DE analysis for two of the datasets
    geo2kegg <- maPreproc(geo2kegg)
    geo2kegg <- runDE(geo2kegg)

    evalRandomGS("camera", geo2kegg[[1]], reps = 3)
    

# loading two datasets from the GEO2KEGG compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets = 2)

    # only considering the first 1000 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) 

    # preprocessing and DE analysis for two of the datasets
    geo2kegg <- maPreproc(geo2kegg)
    geo2kegg <- runDE(geo2kegg)

    evalRandomGS("camera", geo2kegg[[1]], reps = 3)

Evaluating phenotype relevance of gene set rankings

Description

This function evaluates gene set rankings obtained from the application of enrichment methods to multiple datasets - where each dataset investigates a certain phenotype such as a disease. Given pre-defined phenotype relevance scores for the gene sets, indicating how important a gene set is for the investigated phenotype (as e.g. judged by evidence from the literature), this allows to assess whether enrichment methods produce gene set rankings in which phenotype-relevant gene sets accumulate at the top.

Usage

evalRelevance(
  ea.ranks,
  rel.ranks,
  data2pheno,
  method = "wsum",
  top = 0,
  rel.thresh = 0,
  ...
)

compOpt(rel.ranks, gs.ids, data2pheno = NULL, top = 0)

compRand(rel.ranks, gs.ids, data2pheno = NULL, perm = 1000)
evalRelevance(
  ea.ranks,
  rel.ranks,
  data2pheno,
  method = "wsum",
  top = 0,
  rel.thresh = 0,
  ...
)

compOpt(rel.ranks, gs.ids, data2pheno = NULL, top = 0)

compRand(rel.ranks, gs.ids, data2pheno = NULL, perm = 1000)

Arguments

`ea.ranks`	Enrichment analysis rankings. A list with an entry for each enrichment method applied. Each entry is a list that stores for each dataset analyzed the resulting gene set ranking, obtained from applying the respective method to the respective dataset. Resulting gene set rankings are assumed to be of class `DataFrame` in which gene sets (required column named `GENE.SET`) are ranked according to a ranking measure such as a gene set p-value (required column named `PVAL`). See `gsRanking` for an example.
`rel.ranks`	Relevance score rankings. A list with an entry for each phenotype investigated. Each entry should be a `DataFrame` in which gene sets (rownames are assumed to be gene set IDs) are ranked according to a phenotype relevance score (required column `REL.SCORE`).
`data2pheno`	A named character vector where the names correspond to dataset IDs and the elements of the vector to the corresponding phenotypes investigated.
`method`	Character. Determines how the relevance score is summarized across the enrichment analysis ranking. Choose `"wsum"` (default) to compute a weighted sum of the relevance scores, `"auc"` to perform a ROC/AUC analysis, or `"cor"` to compute a correlation. This can also be a user-defined function for customized behaviors. See Details.
`top`	Integer. If `top` is non-zero, the evaluation will be restricted to the first `top` gene sets of each enrichment analysis ranking. Defaults to `0`, which will then evaluate the full ranking. If used with `method="auc"`, it defines the number of gene sets at the top of the relevance ranking that are considered relevant (true positives).
`rel.thresh`	Numeric. Relevance score threshold. Restricts relevance score rankings (argument `rel.ranks`) to gene sets exceeding the threshold in the `REL.SCORE` column.
`...`	Additional arguments for computation of the relevance measure as defined by the `method` argument. This includes for `method="wsum"`: perc: Logical. Should observed scores be returned as-is or as a percentage of the respective optimal score. Percentages of the optimal score are typically easier to interpret and are comparable between datasets / phenotypes. Defaults to `TRUE`. rand: Logical. Should gene set rankings be randomized to assess how likely it is to observe a score equal or greater than the respective obtained score? Defaults to `FALSE`.
`gs.ids`	Character vector of gene set IDs on which enrichment analysis has been carried out.
`perm`	Integer. Number of permutations if `rand` set to `TRUE`.

Details

The function evalRelevance evaluates the similarity of a gene set ranking obtained from enrichment analysis and a gene set ranking based on phenotype relevance scores. Therefore, the function first transforms the ranks 'r' from the enrichment analysis to weights 'w' in [0,1] via w = 1 - r / N; where 'N' denotes the total number of gene sets on which the enrichment analysis has been carried out. These weights are then multiplied with the corresponding relevance scores and summed up.

The function compOpt applies evalRelevance to the theoretically optimal case in which the enrichment analysis ranking is identical to the relevance score ranking. The ratio between observed and optimal score is useful for comparing observed scores between datasets / phenotypes.

The function compRand repeatedly applies evalRelevance to random rankings obtained from placing the gene sets randomly along the ranking, thereby assessing how likely it is to observe a score equal or greater than the one obtained.

It is also possible to inspect other measures for summarizing the phenotype relevance, instead of calculating weighted relevance scores sums (argument method="wsum", default). One possibility is to treat the comparison of the EA ranking and the relevance ranking as a classification problem, and to compute standard classification performance measures such as the area under the ROC curve (method="auc"). However, this requires to divide the relevance rankings (argument rel.ranks) into relevant (true positives) and irrelevant (true negatives) gene sets using the top argument. Instead of method="auc", this can also be any other performance measure that the ROCR package (https://rocr.bioinf.mpi-sb.mpg.de) implements. For example, method="tnr" for calculation of the true negative rate. Although such classification performance measures are easy to interpret, the weighted sum has certain preferable properties such as avoiding thresholding and accounting for varying degrees of relevance in the relevance rankings.

It is also possible to compute a standard rank-based correlation measure such as Spearman's correlation (method="cor") to compare the similarity of the enrichment analysis rankings and the relevance rankings. However, this might not be optimal for a comparison of an EA ranking going over the full gene set vector against the typically much smaller vector of gene sets for which a relevance score is annotated. For this scenario, using rank correlation reduces the question to "does a subset of the EA ranking preserve the order of the relevance ranking"; although our question of interest is rather "is a subset of the relevant gene sets ranked highly in the EA ranking".

Value

A numeric matrix (rows = datasets, columns = methods) storing in each cell the chosen relevance measure (score, AUC, cor) obtained from applying the respective enrichment method to the respective expression dataset.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    #
    # (1) simulated setup: 1 enrichment method applied to 1 dataset
    #  

    # simulate gene set ranking
    ea.ranks <- EnrichmentBrowser::makeExampleData("ea.res")
    ea.ranks <- EnrichmentBrowser::gsRanking(ea.ranks, signif.only=FALSE)

    # simulated relevance score ranking
    rel.ranks <- ea.ranks
    rel.ranks[,2] <- runif(nrow(ea.ranks), min=1, max=100)
    colnames(rel.ranks)[2] <- "REL.SCORE"
    rownames(rel.ranks) <- rel.ranks[,"GENE.SET"]
    ind <- order(rel.ranks[,"REL.SCORE"], decreasing=TRUE)
    rel.ranks <- rel.ranks[ind,]

    # evaluate
    evalRelevance(ea.ranks, rel.ranks)    
    compOpt(rel.ranks, ea.ranks[,"GENE.SET"])
    compRand(rel.ranks, ea.ranks[,"GENE.SET"], perm=3)

    # 
    # (2) simulated setup: 2 methods & 2 datasets
    #
       
    methods <- paste0("m", 1:2)
    data.ids <- paste0("d", 1:2)

    # simulate gene set rankings
    ea.ranks <- sapply(methods, function(m) 
            sapply(data.ids, 
                function(d)
                {
                    r <- EnrichmentBrowser::makeExampleData("ea.res") 
                    r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE)
                    return(r)
                }, simplify=FALSE),
                simplify=FALSE)

    # simulate a mapping from datasets to disease codes
    d2d <- c("ALZ", "BRCA")
    names(d2d) <- data.ids

    # simulate relevance score rankings
    rel.ranks <- lapply(ea.ranks[[1]],
        function(rr)
        {
            rr[,2] <- runif(nrow(rr), min=1, max=100)
            colnames(rr)[2] <- "REL.SCORE"
            rownames(rr) <- rr[,"GENE.SET"]
            ind <- order(rr[,"REL.SCORE"], decreasing=TRUE)
            rr <- rr[ind,]
            return(rr)
        })
    names(rel.ranks) <- unname(d2d)

    # evaluate
    evalRelevance(ea.ranks, rel.ranks, d2d)

#
    # (1) simulated setup: 1 enrichment method applied to 1 dataset
    #  

    # simulate gene set ranking
    ea.ranks <- EnrichmentBrowser::makeExampleData("ea.res")
    ea.ranks <- EnrichmentBrowser::gsRanking(ea.ranks, signif.only=FALSE)

    # simulated relevance score ranking
    rel.ranks <- ea.ranks
    rel.ranks[,2] <- runif(nrow(ea.ranks), min=1, max=100)
    colnames(rel.ranks)[2] <- "REL.SCORE"
    rownames(rel.ranks) <- rel.ranks[,"GENE.SET"]
    ind <- order(rel.ranks[,"REL.SCORE"], decreasing=TRUE)
    rel.ranks <- rel.ranks[ind,]

    # evaluate
    evalRelevance(ea.ranks, rel.ranks)    
    compOpt(rel.ranks, ea.ranks[,"GENE.SET"])
    compRand(rel.ranks, ea.ranks[,"GENE.SET"], perm=3)

    # 
    # (2) simulated setup: 2 methods & 2 datasets
    #
       
    methods <- paste0("m", 1:2)
    data.ids <- paste0("d", 1:2)

    # simulate gene set rankings
    ea.ranks <- sapply(methods, function(m) 
            sapply(data.ids, 
                function(d)
                {
                    r <- EnrichmentBrowser::makeExampleData("ea.res") 
                    r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE)
                    return(r)
                }, simplify=FALSE),
                simplify=FALSE)

    # simulate a mapping from datasets to disease codes
    d2d <- c("ALZ", "BRCA")
    names(d2d) <- data.ids

    # simulate relevance score rankings
    rel.ranks <- lapply(ea.ranks[[1]],
        function(rr)
        {
            rr[,2] <- runif(nrow(rr), min=1, max=100)
            colnames(rr)[2] <- "REL.SCORE"
            rownames(rr) <- rr[,"GENE.SET"]
            ind <- order(rr[,"REL.SCORE"], decreasing=TRUE)
            rr <- rr[ind,]
            return(rr)
        })
    names(rel.ranks) <- unname(d2d)

    # evaluate
    evalRelevance(ea.ranks, rel.ranks, d2d)

Evaluation of the type I error rate of enrichment methods

Description

This function evaluates the type I error rate of selected methods for enrichment analysis when applied to one or more expression datasets.

Usage

evalTypeIError(
  methods,
  exp.list,
  gs,
  alpha = 0.05,
  ea.perm = 1000,
  tI.perm = 1000,
  perm.block.size = -1,
  summarize = TRUE,
  save2file = FALSE,
  out.dir = NULL,
  verbose = TRUE,
  ...
)
evalTypeIError(
  methods,
  exp.list,
  gs,
  alpha = 0.05,
  ea.perm = 1000,
  tI.perm = 1000,
  perm.block.size = -1,
  summarize = TRUE,
  save2file = FALSE,
  out.dir = NULL,
  verbose = TRUE,
  ...
)

Arguments

`methods`	Methods for enrichment analysis. This can be either a character vector with method names chosen from `sbeaMethods` and `nbeaMethods`, a user-defined function implementing a method for enrichment analysis, or a named list, containing pre-defined and/or user-defined enrichment methods. See examples.
`exp.list`	Experiment list. A `list` of datasets, each being of class `SummarizedExperiment`.
`gs`	Gene sets, i.e. a list of character vectors of gene IDs.
`alpha`	Numeric. Statistical significance level. Defaults to 0.05.
`ea.perm`	Integer. Number of permutations of the sample group assignments during enrichment analysis. Defaults to 1000. Can also be an integer vector matching the length of 'methods' to assign different numbers of permutations for different methods.
`tI.perm`	Integer. Number of permutations of the sample group assignments during type I error rate evaluation. Defaults to 1000. Can also be an integer vector matching the length of `methods` to assign different numbers of permutations for different methods.
`perm.block.size`	Integer. When running in parallel, splits `tI.perm` into blocks of the indicated size. Defaults to -1, which indicates to not partition `tI.perm`.
`summarize`	Logical. If `TRUE` (default) applies `summary` to the vector of type I error rates across `tI.perm` permutations of the sample labels. Use `FALSE` to return the full vector of type I error rates.
`save2file`	Logical. Should results be saved to file for subsequent benchmarking? Defaults to `FALSE`.
`out.dir`	Character. Determines the output directory where results are saved to. Defaults to `NULL`, which then writes to `tools::R_user_dir("GSEABenchmarkeR")` in case `save2file` is set to `TRUE`.
`verbose`	Logical. Should progress be reported? Defaults to `TRUE`.
`...`	Additional arguments passed to the selected enrichment methods.

Value

A list with an entry for each method applied. Each method entry is a list with an entry for each dataset analyzed. Each dataset entry is either a summary (summarize=TRUE) or the full vector of type I error rates (summarize=FALSE) across tI.perm permutations of the sample labels.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # loading three datasets from the GEO2KEGG compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=3)

    # only considering the first 1000 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) 

    # preprocessing and DE analysis for two of the datasets
    geo2kegg <- maPreproc(geo2kegg[2:3])
    geo2kegg <- runDE(geo2kegg)

    # getting a subset of human KEGG gene sets
    gs.file <- system.file("extdata", package="EnrichmentBrowser")
    gs.file <- file.path(gs.file, "hsa_kegg_gs.gmt") 
    kegg.gs <- EnrichmentBrowser::getGenesets(gs.file)

    # evaluating type I error rate of two methods on two datasets
    # NOTE: using a small number of permutations for demonstration;
    #       for a meaningful evaluation tI.perm should be >= 1000   
    res <- evalTypeIError(geo2kegg, methods=c("ora", 
             "camera"), gs=kegg.gs, ea.perm=0, tI.perm=3)

    # applying a user-defined enrichment method ...
    # ... or a mix of pre-defined and user-defined methods
    dummySBEA <- function(se, gs) 
    {
         sig.ps <- sample(seq(0, 0.05, length=1000), 5)
         nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5)
         ps <- sample(c(sig.ps, nsig.ps), length(gs))
         names(ps) <- names(gs)
         return(ps)
    }  

    methods <- list(camera = "camera", dummySBEA = dummySBEA)
    res <- evalTypeIError(methods, geo2kegg, gs=kegg.gs, tI.perm=3)   

# loading three datasets from the GEO2KEGG compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=3)

    # only considering the first 1000 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) 

    # preprocessing and DE analysis for two of the datasets
    geo2kegg <- maPreproc(geo2kegg[2:3])
    geo2kegg <- runDE(geo2kegg)

    # getting a subset of human KEGG gene sets
    gs.file <- system.file("extdata", package="EnrichmentBrowser")
    gs.file <- file.path(gs.file, "hsa_kegg_gs.gmt") 
    kegg.gs <- EnrichmentBrowser::getGenesets(gs.file)

    # evaluating type I error rate of two methods on two datasets
    # NOTE: using a small number of permutations for demonstration;
    #       for a meaningful evaluation tI.perm should be >= 1000   
    res <- evalTypeIError(geo2kegg, methods=c("ora", 
             "camera"), gs=kegg.gs, ea.perm=0, tI.perm=3)

    # applying a user-defined enrichment method ...
    # ... or a mix of pre-defined and user-defined methods
    dummySBEA <- function(se, gs) 
    {
         sig.ps <- sample(seq(0, 0.05, length=1000), 5)
         nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5)
         ps <- sample(c(sig.ps, nsig.ps), length(gs))
         names(ps) <- names(gs)
         return(ps)
    }  

    methods <- list(camera = "camera", dummySBEA = dummySBEA)
    res <- evalTypeIError(methods, geo2kegg, gs=kegg.gs, tI.perm=3)

Loading pre-defined and user-defined expression data

Description

This function implements a general interface for loading the pre-defined GEO2KEGG microarray compendium and the TCGA RNA-seq compendium. It also allows loading of user-defined data from file.

Usage

loadEData(edata, nr.datasets = NULL, cache = TRUE, ...)
loadEData(edata, nr.datasets = NULL, cache = TRUE, ...)

Arguments

`edata`	Expression data compendium. A character vector of length 1 that must be either 'geo2kegg': to load the GEO2KEGG microarray compendium, 'tcga': to load the TCGA RNA-seq compendium, or an absolute file path pointing to a directory, in which a user-defined compendium has been saved in RDS files. See details.
`nr.datasets`	Integer. Number of datasets that should be loaded from the compendium. This is mainly for demonstration purposes.
`cache`	Logical. Should an already cached version used if available? Defaults to `TRUE`.
`...`	Additional arguments passed to the internal loading routines of the GEO2KEGG and TCGA compendia. This currently includes for loading of the GEO2KEGG compendium `preproc`: logical. Should probe level data automatically be summarized to gene level data? Defaults to `FALSE`. `de.only`: logical. Include only datasets in which differentially expressed genes have been found? Defaults to `FALSE`. `excl.metac`: logical. Exclude datasets for which MetaCore rather than KEGG pathways have been assigned as target pathways? Defaults to `FALSE`. And for loading of the TCGA compendium `mode`: character, determines how TCGA RNA-seq datasets are obtained. To obtain raw read counts from GSE62944 use either `'ehub'` (default, via ExperimentHub) or `'geo'` (direct download from GEO, slow). Alternatively, use `'cTD'` to obtain normalized log2 TPM values from curatedTCGAData. `data.dir`: character. Absolute file path indicating where processed RDS files for each dataset are written to. Defaults to `NULL`, which will then write to `tools::R_user_dir("GSEABenchmarkeR")`. `min.ctrls`: integer. Minimum number of controls, i.e. adjacent normal samples, for a cancer type to be included. Defaults to 9. `paired`: Logical. Should the pairing of samples (tumor and adjacent normal) be taken into account? Defaults to `TRUE`, which reduces the data for each cancer type to patients for which both sample types (tumor and adjacent normal) are available. Use `FALSE` to obtain all samples in an unpaired manner. `min.cpm`: integer. Minimum counts-per-million reads mapped. See the edgeR vignette for details. The default filter is to exclude genes with cpm < 2 in more than half of the samples. `with.clin.vars`: logical. Should clinical variables (>500) be kept to allow for more advanced sample groupings in addition to the default binary grouping (tumor vs. normal)? `map2entrez`: Should human gene symbols be automatically mapped to Entrez Gene IDs? Defaults to `TRUE`.

Details

The pre-defined GEO2KEGG microarray compendium consists of 42 datasets investigating a total of 19 different human diseases as collected by Tarca et al. (2012 and 2013).

The pre-defined TCGA RNA-seq compendium consists of datasets from The Cancer Genome Atlas (TCGA, 2013) investigating a total of 34 different cancer types.

User-defined data can also be loaded, given that datasets, preferably of class SummarizedExperiment, have been saved as RDS files.

Value

A list of datasets, typically of class SummarizedExperiment.

Note that loadEData("geo2kegg", preproc = FALSE) (the default) returns the original microarray probe level data as a list of ExpressionSet objects. Use preproc = TRUE or the maPreproc function to summarize the probe level data to gene level data and to obtain a list of SummarizedExperiment objects.

Author(s)

Ludwig Geistlinger <[email protected]>

References

Tarca et al. (2012) Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13:136.

Tarca et al. (2013) A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11):e79217.

The Cancer Genome Atlas Research Network (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 45(10):1113-20.

Rahman et al. (2015) Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics, 31(22):3666-72.

Examples


    # (1) Loading the GEO2KEGG microarray compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=2)

    # (2) Loading the TCGA RNA-seq compendium
    tcga <- loadEData("tcga", nr.datasets=2)

    # (3) reading user-defined expression data from file
    data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR")
    edat <- loadEData(data.dir)

# (1) Loading the GEO2KEGG microarray compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=2)

    # (2) Loading the TCGA RNA-seq compendium
    tcga <- loadEData("tcga", nr.datasets=2)

    # (3) reading user-defined expression data from file
    data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR")
    edat <- loadEData(data.dir)

Preprocessing of microarray expression data

Description

This function prepares datasets of the GEO2KEGG microarray compendium for further analysis. This includes summarization of probe level expression to gene level expression as well as annotation of required colData slots for sample grouping.

Usage

maPreproc(exp.list, parallel = NULL)
maPreproc(exp.list, parallel = NULL)

Arguments

`exp.list`	Experiment list. A `list` of datasets, each being of class `ExpressionSet`.
`parallel`	Parallel computation mode. An instance of class `BiocParallelParam`. See the vignette of the `BiocParallel` package for switching between serial, multi-core, and grid execution. Defaults to `NULL`, which then uses the first element of `BiocParallel::registered()` for execution. If not changed by the user, this accordingly defaults to multi-core execution on the local host.

Value

A list of datasets, each being of class SummarizedExperiment.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # reading user-defined expression data from file
    geo2kegg <- loadEData("geo2kegg", nr.datasets=3)

    # only considering the first 100 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:100,])   

    # preprocessing two datasets
    geo2kegg <- maPreproc(geo2kegg[2:3])

# reading user-defined expression data from file
    geo2kegg <- loadEData("geo2kegg", nr.datasets=3)

    # only considering the first 100 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:100,])   

    # preprocessing two datasets
    geo2kegg <- maPreproc(geo2kegg[2:3])

Read a mapping between dataset ID and disease code

Description

When assessing enrichment analysis results for phenotype relevance, it is assumed that each analyzed dataset investigates a certain phenotype such as a disease. This function reads a mapping between dataset IDs and assigned disease codes.

Usage

readDataId2diseaseCodeMap(map.file)
readDataId2diseaseCodeMap(map.file)

Arguments

map.file

Character. The path to the mapping file.

Value

A named character vector where each element of the vector is a disease code and the names are the dataset IDs.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    data.dir <- system.file("extdata", package="GSEABenchmarkeR")
    d2d.file <- file.path(data.dir, "malacards", "GseId2Disease.txt")
    d2d.map <- readDataId2diseaseCodeMap(d2d.file)

data.dir <- system.file("extdata", package="GSEABenchmarkeR")
    d2d.file <- file.path(data.dir, "malacards", "GseId2Disease.txt")
    d2d.map <- readDataId2diseaseCodeMap(d2d.file)

Reading results of enrichment analysis

Description

These functions read results obtained from the application of enrichment methods to multiple datasets for subsequent assessment.

Usage

readResults(
  data.dir,
  data.ids,
  methods,
  type = c("runtime", "ranking", "typeI")
)
readResults(
  data.dir,
  data.ids,
  methods,
  type = c("runtime", "ranking", "typeI")
)

Arguments

`data.dir`	Character. The data directory where results have been saved to.
`data.ids`	A character vector of dataset IDs.
`methods`	Methods for enrichment analysis. A character vector with method names typically chosen from `sbeaMethods` and `nbeaMethods`, or user-defined functions implementing methods for enrichment analysis.
`type`	Character. Type of the result. Should be one out of 'runtime', 'ranking', or 'typeI'.

Value

A result list with an entry for each method applied. Each entry stores corresponding runtimes (type="runtime" ), gene set rankings (type="ranking"), or type I error rates (type="typeI") as obtained from applying the respective method to the given datasets.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # simulated setup: 
    # 1 methods & 1 datasets
    methods <- paste0("m", 1:2)
    data.ids <- paste0("d", 1:2)

    # result directory
    res.dir <- tempdir()
    sdirs <- file.path(res.dir, methods)
    for(d in sdirs) dir.create(d)
    
    # store runtime & rankings 
    for(m in 1:2)
    {
        rt <- runif(5, min=m, max=m+1)
        for(d in 1:2)
        {
            # runtime
            out.file <- paste(data.ids[d], "txt", sep=".")
            out.file <- file.path(sdirs[m], out.file)
            cat(rt[d], file=out.file) 

            # ranking
            out.file <- sub("txt$", "rds", out.file)
            r <- EnrichmentBrowser::makeExampleData("ea.res") 
            r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE)
            saveRDS(r, file=out.file)   
        }
    }

    # reading runtime & rankings
    rts <- readResults(res.dir, data.ids, methods, type="runtime")
    rkgs <- readResults(res.dir, data.ids, methods, type="ranking")

# simulated setup: 
    # 1 methods & 1 datasets
    methods <- paste0("m", 1:2)
    data.ids <- paste0("d", 1:2)

    # result directory
    res.dir <- tempdir()
    sdirs <- file.path(res.dir, methods)
    for(d in sdirs) dir.create(d)
    
    # store runtime & rankings 
    for(m in 1:2)
    {
        rt <- runif(5, min=m, max=m+1)
        for(d in 1:2)
        {
            # runtime
            out.file <- paste(data.ids[d], "txt", sep=".")
            out.file <- file.path(sdirs[m], out.file)
            cat(rt[d], file=out.file) 

            # ranking
            out.file <- sub("txt$", "rds", out.file)
            r <- EnrichmentBrowser::makeExampleData("ea.res") 
            r <- EnrichmentBrowser::gsRanking(r, signif.only=FALSE)
            saveRDS(r, file=out.file)   
        }
    }

    # reading runtime & rankings
    rts <- readResults(res.dir, data.ids, methods, type="runtime")
    rkgs <- readResults(res.dir, data.ids, methods, type="ranking")

Differential expression analysis for datasets of a compendium

Description

This function applies selected methods for differential expression (DE) analysis to selected datasets of an expression data compendium.

Usage

runDE(
  exp.list,
  de.method = c("limma", "edgeR", "DESeq2"),
  padj.method = "flexible",
  parallel = NULL,
  ...
)

metaFC(exp.list, max.na = round(length(exp.list)/3))

writeDE(exp.list, out.dir = NULL)

plotDEDistribution(exp.list, alpha = 0.05, beta = 1)

plotNrSamples(exp.list)
runDE(
  exp.list,
  de.method = c("limma", "edgeR", "DESeq2"),
  padj.method = "flexible",
  parallel = NULL,
  ...
)

metaFC(exp.list, max.na = round(length(exp.list)/3))

writeDE(exp.list, out.dir = NULL)

plotDEDistribution(exp.list, alpha = 0.05, beta = 1)

plotNrSamples(exp.list)

Arguments

`exp.list`	Experiment list. A `list` of datasets, each being of class `SummarizedExperiment`.
`de.method`	Differential expression method. See documentation of `deAna`.
`padj.method`	Method for adjusting p-values to multiple testing. For available methods see the man page of the stats function `p.adjust`. Defaults to 'flexible', which applies a dataset-specific correction strategy. See details.
`parallel`	Parallel computation mode. An instance of class `BiocParallelParam`. See the vignette of the `BiocParallel` package for switching between serial, multi-core, and grid execution. Defaults to `NULL`, which then uses the first element of `BiocParallel::registered()` for execution. If not changed by the user, this accordingly defaults to multi-core execution on the local host.
`...`	Additional arguments passed to `EnrichmentBrowser::deAna`.
`max.na`	Integer. Determines for which genes a meta fold change is computed. Per default, excludes genes for which the fold change is not annotated in >= 1/3 of the datasets in `exp.list`.
`out.dir`	Character. Determines the output directory where DE results for each dataset are written to. Defaults to `NULL`, which then writes to a subdir named 'de' in `tools::R_user_dir("GSEABenchmarkeR")`.
`alpha`	Statistical significance level. Defaults to 0.05.
`beta`	Absolute log2 fold change cut-off. Defaults to 1 (2-fold).

Details

DE studies typically report a gene as differentially expressed if the corresponding DE p-value, corrected for multiple testing, satisfies the chosen significance level. Enrichment methods that work directly on the list of DE genes are then substantially influenced by the multiple testing correction.

An example is the frequently used over-representation analysis (ORA), which assesses the overlap between the DE genes and a gene set under study based on the hypergeometric distribution (see Appendix A of the EnrichmentBrowser vignette for an introduction).

ORA is inapplicable if there are few genes satisfying the significance threshold, or if almost all genes are DE.

Using padj.method="flexible" accounts for these cases by applying multiple testing correction in dependence on the degree of differential expression:

the correction method from Benjamini and Hochberg (BH) is applied if it renders >= 1% and <= 25% of all measured genes as DE,
the p-values are left unadjusted, if the BH correction results in < 1% DE genes, and
the more stringent Bonferroni correction is applied, if the BH correction results in > 25% DE genes.

Note that resulting p-values should not be used for assessing the statistical significance of DE genes within or between datasets. They are solely used to determine which genes are included in the analysis with ORA - where the flexible correction ensures that the fraction of included genes is roughly in the same order of magnitude across datasets.

Alternative stratgies could also be applied - such as taking a constant number of genes for each dataset or excluding ORA methods in general from the assessment.

Value

runDE returns exp.list with DE measures annotated to the rowData slot of each dataset, writeDE writes to file, and plotDEDistribution plots to a graphics device.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # reading user-defined expression data from file
    data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR")
    edat <- loadEData(data.dir)

    # differential expression analysis
    edat <- runDE(edat)

    # visualization of per-dataset DE distribution
    plotDEDistribution(edat)

    # calculating meta fold changes across datasets 
    mfcs <- metaFC(edat, max.na=0) 

    # writing DE results to file
    out.dir <- tempdir()
    out.dir <- file.path(out.dir, "de")
    if(!file.exists(out.dir)) dir.create(out.dir)
 
    writeDE(edat, out.dir)    

# reading user-defined expression data from file
    data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR")
    edat <- loadEData(data.dir)

    # differential expression analysis
    edat <- runDE(edat)

    # visualization of per-dataset DE distribution
    plotDEDistribution(edat)

    # calculating meta fold changes across datasets 
    mfcs <- metaFC(edat, max.na=0) 

    # writing DE results to file
    out.dir <- tempdir()
    out.dir <- file.path(out.dir, "de")
    if(!file.exists(out.dir)) dir.create(out.dir)
 
    writeDE(edat, out.dir)

Application of enrichment methods to multiple datasets

Description

This function applies selected methods for enrichment analysis to selected datasets of a compendium.

Usage

runEA(
  exp.list,
  methods,
  gs,
  perm = 1000,
  parallel = NULL,
  save2file = FALSE,
  out.dir = NULL,
  ...
)
runEA(
  exp.list,
  methods,
  gs,
  perm = 1000,
  parallel = NULL,
  save2file = FALSE,
  out.dir = NULL,
  ...
)

Arguments

`exp.list`	Experiment list. A `list` of datasets, each being of class `SummarizedExperiment`. In case of just one dataset a single `SummarizedExperiment` is also allowed. See the documentation of `sbea` for required minimal annotations.
`methods`	Methods for enrichment analysis. This can be either a character vector with method names chosen from `sbeaMethods` and `nbeaMethods`, a user-defined function implementing a method for enrichment analysis, or a named list, containing pre-defined and/or user-defined enrichment methods. See examples.
`gs`	Gene sets, i.e. a list of character vectors of gene IDs.
`perm`	Number of permutations of the sample group assignments. Defaults to 1000. Can also be an integer vector matching the length of `methods` to assign different numbers of permutations for different methods.
`parallel`	Parallel computation mode. An instance of class `BiocParallelParam`. See the vignette of the `BiocParallel` package for switching between serial, multi-core, and grid execution. Defaults to `NULL`, which then uses the first element of `BiocParallel::registered()` for execution. If not changed by the user, this accordingly defaults to multi-core execution on the local host.
`save2file`	Logical. Should results be saved to file for subsequent benchmarking? Defaults to `FALSE`.
`out.dir`	Character. Determines the output directory where results are saved to. Defaults to `NULL`, which then writes to `tools::R_user_dir("GSEABenchmarkeR")` in case `save2file` is set to `TRUE`.
`...`	Additional arguments passed to the selected enrichment methods.

Value

A list with an entry for each method applied. Each method entry is a list with an entry for each dataset analyzed. Each dataset entry is a list of length 2, with the first element being the runtime and the second element being the gene set ranking, as obtained from applying the respective method to the respective dataset.

Author(s)

Ludwig Geistlinger <[email protected]>

Examples


    # loading three datasets from the GEO2KEGG compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=3)

    # only considering the first 1000 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) 

    # preprocessing and DE analysis for two of the datasets
    geo2kegg <- maPreproc(geo2kegg[2:3])
    geo2kegg <- runDE(geo2kegg)

    # getting a subset of human KEGG gene sets
    gs.file <- system.file("extdata/hsa_kegg_gs.gmt", package="EnrichmentBrowser")
    kegg.gs <- EnrichmentBrowser::getGenesets(gs.file)

    # applying two methods to two datasets 
    res <- runEA(geo2kegg, methods=c("ora", "camera"), gs=kegg.gs, perm=0)

    # applying a user-defined enrichment method
    dummySBEA <- function(se, gs) 
    {
         sig.ps <- sample(seq(0, 0.05, length=1000), 5)
         nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5)
         ps <- sample(c(sig.ps, nsig.ps), length(gs))
         names(ps) <- names(gs)
         return(ps)
    }  
    res <- runEA(geo2kegg, methods=dummySBEA, gs=kegg.gs)

    # applying a mix of pre-defined and user-defined methods
    methods <- list(camera = "camera", dummySBEA = dummySBEA)
    res <- runEA(geo2kegg, methods, gs=kegg.gs, perm=0)   


# loading three datasets from the GEO2KEGG compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=3)

    # only considering the first 1000 probes for demonstration
    geo2kegg <- lapply(geo2kegg, function(d) d[1:1000,]) 

    # preprocessing and DE analysis for two of the datasets
    geo2kegg <- maPreproc(geo2kegg[2:3])
    geo2kegg <- runDE(geo2kegg)

    # getting a subset of human KEGG gene sets
    gs.file <- system.file("extdata/hsa_kegg_gs.gmt", package="EnrichmentBrowser")
    kegg.gs <- EnrichmentBrowser::getGenesets(gs.file)

    # applying two methods to two datasets 
    res <- runEA(geo2kegg, methods=c("ora", "camera"), gs=kegg.gs, perm=0)

    # applying a user-defined enrichment method
    dummySBEA <- function(se, gs) 
    {
         sig.ps <- sample(seq(0, 0.05, length=1000), 5)
         nsig.ps <- sample(seq(0.1, 1, length=1000), length(gs)-5)
         ps <- sample(c(sig.ps, nsig.ps), length(gs))
         names(ps) <- names(gs)
         return(ps)
    }  
    res <- runEA(geo2kegg, methods=dummySBEA, gs=kegg.gs)

    # applying a mix of pre-defined and user-defined methods
    methods <- list(camera = "camera", dummySBEA = dummySBEA)
    res <- runEA(geo2kegg, methods, gs=kegg.gs, perm=0)

Package 'GSEABenchmarkeR'

Help Index

Customized boxplot visualization of benchmark results

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Caching of a resource

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Evaluating gene set rankings for the number of (significant) sets

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Evaluation of enrichment methods on random gene sets

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Evaluating phenotype relevance of gene set rankings

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Evaluation of the type I error rate of enrichment methods

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Loading pre-defined and user-defined expression data

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Preprocessing of microarray expression data

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Read a mapping between dataset ID and disease code

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Reading results of enrichment analysis

Description

Usage