Title: | Construct Benchmarks for Single Cell Analysis Methods |
---|---|
Description: | This package contains infrastructure for benchmarking analysis methods and access to single cell mixture benchmarking data. It provides a framework for organising analysis methods and testing combinations of methods in a pipeline without explicitly laying out each combination. It also provides utilities for sampling and filtering SingleCellExperiment objects, constructing lists of functions with varying parameters, and multithreaded evaluation of analysis methods. |
Authors: | Shian Su [cre, aut], Saskia Freytag [aut], Luyi Tian [aut], Xueyi Dong [aut], Matthew Ritchie [aut], Peter Hickey [ctb], Stuart Lee [ctb] |
Maintainer: | Shian Su <[email protected]> |
License: | GPL-3 |
Version: | 1.23.0 |
Built: | 2024-11-18 03:11:16 UTC |
Source: | https://github.com/bioc/CellBench |
This package contains a framework for benchmarking combinations of methods in
a multi-stage pipeline. It is mainly based around the apply_methods
function, which takes lists of functions to be applied in stages of a
pipeline.
Shian Su <https://www.github.com/shians>
The core function in this package is apply_methods; see vignette("Introduction", package = "CellBench") for basic usage. Run cellbench_case_study() to see a case study using CellBench. The data loading functions from load_all_data may also be of interest.
Check the result column of a benchmark tibble for any task_error objects.
any_task_errors(x, verbose)

## S3 method for class 'benchmark_tbl'
any_task_errors(x, verbose = FALSE)
x |
the tibble to check |
verbose |
TRUE if the rows with errors should be reported |
TRUE if any entry in the result column is a task_error object
any_task_errors(benchmark_tbl)
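A minimal sketch of checking for captured failures, assuming (as the task_error helpers imply) that a method which throws an error is stored as a task_error object in the result column rather than stopping the run; the 'broken' method below is purely illustrative:

datasets <- list(
    set1 = rnorm(500, mean = 2, sd = 1)
)
methods <- list(
    ok     = identity,
    broken = function(x) { stop("deliberate failure") }
)
res <- apply_methods(datasets, methods)
any_task_errors(res)                 # expected TRUE if the failure was captured
any_task_errors(res, verbose = TRUE) # additionally reports which rows errored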
apply_methods() and its aliases apply_metrics() and begin_benchmark() take either lists of datasets or benchmark_tbl objects and apply a list of functions. The output is a benchmark_tbl where each method has been applied to each dataset or preceding result.
apply_methods(x, fn_list, name = NULL, suppress.messages = TRUE)

## S3 method for class 'list'
apply_methods(x, fn_list, name = NULL, suppress.messages = TRUE)

## S3 method for class 'benchmark_tbl'
apply_methods(x, fn_list, name = NULL, suppress.messages = TRUE)

## S3 method for class 'tbl_df'
apply_methods(x, fn_list, name = NULL, suppress.messages = TRUE)

apply_metrics(x, fn_list, name = NULL, suppress.messages = TRUE)

begin_benchmark(x, fn_list, name = NULL, suppress.messages = TRUE)
x |
the list of data or benchmark tibble to apply methods to |
fn_list |
the list of methods to be applied |
name |
(optional) the name of the column for methods applied |
suppress.messages |
TRUE if messages from running methods should be suppressed |
benchmark_tbl object containing results from the methods applied; the first column is the name of the dataset as a factor, middle columns contain method names as factors, and the final column is a list of results of applying the methods.
# list of data
datasets <- list(
    set1 = rnorm(500, mean = 2, sd = 1),
    set2 = rnorm(500, mean = 1, sd = 2)
)

# list of functions
add_noise <- list(
    none = identity,
    add_bias = function(x) { x + 1 }
)

res <- apply_methods(datasets, add_noise)
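Continuing the example above, a sketch of chaining a second stage onto the previous results to build a multi-stage pipeline; the 'metrics' method list below is illustrative, and the new method column is expected to take its name from the supplied list:

metrics <- list(
    mean = mean,
    median = median
)
res2 <- res %>% apply_methods(metrics)
res2 # one row per combination of dataset, add_noise method and metrics method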
Utility function for generating unicode arrow separators.
arrow_sep(towards = c("right", "left"), unicode = FALSE)
towards |
the direction the unicode arrow points towards |
unicode |
whether unicode arrows should be used. Does not work inside plots within knitted PDF documents. |
a string containing a unicode arrow surrounded by two spaces
arrow_sep("left")  # left arrow
arrow_sep("right") # right arrow
Convert a benchmark_tbl to a list where the names of the elements represent the pipeline steps separated by "..". This can be useful for using the apply family of functions.
as_pipeline_list(x)
x |
the benchmark_tbl object to convert |
list containing the results, with names set to the data and pipeline steps separated by ".."
# list of data
datasets <- list(
    set1 = rnorm(500, mean = 2, sd = 1),
    set2 = rnorm(500, mean = 1, sd = 2)
)

# list of functions
add_noise <- list(
    none = identity,
    add_bias = function(x) { x + 1 }
)

res <- apply_methods(datasets, add_noise)
as_pipeline_list(res)
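Following on from the example above, a sketch of using the flattened list with the apply family, e.g. to compute the mean of each pipeline's result; the printed names are expected to take the form data..step:

pipelines <- as_pipeline_list(res)
names(pipelines)      # e.g. "set1..none", "set1..add_bias", ...
lapply(pipelines, mean)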
Take a function and return a cached version. The arguments and results of a cached method are saved to disk; if the cached function is called again with the same arguments, the results are retrieved from the cache rather than recomputed.
cache_method(f, cache = getOption("CellBench.cache"))
f |
the function to be cached |
cache |
the cache information (from memoise package) |
(CAUTION) Because cached functions called with the same arguments will always return the same output, pseudo-random methods will not return varying results over repeated runs as one might expect.
This function is a thin wrapper around the memoise package. The return value is a function whose results are cached and which is called identically to f.
# sets cache path to a temporary directory
set_cellbench_cache_path(file.path(tempdir(), ".CellBenchCache"))

f <- function(x) { x + 1 }
cached_f <- cache_method(f)
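A short usage sketch continuing the example above: the first call computes and stores the result on disk, and a repeated call with the same argument should be served from the cache:

cached_f(1) # computed and cached, returns 2
cached_f(1) # retrieved from the cache, returns 2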
Opens a vignette containing a case study using CellBench.
cellbench_case_study()
opens a vignette containing a case study
## Not run: 
cellbench_case_study()
## End(Not run)
Search the CellBench package for packaged data; leaving the argument empty will list the available data.
cellbench_file(filename = NULL)
filename |
the name of the file to look for |
string containing the path to the packaged data
cellbench_file()                     # shows available files
cellbench_file("10x_sce_sample.rds") # returns path to 10x sample data
Delete the datasets cached by the load_*_data set of functions
clear_cached_datasets()
None
## Not run: 
clear_cached_datasets()
## End(Not run)
Clears the method cache for CellBench
clear_cellbench_cache()
None
## Not run: 
clear_cellbench_cache()
## End(Not run)
Collapse a benchmark_tbl into two columns: "pipeline" and "result". The "pipeline" column will be the concatenated values from the data and method columns, while the "result" column remains unchanged from the benchmark_tbl. This is useful for having a string summary of the pipeline for annotation.
collapse_pipeline(
    x,
    sep = arrow_sep("right"),
    drop.steps = TRUE,
    data.name = TRUE
)

pipeline_collapse(
    x,
    sep = arrow_sep("right"),
    drop.steps = TRUE,
    data.name = TRUE
)
x |
the benchmark_tbl to collapse |
sep |
the separator to use for concatenating the pipeline steps |
drop.steps |
whether the data name and method steps should be dropped from the output. TRUE by default. |
data.name |
if the dataset name should be included in the pipeline string. Useful if only a single dataset is used. |
benchmark_tbl with pipeline and result columns (and all other columns if drop.steps is FALSE)
# list of data
datasets <- list(
    set1 = rnorm(500, mean = 2, sd = 1),
    set2 = rnorm(500, mean = 1, sd = 2)
)

# list of functions
add_noise <- list(
    none = identity,
    add_bias = function(x) { x + 1 }
)

res <- apply_methods(datasets, add_noise)
collapse_pipeline(res)
Constructor for a list of data, a thin wrapper around list() which checks that all the inputs are of the same type and have names
data_list(...)
... |
objects, must all be named |
a list of named data
data(iris)
flist <- data_list(
    data1 = iris[1:20, ],
    data2 = iris[21:40, ]
)
Remove all genes (rows) where the total count is 0
filter_zero_genes(x)
x |
the SingleCellExperiment or matrix to filter |
object of same type as input with all zero count genes removed
x <- matrix(rep(0:5, times = 5), nrow = 6, ncol = 5)
filter_zero_genes(x)
Generate a list of functions where specific arguments have been pre-applied from sequences of arguments, i.e. a function f(x, n) may have the 'n' argument pre-applied with specific values to obtain functions f1(x, n = 1) and f2(x, n = 2) stored in a list.
fn_arg_seq(func, ..., .strict = FALSE)
func |
function to generate list from |
... |
vectors of values to use as arguments |
.strict |
TRUE if argument names are checked, giving an error if a specified argument does not appear in the function signature. Note that functions with multiple methods generally have only f(x, ...) as their signature, so the check would fail even if the arguments are passed on. |
If multiple argument vectors are provided then the combinations of arguments in the sequences will be generated.
list of functions with the specified arguments pre-applied. Names of the list indicate the values that have been pre-applied.
f <- function(x) { cat("x:", x) }
f_list <- fn_arg_seq(f, x = c(1, 2))
f_list
f_list[[1]]() # x: 1
f_list[[2]]() # x: 2

g <- function(x, y) { cat("x:", x, "y:", y) }
g_list <- fn_arg_seq(g, x = c(1, 2), y = c(3, 4))
g_list
g_list[[1]]() # x: 1 y: 3
g_list[[2]]() # x: 1 y: 4
g_list[[3]]() # x: 2 y: 3
g_list[[4]]() # x: 2 y: 4
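A sketch of the intended use with apply_methods(), pre-applying a parameter sweep to build a method list; the add_bias function and its 'amount' argument are illustrative:

datasets <- list(
    set1 = rnorm(500, mean = 2, sd = 1)
)
add_bias <- function(x, amount) { x + amount }
bias_methods <- fn_arg_seq(add_bias, amount = c(0, 1, 2))
res <- apply_methods(datasets, bias_methods)
res # one row per dataset and pre-applied value of 'amount'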
Constructor for a list of functions, a thin wrapper around list() which checks that all the inputs are functions and have names
fn_list(...)
... |
objects, must all be named |
a list of named functions
flist <- fn_list(
    mean = mean,
    median = median
)
This is a helper function for checking the result column of a benchmark_tbl for task_error objects. This is useful for filtering out rows where the result is a task error.
is.task_error(x)
x |
the object to be tested |
vector of logicals denoting if elements of the list are task_error objects
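A sketch of filtering failed rows out of a benchmark_tbl, assuming that a method which throws an error is captured as a task_error object in the result column and that is.task_error() vectorises over the result list as described above; the 'broken' method is illustrative:

datasets <- list(set1 = rnorm(100))
methods <- list(
    ok     = identity,
    broken = function(x) { stop("deliberate failure") }
)
res <- apply_methods(datasets, methods)
res[!is.task_error(res$result), ] # keep only the rows that succeeded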
Filter a SingleCellExperiment or matrix down to the cells (columns) with the highest counts
keep_high_count_cells(x, n)
x |
the SingleCellExperiment or matrix |
n |
the number of highest count cells to keep |
object of same type as input containing the highest count cells
data(sample_sce_data)
keep_high_count_cells(sample_sce_data, 10)
Filter a SingleCellExperiment or matrix down to the genes (rows) with the highest counts
keep_high_count_genes(x, n)
x |
the SingleCellExperiment or matrix |
n |
the number of highest count genes to keep |
object of same type as input containing the highest count genes
data(sample_sce_data)
keep_high_count_genes(sample_sce_data, 300)
Filter a SingleCellExperiment or matrix down to the most variable genes (rows); variability is determined by var() scaled by the total counts for the gene.
keep_high_var_genes(x, n)
x |
the SingleCellExperiment or matrix |
n |
the number of most variable genes to keep |
object of same type as input containing the most variable genes
data(sample_sce_data)
keep_high_var_genes(sample_sce_data, 50)
Load in all CellBench data described at <https://github.com/LuyiTian/CellBench_data/blob/master/README.md>.
load_sc_data()

load_cell_mix_data()

load_mrna_mix_data()

load_all_data()
list of SingleCellExperiment
load_sc_data(): Load single cell data
load_cell_mix_data(): Load cell mixture data
load_mrna_mix_data(): Load mRNA mixture data
## Not run: 
cellbench_file <- load_all_data()
## End(Not run)
head() prints all columns, which may flood the console; mhead() takes a square block, which can look nicer while still providing a good inspection of the contents.
mhead(x, n = 6)
x |
the object with 2 dimensions |
n |
the size of the n-by-n block to extract |
an n-by-n sized subset of x
x <- matrix(runif(100), nrow = 10, ncol = 10)
mhead(x)
mhead(x, n = 3)
Sample n cells from a SingleCellExperiment object with no replacement.
sample_cells(x, n)
x |
the SingleCellExperiment object |
n |
the number of cells to sample |
SingleCellExperiment object
sample_sce_data <- readRDS(cellbench_file("celseq_sce_sample.rds"))
dim(sample_sce_data)
x <- sample_cells(sample_sce_data, 10)
dim(x)
Sample n genes from a SingleCellExperiment object with no replacement
sample_genes(x, n)
x |
the SingleCellExperiment object |
n |
the number of genes to sample |
SingleCellExperiment object
sample_sce_data <- readRDS(cellbench_file("10x_sce_sample.rds"))
dim(sample_sce_data)
x <- sample_genes(sample_sce_data, 50)
dim(x)
A dataset containing 200 genes and 50 cells randomly sampled from the CelSeq mRNA mixture dataset, each sample is a mixture of mRNA material from 3 different human adenocarcinoma cell lines. Useful for quick prototyping of method wrappers.
data(sample_sce_data)
A SingleCellExperiment object with sample annotations in colData(sample_sce_data). The annotation contains various QC metrics as well as the cell line mixture proportions: the proportion of mRNA from the H2228, H1975 and HCC827 cell lines.
This is a more advanced interface for changing CellBench's parallelism settings. Internally CellBench uses BiocParallel for parallelism, consult the documentation of BiocParallel to see what settings are available.
set_cellbench_bpparam(param)
param |
the BiocParallel parameter object |
None
See set_cellbench_threads for a more basic interface.
set_cellbench_threads(1) # CellBench runs on a single thread
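A sketch using BiocParallel parameter objects directly; SerialParam() and MulticoreParam() are standard BiocParallel constructors (MulticoreParam uses forked workers and is not effective on Windows):

library(BiocParallel)
set_cellbench_bpparam(SerialParam())                # equivalent to a single thread
set_cellbench_bpparam(MulticoreParam(workers = 2))  # two forked worker processes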
Set CellBench cache path
set_cellbench_cache_path(path = "./.CellBenchCache")
path |
the path to where method caches should be stored |
None
See cache_method for constructing cached methods.
## Not run: 
# hidden folder in local path
set_cellbench_cache_path(".CellBenchCache")
## End(Not run)

# store in temp directory valid for this session
set_cellbench_cache_path(file.path(tempdir(), ".CellBenchCache"))
Sets the global parameter for CellBench to use multiple threads for applying methods. If any of the methods applied are multi-threaded then it is recommended to set CellBench threads to 1. Using CellBench with multiple threads is only recommended if the methods applied can be set to run on a single thread.
set_cellbench_threads(nthreads = 1)
nthreads |
the number of threads used by CellBench |
None
See set_cellbench_bpparam for a more advanced interface.
set_cellbench_threads(1) # CellBench runs on a single thread
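A sketch of enabling multiple threads for the apply step and restoring the single-threaded default afterwards; this is only worthwhile when the methods themselves run single-threaded:

set_cellbench_threads(2) # methods run by apply_methods() use 2 threads
# ... apply single-threaded methods here ...
set_cellbench_threads(1) # restore the single-threaded default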
Some methods perform multiple steps of a pipeline. This function assists with splitting the combined pipeline step into multiple steps with duplicated method names.
split_step(x, step, into)
x |
a results data.frame from 'apply_methods()'. |
step |
the name of the column to split. |
into |
the name of the columns to split into. |
a results data.frame where the 'step' column has been split into the 'into' columns with duplicated values.
datasets <- list(
    set1 = rnorm(500, mean = 2, sd = 1),
    set2 = rnorm(500, mean = 1, sd = 2)
)

# list of functions
add_noise <- list(
    none = identity,
    add_bias = function(x) { x + 1 }
)

res <- apply_methods(datasets, add_noise)

res %>% split_step("add_noise", c("split1", "split2"))
time_methods() takes either lists of datasets or benchmark_timing_tbl objects and applies a list of functions. The output is a benchmark_timing_tbl where each method has been applied to each dataset or preceding result. Unlike apply_methods(), time_methods() is always single-threaded so as to produce fair and more consistent timings.
time_methods(x, fn_list, name = NULL, suppress.messages = TRUE)

## S3 method for class 'list'
time_methods(x, fn_list, name = NULL, suppress.messages = TRUE)

## S3 method for class 'benchmark_timing_tbl'
time_methods(x, fn_list, name = NULL, suppress.messages = TRUE)
x |
the list of data or benchmark timing tibble to apply methods to |
fn_list |
the list of methods to be applied |
name |
(optional) the name of the column for methods applied |
suppress.messages |
TRUE if messages from running methods should be suppressed |
benchmark_timing_tbl object containing results from the methods applied; the first column is the name of the dataset as a factor, middle columns contain method names as factors, and the final column is a list of lists containing the results of applying the methods and the timings from calling system.time().
datasets <- list(
    set1 = 1:1e7
)

transform <- list(
    sqrt = sqrt,
    log = log
)

time_methods(datasets, transform) %>%
    unpack_timing() # extract timings out of list