Title: | An R interface for python subsampling/sketching algorithms |
---|---|
Description: | Provides an R interface for various subsampling algorithms implemented in python packages. Currently, interfaces to the geosketch and scSampler python packages are implemented. It also provides diagnostic plots to evaluate the subsampling. |
Authors: | Charlotte Soneson [aut, cre], Michael Stadler [aut], Friedrich Miescher Institute for Biomedical Research [cph] |
Maintainer: | Charlotte Soneson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3.0 |
Built: | 2024-10-31 05:33:59 UTC |
Source: | https://github.com/bioc/sketchR |
Plot the composition of a data set (e.g., the number of cells from each cell type) and contrast it with the corresponding composition of a subset.
compareCompositionPlot( df, idx, column, showPercentages = TRUE, fontSizePercentages = 4 )
df |
A |
idx |
A numeric vector representing the row indexes of |
column |
A character scalar corresponding to a column of |
showPercentages |
Logical scalar, indicating whether relative frequencies of each category should be shown in the plot. |
fontSizePercentages |
Numerical scalar, indicating the font size of the relative frequencies, if |
A ggplot object.
Charlotte Soneson
df <- data.frame(celltype = sample(LETTERS[1:5], 1000, replace = TRUE,
                                   prob = c(0.1, 0.2, 0.5, 0.05, 0.15)))
idx <- sample(seq_len(1000), 200)
compareCompositionPlot(df, idx, "celltype")
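The comparison drawn by this plot can also be inspected numerically. The following is a minimal base R sketch, not part of sketchR: it tabulates the relative frequency of each category in the full data set and in the subset. The object names `lev`, `full`, `sub` and `comp` are illustrative only.

```r
set.seed(1)
df <- data.frame(celltype = sample(LETTERS[1:5], 1000, replace = TRUE,
                                   prob = c(0.1, 0.2, 0.5, 0.05, 0.15)))
idx <- sample(seq_len(1000), 200)

## Use a factor with fixed levels so that both tables cover the same
## categories, even if one is missing from the subset
lev <- sort(unique(df$celltype))
full <- prop.table(table(factor(df$celltype, levels = lev)))
sub <- prop.table(table(factor(df$celltype[idx], levels = lev)))

## Relative frequencies (in percent) for the full data set and the subset
comp <- data.frame(category = lev,
                   full = 100 * as.vector(full),
                   subset = 100 * as.vector(sub))
comp
```

Large differences between the `full` and `subset` columns indicate categories that the subsampling over- or under-represents.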
Perform geometric sketching with the geosketch python package.
geosketch( mat, N, replace = FALSE, k = "auto", alpha = 0.1, seed = NULL, max_iter = 200, one_indexed = TRUE, verbose = FALSE )
mat |
m x n matrix. Samples (the dimension along which to subsample) should be in the rows, features in the columns. |
N |
Numeric scalar, the number of samples to retain. |
replace |
Logical scalar, whether to sample with replacement. |
k |
Numeric scalar or |
alpha |
Numeric scalar defining the acceptable interval around |
seed |
Numeric scalar or |
max_iter |
Numeric scalar giving the maximum iterations at which to terminate binary search in rare cases of non-monotonicity of covering boxes. |
one_indexed |
Logical scalar, whether to return one-indexed indices. |
verbose |
Logical scalar, whether to print logging output while running. |
The first time this function is run, it will create a conda environment containing the geosketch package. This is done via the basilisk R/Bioconductor package - see the documentation for that package for troubleshooting.
A numeric vector with indices to retain.
Charlotte Soneson, Michael Stadler
Hie et al (2019): Geometric sketching compactly summarizes the single-cell transcriptomic landscape. Cell Systems 8, 483–493.
x <- matrix(rnorm(500), nrow = 100)
geosketch(mat = x, N = 10, seed = 42)
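The returned indices are meant for subsetting the rows of `mat`. The sketch below uses base R only and assumes that indices obtained with `one_indexed = FALSE` follow the zero-based Python convention; the vector `idx0` is a made-up stand-in for such output, not an actual result of `geosketch`.

```r
x <- matrix(rnorm(500), nrow = 100)

## Hypothetical zero-based indices, as a Python routine would report them
idx0 <- c(0, 4, 17, 99)

## R indexing is one-based, so shift by one before subsetting
idx1 <- idx0 + 1

## Keep the selected samples (rows); drop = FALSE preserves the matrix shape
xSketch <- x[idx1, , drop = FALSE]
dim(xSketch)
```

With the default `one_indexed = TRUE`, no shift should be needed and the result can be used directly as a row index.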
Get names of geosketch functions
getGeosketchNames()
A list of names of objects exposed in the geosketch module
Charlotte Soneson
getGeosketchNames()
Get names of scSampler functions
getScSamplerNames()
A list of names of objects exposed in the scSampler module
Charlotte Soneson
getScSamplerNames()
Create a diagnostic plot showing the Hausdorff distance between a sketch and the full data set, for varying sketch sizes. For reproducibility, seed the random number generator before calling this function using set.seed.
hausdorffDistPlot( mat, Nvec, Nrep = 5, q = 1e-04, methods = c("geosketch", "scsampler", "uniform"), extraArgs = list() )
mat |
m x n matrix. Samples (the dimension along which to subsample) should be in the rows, features in the columns. |
Nvec |
Numeric vector of sketch sizes. |
Nrep |
Numeric scalar indicating the number of sketches to draw for each sketch size. |
q |
Numeric scalar in [0,1], indicating the fraction of largest minimum distances to discard when calculating the robust Hausdorff distance. Setting q=0 gives the classical Hausdorff distance. The default is 1e-4, as suggested by Hie et al (2019). |
methods |
Character vector, indicating which method(s) to include in the plot. Should be a subset of c("geosketch", "scsampler", "uniform"), where "uniform" randomly samples from input features with uniform probabilities. |
extraArgs |
Named list providing extra arguments to the respective methods (beyond the matrix and the sketch size). The names of the list should be the method names (currently, "geosketch" or "scsampler"), and each list element should be a named list of argument values. See the examples for an illustration of how to use this argument. Note that the |
A ggplot object.
Charlotte Soneson, Michael Stadler
Hie et al (2019): Geometric sketching compactly summarizes the single-cell transcriptomic landscape. Cell Systems 8, 483–493.
Song et al (2022): scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. bioRxiv doi:10.1101/2022.01.15.476407
Huttenlocher et al (1993): Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9), 850-863.
## Generate example data matrix
mat <- matrix(rnorm(1000), nrow = 100)

## Generate diagnostic Hausdorff distance plot
## (including all available methods)
hausdorffDistPlot(mat, Nvec = c(10, 25, 50))

## Provide additional arguments for geosketch
hausdorffDistPlot(mat, Nvec = c(10, 25, 50), Nrep = 2,
                  extraArgs = list(geosketch = list(max_iter = 100)))
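To make the role of `q` concrete, the following base R sketch computes a robust Hausdorff distance between a full data set and a sketch drawn by uniform sampling. It is an illustration under stated assumptions, not the code used by `hausdorffDistPlot`: the helper `robustHausdorff` is hypothetical, Euclidean distance is assumed, and the rank used to discard the fraction `q` of largest minimum distances is one possible convention.

```r
## Hypothetical helper: robust Hausdorff distance from all rows of mat
## to the sketch given by the row indices idx
robustHausdorff <- function(mat, idx, q = 1e-4) {
    stopifnot(is.matrix(mat), q >= 0, q <= 1)
    ## Sketch points as columns, so that a sample can be subtracted column-wise
    tSketch <- t(mat[idx, , drop = FALSE])
    ## Euclidean distance from each sample to its nearest sketch point
    minDist <- apply(mat, 1, function(x) sqrt(min(colSums((tSketch - x)^2))))
    ## Discard the fraction q of largest minimum distances;
    ## q = 0 keeps the maximum, i.e. the classical Hausdorff distance
    k <- max(1, ceiling((1 - q) * length(minDist)))
    sort(minDist)[k]
}

set.seed(42)
mat <- matrix(rnorm(1000), nrow = 100)
Nvec <- c(10, 25, 50)

## One uniformly sampled sketch per sketch size
hd <- sapply(Nvec, function(N) {
    robustHausdorff(mat, sample(nrow(mat), N), q = 0)
})
names(hd) <- Nvec
hd
```

A sketch that covers the data well leaves no sample far from its nearest sketch point, so smaller values indicate better coverage.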
Perform subsampling with the scSampler python package.
scsampler(mat, N, random_split = 1, seed = 0)
mat |
m x n matrix. Samples (the dimension along which to subsample) should be in the rows, features in the columns. |
N |
Numeric scalar, the number of samples to retain. |
random_split |
Numeric scalar, the number of parts to randomly split the data into before subsampling within each part. A larger value will speed up computations, but give less optimal results. |
seed |
Numeric scalar, passed to |
The first time this function is run, it will create a conda environment containing the scSampler package. This is done via the basilisk R/Bioconductor package - see the documentation for that package for troubleshooting.
A numeric vector with indices to retain.
Charlotte Soneson, Michael Stadler
Song et al (2022): scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. bioRxiv doi:10.1101/2022.01.15.476407
x <- matrix(rnorm(500), nrow = 100)
scsampler(mat = x, N = 10)
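The split-then-subsample idea behind `random_split` can be illustrated in base R. This is a conceptual sketch only, not scSampler's implementation: the helper `splitAndSample` is hypothetical, uniform sampling stands in for the actual selection step within each part, and the requirement that `N` be divisible by `random_split` is a simplification made here.

```r
## Hypothetical helper: split n samples into random parts, then pick
## an equal number of samples from each part
splitAndSample <- function(n, N, random_split = 1) {
    stopifnot(N <= n, N %% random_split == 0)
    ## Randomly assign the samples to parts of near-equal size
    part <- sample(rep(seq_len(random_split), length.out = n))
    perPart <- N / random_split
    idx <- unlist(lapply(split(seq_len(n), part), function(i) {
        ## Placeholder selection: uniform sampling within the part
        i[sample.int(length(i), perPart)]
    }), use.names = FALSE)
    sort(idx)
}

set.seed(0)
idx <- splitAndSample(n = 100, N = 10, random_split = 2)
idx
```

Each part is processed independently, which is why a larger `random_split` reduces the work per part while the selection no longer considers the data set as a whole.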