| Title: | Bindings to C++ Libraries for Single-Cell Analysis |
|---|---|
| Description: | Implements R bindings to C++ code for analyzing single-cell (expression) data, mostly from various libscran libraries. Each function performs an individual step in the single-cell analysis workflow, ranging from quality control to clustering and marker detection. Additional wrappers are provided for easy construction of end-to-end workflows involving Bioconductor objects like SingleCellExperiments. |
| Authors: | Aaron Lun [cre, aut] |
| Maintainer: | Aaron Lun <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.7.3 |
| Built: | 2026-05-19 11:11:36 UTC |
| Source: | https://github.com/bioc/scrapper |
Implements R bindings to C++ code for analyzing single-cell (expression) data, mostly from various libscran libraries. Each function performs an individual step in the single-cell analysis workflow, ranging from quality control to clustering and marker detection. Additional wrappers are provided for easy construction of end-to-end workflows involving Bioconductor objects like SingleCellExperiments.
Maintainer: Aaron Lun [email protected]
Authors:
Aaron Lun [email protected]
Useful links:
Compute per-cell QC metrics from an initialized matrix of ADT counts, and use the metrics to suggest filter thresholds to retain high-quality cells.
computeAdtQcMetrics(x, subsets, num.threads = NULL) suggestAdtQcThresholds( metrics, block = NULL, min.detected.drop = NULL, num.mads = NULL, detected.num.mads = num.mads, subset.sum.num.mads = num.mads ) filterAdtQcMetrics(thresholds, metrics, block = NULL)computeAdtQcMetrics(x, subsets, num.threads = NULL) suggestAdtQcThresholds( metrics, block = NULL, min.detected.drop = NULL, num.mads = NULL, detected.num.mads = num.mads, subset.sum.num.mads = num.mads ) filterAdtQcMetrics(thresholds, metrics, block = NULL)
x |
A matrix-like object where rows are ADTs and columns are cells. Values are expected to be counts. |
subsets |
Named list of vectors specifying tag subsets of interest, typically control tags like IgGs.
Each vector may be logical (whether to keep each row), integer (row indices) or character (row names).
For character vectors, strings not present in |
num.threads |
Integer specifying the number of threads to use. If |
metrics |
DataFrame of per-cell QC metrics.
This should have the same structure as the return value of |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in For |
min.detected.drop |
Minimum drop in the number of detected features from the median, in order to consider a cell to be of low quality. If |
num.mads |
Number of median from the median, to define the threshold for outliers in each metric. |
detected.num.mads |
Number of median from the median, to define the threshold for outliers in the number of detected tags. If |
subset.sum.num.mads |
Number of median from the median, to define the threshold for outliers in the subset sums. If |
thresholds |
List with the same structure as produced by |
For computeAdtQcMetrics, a DataFrame is returned with one row per cell in x.
This contains the following columns:
sum, a numeric vector containing the total ADT count for each cell.
In theory, this represents the efficiency of library preparation and sequencing.
Compared to RNA, the sum is less useful as a QC metric for ADT data as it is strongly influenced by biological variation in the abundance of the targeted features.
Nonetheless, we compute it for diagnostic purposes.
detected, an integer vector containing the number of detected tags per cell.
Even though ADTs are typically used in situations where few features are highly abundant (e.g., cell type-specific markers),
we still expect detectable coverage of most features due to ambient contamination, non-specific binding or some background expression.
Low numbers of detected tags indicates that library preparation or sequencing depth was suboptimal.
subsets, a nested DataFrame where each column corresponds to a control subset and is a numeric vector containing the total count in that subset.
The exact interpretation depends on the nature of the feature subset but the most common use case involves isotype control (IgG) features.
IgG antibodies should not bind to anything so a high subset sum suggests that non-specific binding is a problem, e.g., due to antibody conjugates.
(Unlike RNA quality control, we do not use proportions here as it is entirely possible for a cell to have low counts for other tags due to the absence of their targeted features;
this would result in a high proportion even if the cell has a "normal" level of non-specific binding.)
Each vector is of length equal to the number of cells.
For suggestAdtQcThresholds, a named list is returned:
If block=NULL, the list contains:
detected, a number containing the lower bound on the number of detected tags.
This is defined as the lower of (i) num.mads MADs below the median for the log-transformed values across all cells,
and (ii) the product of 1 - min.detected.drop and the median across all cells.
The latter avoids overly aggressive filtering when the MAD is zero.
subsets, a numeric vector containing the upper bound on the sum of counts in each control subset.
This is defined as num.mads MADs above the median of the log-transformed metrics across all cells.
Otherwise, if block is supplied, the list contains:
detected, a numeric vector containing the lower bound on the number of detected tags for each blocking level.
Here, the threshold is computed independently for each block, using the same method as the unblocked case.
subsets, a list of numeric vectors containing the upper bound on the sum of counts in each control subset for each blocking level.
Here, the threshold is computed independently for each block, using the same method as the unblocked case.
block.ids, a vector containing the identities of the unique blocks.
Each vector is of length equal to the number of levels in block and is named accordingly.
For filterAdtQcMetrics, a logical vector of length ncol(x) is returned indicating which cells are of high quality.
High-quality cells are defined as those with numbers of detected tags above the detected threshold and control subset sums below the subsets threshold.
Aaron Lun
The compute_adt_qc_metrics, compute_adt_qc_filters and compute_adt_qc_filters_blocked functions in https://libscran.github.io/scran_qc/.
quickAdtQc.se, to run all of the ADT-related QC functions on a SummarizedExperiment.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Mocking up a control set. sub <- list(IgG=rbinom(nrow(x), 1, 0.1) > 0) qc <- computeAdtQcMetrics(x, sub) qc filt <- suggestAdtQcThresholds(qc) str(filt) keep <- filterAdtQcMetrics(filt, qc) summary(keep)# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Mocking up a control set. sub <- list(IgG=rbinom(nrow(x), 1, 0.1) > 0) qc <- computeAdtQcMetrics(x, sub) qc filt <- suggestAdtQcThresholds(qc) str(filt) keep <- filterAdtQcMetrics(filt, qc) summary(keep)
Default parameters from the underlying C++ library. These may be overridden by defaults in each function's signature.
computeAdtQcMetricsDefaults() suggestAdtQcThresholdsDefaults()computeAdtQcMetricsDefaults() suggestAdtQcThresholdsDefaults()
Named list containing default values for the various function arguments.
Aaron Lun
computeAdtQcMetricsDefaults() suggestAdtQcThresholdsDefaults()computeAdtQcMetricsDefaults() suggestAdtQcThresholdsDefaults()
Aggregate expression values across cells based on one or more grouping factors. This is usually applied to a count matrix to create pseudo-bulk profiles for each cluster/sample combination. These profiles can then be used as if they were counts from bulk data, e.g., for differential analyses with edgeR.
aggregateAcrossCells( x, factors, compute.sum = NULL, compute.detected = NULL, compute.median = NULL, num.threads = NULL )aggregateAcrossCells( x, factors, compute.sum = NULL, compute.detected = NULL, compute.median = NULL, num.threads = NULL )
x |
A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are typically expected to be counts. |
factors |
A list or data frame (or their equivalents from S4Vectors) containing one or more grouping factors, see |
compute.sum |
Boolean indicating whether to compute the sum in each group. If |
compute.detected |
Boolean indicating whether to compute the number of cells with detected expression in each group. If |
compute.median |
Boolean indicating whether to compute the median in each group. If |
num.threads |
Integer specifying the number of threads to be used for aggregation. If |
List containing:
sums, a numeric matrix where each row corresponds to a gene and each column corresponds to a unique combination of levels from factors.
Each entry contains the summed expression across all cells with that combination.
Omitted if compute.sum=FALSE.
detected, an integer matrix where each row corresponds to a gene and each column corresponds to a unique combination of levels from factors.
Each entry contains the number of cells with detected (i.e., positive) expression in that combination.
Omitted if compute.detected=FALSE.
medians, a numeric matrix where each row corresponds to a gene and each column corresponds to a unique combination of levels from factors.
Each entry contains the median expression across all cells with that combination.
Omitted if compute.median=FALSE.
combinations, a DataFrame containing the unique combination of levels from factors.
Rows correspond to columns of sums and detected, while columns correspond to the factors in factors.
counts, the number of cells associated with each combination.
Each entry corresponds to a row of combinations.
index, an integer vector of length equal to the number of cells in x.
This specifies the combination in combinations to which each cell was assigned.
Aaron Lun
The aggregate_across_cells function in https://libscran.github.io/scran_aggregate/.
aggregateAcrossCells.se, to perform aggregation on a SummarizedExperiment.
aggregateAcrossGenes, to aggregate expression values across gene sets.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Simple aggregation: clusters <- sample(LETTERS, 100, replace=TRUE) agg <- aggregateAcrossCells(x, list(cluster=clusters)) str(agg) # Multi-factor aggregation samples <- sample(1:5, 100, replace=TRUE) agg2 <- aggregateAcrossCells(x, list(cluster=clusters, sample=samples)) str(agg2)# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Simple aggregation: clusters <- sample(LETTERS, 100, replace=TRUE) agg <- aggregateAcrossCells(x, list(cluster=clusters)) str(agg) # Multi-factor aggregation samples <- sample(1:5, 100, replace=TRUE) agg2 <- aggregateAcrossCells(x, list(cluster=clusters, sample=samples)) str(agg2)
Aggregate expression values across groups of cells for each gene,
by calling aggregateAcrossCells on an assay in a SummarizedExperiment.
aggregateAcrossCells.se( x, factors, num.threads = 1, more.aggr.args = list(), assay.type = "counts", output.prefix = "factor.", counts.name = "counts", meta.name = "aggregated", include.coldata = TRUE, more.coldata.args = list(), altexps = NULL, copy.altexps = FALSE ) aggregateColData(coldata, index, number, only.atomic = TRUE, placeholder = NA)aggregateAcrossCells.se( x, factors, num.threads = 1, more.aggr.args = list(), assay.type = "counts", output.prefix = "factor.", counts.name = "counts", meta.name = "aggregated", include.coldata = TRUE, more.coldata.args = list(), altexps = NULL, copy.altexps = FALSE ) aggregateColData(coldata, index, number, only.atomic = TRUE, placeholder = NA)
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells. |
factors |
List or data frame (or their equivalents from S4Vectors) containing grouping factors, see Alternatively, an atomic vector or factor representing a single variable. |
num.threads |
Number of threads, passed to |
more.aggr.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.prefix |
String specifying a prefix to add to the names of the |
counts.name |
String specifying the name of the |
meta.name |
String specifying the name of the |
include.coldata |
Logical scalar indicating whether to add the aggregated |
more.coldata.args |
Named list of additional arguments to pass to |
altexps |
Unnamed integer or character vector specifying the indices/names of alternative experiments of Alternatively, this may be a named integer or character vector. Each name specifies an alternative experiment while each value is the index/name of the assay to aggregate from that experiment. Only relevant if |
copy.altexps |
Logical scalar indicating whether to copy the |
coldata |
DataFrame of column data, containing one row for each cell. |
index |
Integer vector containing the index of the factor combination to which each cell in |
number |
Integer specifying the total number of unique factor combinations.
All elements of |
only.atomic |
Logical scalar specifying whether to skip non-atomic, non-factor columns. |
placeholder |
Placeholder value to store in the output column when a factor combination does not have a single unique value. |
For aggregateAcrossCells.se, a SummarizedExperiment is returned where each column corresponds to a factor combination.
Each row corresponds to a gene in x and the rowData is taken from x.
The assays contain the sum of counts ("sums") and the number of detected cells ("detected") in each combination for each gene.
The colData contains:
The factor combinations, with column names prefixed by output.prefix.
The cell count for each combination, named by counts.name.
Additional colData from x if include.coldata=TRUE.
This is aggregated with aggregateColData on the combination indices.
The metadata contains a list named as meta.name, containing a index integer vector of length equal to the number of cells in x.
Each entry of this vector is an index of the factor combination (i.e., column of the output object) to which the corresponding cell was assigned.
If altexps is specified, a SingleCellExperiment is returned instead.
The same aggregation for the main experiment is applied to each alternative experiment.
If copy.altexps=TRUE, the colData for each alternative experiment will contain a copy of the factor combinations and cell counts,
and the metadata will contain a copy of the index vector.
For aggregateColData, a DataFrame is returned with number of rows equal to number.
Each atomic or factor column in coldata is represented by a column in the output DataFrame.
In each column, the j-th entry is equal to the unique value of all rows where index == j,
or placeholder if there is not exactly one unique value.
If only.atomic=FALSE, any non-atomic/non-factor columns of coldata are represented in the output DataFrame by a vector of placeholder values.
If only.atomic=TRUE, any non-atomic/non-factor columns of coldata are skipped.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("start") aggr <- aggregateAcrossCells.se(sce, sce$level1class) head(assay(aggr)) colData(aggr) # We can also aggregate within alternative experiments. aggr2 <- aggregateAcrossCells.se(sce, sce$level1class, altexps="ERCC") head(assay(altExp(aggr2, "ERCC")))library(SingleCellExperiment) sce <- getTestRnaData.se("start") aggr <- aggregateAcrossCells.se(sce, sce$level1class) head(assay(aggr)) colData(aggr) # We can also aggregate within alternative experiments. aggr2 <- aggregateAcrossCells.se(sce, sce$level1class, altexps="ERCC") head(assay(altExp(aggr2, "ERCC")))
aggregateAcrossCells
Default parameters from the underlying C++ library.
These may be overridden by defaults in the aggregateAcrossCells function signature.
aggregateAcrossCellsDefaults()aggregateAcrossCellsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
aggregateAcrossCellsDefaults()aggregateAcrossCellsDefaults()
Aggregate expression values across genes, potentially with weights. This is typically used to summarize expression values for gene sets into a single per-cell score.
aggregateAcrossGenes( x, sets, average = NULL, convert = TRUE, num.threads = NULL )aggregateAcrossGenes( x, sets, average = NULL, convert = TRUE, num.threads = NULL )
x |
A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are usually normalized expression values, possibly log-transformed depending on the application. |
sets |
List of vectors where each entry corresponds to a gene set.
Each entry may be an integer vector of row indices, a logical vector of length equal to the number of rows, or a character vector of row names.
For integer and character vectors, duplicate elements are ignored.
For character vectors, any strings not present in Alternatively, each entry may be a list of two vectors.
The first vector should be either integer (row indices) or character (row names), specifying the genes in the set.
The second vector should be numeric and of the same length as the first vector, specifying the weight associated with each gene.
If duplicate genes are present, only the first occurrence is used.
If the first vector contains gene names not present in |
average |
Boolean indicating whether to compute the average rather than the sum. If |
convert |
Boolean indicating whether to convert gene identities to non-duplicate row indices in each entry of |
num.threads |
Integer specifying the number of threads to be used for aggregation. If |
List of length equal to that of sets.
Each entry is a numeric vector of length equal to the number of columns in x,
containing the (weighted) sum/mean of expression values for the corresponding set across all cells.
Aaron Lun
The aggregate_across_genes function in https://libscran.github.io/scran_aggregate/.
aggregateAcrossGenes.se, to perform aggregation on a SummarizedExperiment.
aggregateAcrossCells, to aggregate expression values across groups of cells.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Unweighted aggregation: sets <- list( foo = sample(nrow(x), 20), bar = sample(nrow(x), 10) ) agg <- aggregateAcrossGenes(x, sets) str(agg) # Weighted aggregation: sets <- list( foo = list(sample(nrow(x), 20), runif(20)), bar = list(sample(nrow(x), 10), runif(10)) ) agg2 <- aggregateAcrossGenes(x, sets, average = TRUE) str(agg2)# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Unweighted aggregation: sets <- list( foo = sample(nrow(x), 20), bar = sample(nrow(x), 10) ) agg <- aggregateAcrossGenes(x, sets) str(agg) # Weighted aggregation: sets <- list( foo = list(sample(nrow(x), 20), runif(20)), bar = list(sample(nrow(x), 10), runif(10)) ) agg2 <- aggregateAcrossGenes(x, sets, average = TRUE) str(agg2)
Aggregate expression values across sets of genes for each cell,
by calling aggregateAcrossGenes on an assay in a SummarizedExperiment.
aggregateAcrossGenes.se( x, sets, num.threads = 1, more.aggr.args = list(), assay.type = "logcounts", output.name = NULL )aggregateAcrossGenes.se( x, sets, num.threads = 1, more.aggr.args = list(), assay.type = "logcounts", output.name = NULL )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells. |
sets |
List of gene sets, see Alternatively, |
num.threads |
Number of threads, passed to |
more.aggr.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.name |
String specifying the assay name in the output object.
Defaults to |
A SummarizedExperiment with number of rows equal to the number of gene sets.
The lone assay contains the aggregated values for each gene set for all cells.
The colData is the same as that of x.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("norm") library(org.Mm.eg.db) set.df <- select( org.Mm.eg.db, keytype="GO", keys=c( "GO:0048709", # oligodendrocyte differentiation "GO:0048699", # neuron development "GO:0048143" # astrocyte activation ), columns="SYMBOL" ) sets <- splitAsList(set.df$SYMBOL, set.df$GO) aggregated <- aggregateAcrossGenes.se(sce, sets) aggregated assay(aggregated)[,1:10]library(SingleCellExperiment) sce <- getTestRnaData.se("norm") library(org.Mm.eg.db) set.df <- select( org.Mm.eg.db, keytype="GO", keys=c( "GO:0048709", # oligodendrocyte differentiation "GO:0048699", # neuron development "GO:0048143" # astrocyte activation ), columns="SYMBOL" ) sets <- splitAsList(set.df$SYMBOL, set.df$GO) aggregated <- aggregateAcrossGenes.se(sce, sets) aggregated assay(aggregated)[,1:10]
aggregateAcrossGenes
Default parameters from the underlying C++ library.
These may be overridden by defaults in the aggregateAcrossGenes function signature.
aggregateAcrossGenesDefaults()aggregateAcrossGenesDefaults()
Named list containing default values for various function arguments.
Aaron Lun
aggregateAcrossGenesDefaults()aggregateAcrossGenesDefaults()
See Details for each function's recommended replacements.
analyze(...) convertAnalyzeResults(...)analyze(...) convertAnalyzeResults(...)
... |
Arguments, ignored. |
analyze is replaced by analyze.se.
convertAnalyzeResults is no longer necessary as conversion is done within analyze.se.
Execute a simple single-cell analysis pipeline, starting from a count matrix and ending with clusters, visualizations and markers. This also supports integration of multiple modalities and correction of batch effects.
analyze.se( x, rna.altexp = NA, adt.altexp = NULL, crispr.altexp = NULL, rna.assay.type = "counts", adt.assay.type = "counts", crispr.assay.type = "counts", block = NULL, block.name = "block", rna.qc.subsets = list(), rna.qc.output.prefix = NULL, more.rna.qc.args = list(), adt.qc.subsets = list(), adt.qc.output.prefix = NULL, more.adt.qc.args = list(), crispr.qc.output.prefix = NULL, more.crispr.qc.args = list(), filter.cells = TRUE, rna.norm.output.name = "logcounts", more.rna.norm.args = list(), adt.norm.output.name = "logcounts", more.adt.norm.args = list(), crispr.norm.output.name = "logcounts", more.crispr.norm.args = list(), rna.hvg.output.prefix = NULL, more.rna.hvg.args = list(), rna.pca.output.name = "PCA", more.rna.pca.args = list(), adt.pca.output.name = "PCA", more.adt.pca.args = list(), use.rna.pcs = TRUE, use.adt.pcs = TRUE, scale.output.name = "combined", more.scale.args = list(), mnn.output.name = "MNN", more.mnn.args = list(), more.umap.args = list(), more.tsne.args = list(), cluster.graph.output.name = "graph.cluster", more.build.graph.args = list(), more.cluster.graph.args = list(), more.neighbor.args = list(), kmeans.clusters = NULL, kmeans.clusters.output.name = "kmeans.cluster", more.kmeans.args = list(), clusters.for.markers = c("graph", "kmeans"), more.rna.marker.args = list(), more.adt.marker.args = list(), more.crispr.marker.args = list(), BNPARAM = AnnoyParam(), num.threads = 3L )analyze.se( x, rna.altexp = NA, adt.altexp = NULL, crispr.altexp = NULL, rna.assay.type = "counts", adt.assay.type = "counts", crispr.assay.type = "counts", block = NULL, block.name = "block", rna.qc.subsets = list(), rna.qc.output.prefix = NULL, more.rna.qc.args = list(), adt.qc.subsets = list(), adt.qc.output.prefix = NULL, more.adt.qc.args = list(), crispr.qc.output.prefix = NULL, more.crispr.qc.args = list(), filter.cells = TRUE, rna.norm.output.name = "logcounts", more.rna.norm.args = list(), adt.norm.output.name = "logcounts", more.adt.norm.args = list(), crispr.norm.output.name = "logcounts", more.crispr.norm.args = list(), rna.hvg.output.prefix = NULL, more.rna.hvg.args = list(), rna.pca.output.name = "PCA", more.rna.pca.args = list(), adt.pca.output.name = "PCA", more.adt.pca.args = list(), use.rna.pcs = TRUE, use.adt.pcs = TRUE, scale.output.name = "combined", more.scale.args = list(), mnn.output.name = "MNN", more.mnn.args = list(), more.umap.args = list(), more.tsne.args = list(), cluster.graph.output.name = "graph.cluster", more.build.graph.args = list(), more.cluster.graph.args = list(), more.neighbor.args = list(), kmeans.clusters = NULL, kmeans.clusters.output.name = "kmeans.cluster", more.kmeans.args = list(), clusters.for.markers = c("graph", "kmeans"), more.rna.marker.args = list(), more.adt.marker.args = list(), more.crispr.marker.args = list(), BNPARAM = AnnoyParam(), num.threads = 3L )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genomic features (genes, ADTs or CRISPR guides) and columns correspond to cells. |
rna.altexp |
String or integer specifying the alternative experiment of |
adt.altexp |
String or integer specifying the alternative experiment of |
crispr.altexp |
String or integer specifying the alternative experiment of |
rna.assay.type |
String or integer specifying the assay containing the RNA count data.
Only used if |
adt.assay.type |
String or integer specifying the assay containing the ADT count data.
Only used if |
crispr.assay.type |
String or integer specifying the assay containing the CRISPR count data.
Only used if |
block |
Vector or factor specifying the block of origin (e.g., batch, sample) for each cell in |
block.name |
String specifying the name of the |
rna.qc.subsets |
Passed to |
rna.qc.output.prefix |
Passed to |
more.rna.qc.args |
Named list of additional arguments to pass to |
adt.qc.subsets |
Passed to |
adt.qc.output.prefix |
Passed to |
more.adt.qc.args |
Named list of additional arguments to pass to |
crispr.qc.output.prefix |
Passed to |
more.crispr.qc.args |
Named list of additional arguments to pass to |
filter.cells |
Logical scalar indicating whether to filter |
rna.norm.output.name |
Passed to |
more.rna.norm.args |
Named list of arguments to pass to |
adt.norm.output.name |
Passed to |
more.adt.norm.args |
Named list of arguments to pass to |
crispr.norm.output.name |
Passed to |
more.crispr.norm.args |
Named list of arguments to pass to |
rna.hvg.output.prefix |
Passed to |
more.rna.hvg.args |
Named list of arguments to pass to |
rna.pca.output.name |
Passed to |
more.rna.pca.args |
Named list of arguments to pass to |
adt.pca.output.name |
Passed to |
more.adt.pca.args |
Named list of arguments to pass to |
use.rna.pcs |
Logical scalar indicating whether to use the RNA-derived PCs for downstream steps (i.e., clustering, visualization).
Only used if |
use.adt.pcs |
Logical scalar indicating whether to use the ADT-derived PCs for downstream steps (i.e., clustering, visualization).
Only used if |
scale.output.name |
Passed to |
more.scale.args |
Named list of arguments to pass to |
mnn.output.name |
Passed to |
more.mnn.args |
Named list of arguments to pass to |
more.umap.args |
Passed to |
more.tsne.args |
Passed to |
cluster.graph.output.name |
Passed to |
more.build.graph.args |
Passed to |
more.cluster.graph.args |
Passed to |
more.neighbor.args |
Passed to |
kmeans.clusters |
Passed to |
kmeans.clusters.output.name |
Passed to |
more.kmeans.args |
Named list of arguments to pass to |
clusters.for.markers |
Character vector of clustering algorithms (either |
more.rna.marker.args |
Named list of arguments to pass to |
more.adt.marker.args |
Named list of arguments to pass to |
more.crispr.marker.args |
Named list of arguments to pass to |
BNPARAM |
A BiocNeighborParam instance specifying the nearest-neighbor search algorithm to use. |
num.threads |
Integer scalar specifying the number of threads to use in each step. |
This function is equivalent to:
Running quickRnaQc.se, quickAdtQc.se and/or quickCrisprQc.se, for quality control.
Subsetting x to only retain the high-quality cells in all modalities, based on filter.cells.
Running normalizeRnaCounts.se, normalizeAdtCounts.se and/or normalizeCrisprCounts.se, for normalization.
Running chooseRnaHvgs.se to identify highly variable genes.
Running runPca.se on the RNA and/or ADT data.
Running scaleByNeighbors.se if multiple modalities are present.
Running correctMnn.se if multiple batches are present in block.
Running runAllNeighborSteps.se to obtain t-SNE and UMAP coordinates, and to perform graph-based clustering.
Running clusterKmeans.se to perform k-means clustering, depending on kmeans.clusters.
Running scoreMarkers.se to compute markers for each modality based on one of the clusterings.
List containing:
x, a SingleCellExperiment that is a copy of the input x.
It is also decorated with the results of each analysis step - see Details.
markers, a list of list of DataFrames containing the marker statistics for each modality.
Each inner list corresponds to a modality (RNA, ADT, etc.) while each DataFrame corresponds to a cluster.
If no clusterings were generated, this is set to NULL.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("start") res <- analyze.se( sce, rna.qc.subsets=list(mito=grep("^mt-", rownames(sce))), num.threads=2 # keep R CMD check happy ) assayNames(res$x) reducedDimNames(res$x) colData(res$x) previewMarkers(res$markers$rna[[1]], "cohens.d.mean")library(SingleCellExperiment) sce <- getTestRnaData.se("start") res <- analyze.se( sce, rna.qc.subsets=list(mito=grep("^mt-", rownames(sce))), num.threads=2 # keep R CMD check happy ) assayNames(res$x) reducedDimNames(res$x) colData(res$x) previewMarkers(res$markers$rna[[1]], "cohens.d.mean")
Build a shared nearest neighbor (SNN) graph where each node is a cell. Edges are formed between cells that share one or more nearest neighbors, weighted by the number or ranking of those shared neighbors. If two cells are close together but have distinct sets of neighbors, the corresponding edge is downweighted as the two cells are unlikely to be part of the same neighborhood. In this manner, strongly weighted edges will only form within highly interconnected neighborhoods where many cells share the same neighbors. This provides a more sophisticated definition of similarity between cells compared to a simpler (unweighted) nearest neighbor graph that just focuses on immediate proximity.
buildSnnGraph( x, num.neighbors = NULL, weight.scheme = NULL, num.threads = NULL, BNPARAM = AnnoyParam(), as.pointer = FALSE )buildSnnGraph( x, num.neighbors = NULL, weight.scheme = NULL, num.threads = NULL, BNPARAM = AnnoyParam(), as.pointer = FALSE )
x |
For Alternatively, a named list of nearest-neighbor search results.
This should contain Alternatively, an index constructed by |
num.neighbors |
Integer specifying the number of neighbors to use to construct the graph. Larger values increase the connectivity of the graph and reduce the granularity of subsequent community detection steps, at the cost of speed. If This argument is ignored if |
weight.scheme |
String specifying the weighting scheme to use for constructing the SNN graph. This can be one of:
If |
num.threads |
Integer specifying the number of threads to use. If This argument is only used if |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use. This argument is only used if |
as.pointer |
Boolean indicating whether to return an external pointer for direct use in |
If as.pointer=FALSE, a list is returned containing:
vertices, an integer specifying the number of vertices in the graph (i.e., cells in x).
edges, an integer vector of 1-based indices for graph edges.
Pairs of values represent the endpoints of an (undirected) edge,
i.e., edges[1:2] form the first edge, edges[3:4] form the second edge and so on.
weights, a numeric vector of weights for each edge.
This has length equal to half the length of edges.
If as.pointer=TRUE, an external pointer to the graph structure is returned.
This can be directly used in clusterGraph.
Aaron Lun
The build_snn_graph function in https://libscran.github.io/scran_graph_cluster/.
clusterGraph, to define clusters (i.e., communities) from the graph.
clusterGraph.se, which builds an SNN graph from a SingleCellExperiment.
data <- matrix(rnorm(10000), ncol=1000) out <- buildSnnGraph(data) str(out) # We can use this to make an igraph::graph. g <- igraph::make_undirected_graph(out$edges, n = out$vertices) igraph::E(g)$weight <- out$weightdata <- matrix(rnorm(10000), ncol=1000) out <- buildSnnGraph(data) str(out) # We can use this to make an igraph::graph. g <- igraph::make_undirected_graph(out$edges, n = out$vertices) igraph::E(g)$weight <- out$weight
buildSnnGraph
Default parameters from the underlying C++ library.
These may be overridden by defaults in the buildSnnGraph function signature.
buildSnnGraphDefaults()buildSnnGraphDefaults()
Named list containing default values for various function arguments.
Aaron Lun
buildSnnGraphDefaults()buildSnnGraphDefaults()
Scale the size factors so they are centered at unity.
This ensures that the original scale of the counts is preserved in the normalized values from normalizeCounts,
which simplifies interpretation and ensures that any pseudo-count added prior to log-transformation has a predictable shrinkage effect.
centerSizeFactors(size.factors, block = NULL, mode = NULL)centerSizeFactors(size.factors, block = NULL, mode = NULL)
size.factors |
Numeric vector of size factors across cells. Invalid size factors (e.g., non-positive, non-finite) will be ignored. |
block |
Vector or factor of length equal to |
mode |
String specifying how to scale size factors across blocks.
This can be either If This argument is only used if |
"lowest" will compute the average size factor in each block, identify the lowest average across all blocks, and then scale all size factors by that value.
Here, our normalization strategy involves downscaling all blocks to match the coverage of the lowest-coverage block.
This is useful for datasets with big differences in coverage between blocks as it avoids egregious upscaling of low-coverage blocks.
Specifically, strong upscaling allows the log-transformation to ignore any shrinkage from the pseudo-count.
This is problematic as it inflates differences between cells at log-values derived from low counts, increasing noise and overstating log-fold changes.
Downscaling is safer as it allows the pseudo-count to shrink the log-differences between cells towards zero at low counts,
effectively sacrificing some information in the higher-coverage batches so that they can be compared to the low-coverage batches
(which is preferable to exaggerating the informativeness of the latter for comparison to the former).
"per-block" will compute the average size factor in each block, and then scale each size factor by the average of block to which it belongs.
The scaled size factors are identical to those obtained by separate invocations of 'center_size_factors()' on the size factors for each block.
This can be desirable to ensure consistency with independent analyses of each block - otherwise, the centering would depend on the size factors in other blocks.
However, any systematic differences in the size factors between blocks are lost, i.e., systematic changes in coverage between blocks will not be normalized.
Numeric vector of length equal to size.factors, containing the centered size factors.
Aaron Lun
The center_size_factors and center_size_factors_blocked functions in https://libscran.github.io/scran_norm/.
normalizeRnaCounts.se and related functions, which center the size factors prior to normalization of a SummarizedExperiment.
centerSizeFactors(runif(100)) centerSizeFactors(runif(100), block=sample(3, 100, replace=TRUE))centerSizeFactors(runif(100)) centerSizeFactors(runif(100), block=sample(3, 100, replace=TRUE))
centerSizeFactors
Default parameters from the underlying C++ library.
These may be overridden by defaults in the centerSizeFactors function signature.
centerSizeFactorsDefaults()centerSizeFactorsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
centerSizeFactorsDefaults()centerSizeFactorsDefaults()
Center size factors for endogenous genes and spike-in transcripts, following the rationale in centerSizeFactors.
This aims to preserve the scale of the counts in each feature set so that the normalized expression values are still comparable.
centerSpikeInFactors(endogenous, spike.ins, block = NULL, mode = NULL)centerSpikeInFactors(endogenous, spike.ins, block = NULL, mode = NULL)
endogenous |
Numeric vector of size factors for endogenous genes across cells. Invalid size factors (e.g., non-positive, non-finite) will be ignored. |
spike.ins |
List of numeric vectors.
Each vector should be of length equal to |
block |
Vector or factor of length equal to |
mode |
String specifying how to scale size factors across blocks, see the argument of the same name in If This argument is only used if |
This function is effectively a convenient wrapper around centerSizeFactors,
configured to ensure that the mean of each set of spike-in size factors is the same as that of the endogenous size factors.
The aim is to preserve the scale of the average abundances of the spike-ins relative to the endogenous genes for a valid comparison between feature sets.
Typically, we would use spike-in transcripts to infer some properties of endogenous genes of similar molarity, e.g., technical variance.
List containing:
endogenous, a numeric vector containing the centered size factors for the endogenous genes.
spike.ins, a named list of numeric vectors.
Each vector is named after an entry of spike.ins and contains centered size factors for the corresponding spike-in transcripts.
Aaron Lun
The center_spike_in_factors and center_spike_in_factors_blocked functions in https://libscran.github.io/scran_norm/.
endogenous <- runif(100) spike.ercc <- runif(100) centerSpikeInFactors(endogenous, list(ERCC = spike.ercc)) block <- sample(3, 100, replace=TRUE) centerSpikeInFactors(endogenous, list(ERCC = spike.ercc), block=block)endogenous <- runif(100) spike.ercc <- runif(100) centerSpikeInFactors(endogenous, list(ERCC = spike.ercc)) block <- sample(3, 100, replace=TRUE) centerSpikeInFactors(endogenous, list(ERCC = spike.ercc), block=block)
centerSpikeInFactors
Default parameters for centerSpikeInFactors
centerSpikeInFactorsDefaults()centerSpikeInFactorsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
centerSpikeInFactorsDefaults()centerSpikeInFactorsDefaults()
Choose highly variable genes (HVGs) based on a variance-related statistic.
This is typically used to subset the gene-cell matrix prior to calling runPca.
chooseHighlyVariableGenes( stats, top = NULL, larger = NULL, keep.ties = NULL, use.bound = NULL, bound = NULL )chooseHighlyVariableGenes( stats, top = NULL, larger = NULL, keep.ties = NULL, use.bound = NULL, bound = NULL )
stats |
Numeric vector of variances (or a related statistic) across all genes.
Typically, the residuals from |
top |
Integer specifying the number of top genes to retain.
Note that the actual number of retained genes may not be equal to If |
larger |
Boolean indicating whether larger values of If |
keep.ties |
Boolean indicating whether to keep tied values of If |
use.bound |
Boolean indicating whether to ignore genes with If |
bound |
Number specifying the lower bound (if If This argument is ignored if |
Integer vector containing the indices of genes in stats that are considered to be highly variable.
Aaron Lun
The choose_highly_variable_genes function in https://libscran.github.io/scran_variances/.
chooseRnaHvgs.se, which choose the HVGs from the residuals computed from a SummarizedExperiment.
resids <- rexp(10000) str(chooseHighlyVariableGenes(resids))resids <- rexp(10000) str(chooseHighlyVariableGenes(resids))
chooseHighlyVariableGenes
Default parameters from the underlying C++ library.
These may be overridden by defaults in the chooseHighlyVariableGenes function signature.
chooseHighlyVariableGenesDefaults()chooseHighlyVariableGenesDefaults()
Named list containing default values for various function arguments.
Aaron Lun
chooseHighlyVariableGenesDefaults()chooseHighlyVariableGenesDefaults()
Choose a suitable pseudo-count to control the bias introduced by log-transformation of normalized counts from normalizeCounts.
Larger pseudo-counts shrink log-expression values towards the zero-expression baseline,
reducing the impact of the transformation bias at the cost of some sensitivity.
choosePseudoCount( size.factors, quantile = NULL, max.bias = NULL, min.value = NULL )choosePseudoCount( size.factors, quantile = NULL, max.bias = NULL, min.value = NULL )
size.factors |
Numeric vector of size factors for all cells.
It is expected that these have already been centered, e.g., with |
quantile |
Number specifying the quantile to use for finding the smallest/largest size factors. Setting this to zero will use the observed minimum and maximum, though in practice, this is usually too sensitive to outliers. If |
max.bias |
Number specifying the maximum allowed bias. This is the maximum absolute value of any spurious log2-fold change between the cells with the smallest and largest size factors. If |
min.value |
Number specifying the minimum value for the pseudo-count. If |
A choice of pseudo-count for normalizeCounts.
Aaron Lun
Lun ATL (2018). Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. biorXiv doi:10.1101/404962
choose_pseudo_count in https://libscran.github.io/scran_norm/.
sf <- centerSizeFactors(runif(100)) choosePseudoCount(sf) choosePseudoCount(sf, quantile=0.01) choosePseudoCount(sf, max.bias=0.5)sf <- centerSizeFactors(runif(100)) choosePseudoCount(sf) choosePseudoCount(sf, quantile=0.01) choosePseudoCount(sf, max.bias=0.5)
choosePseudoCount
Default parameters from the underlying C++ library.
These may be overridden by defaults in the choosePseudoCount function signature.
choosePseudoCountDefaults()choosePseudoCountDefaults()
Named list containing default values for various function arguments.
Aaron Lun
choosePseudoCountDefaults()choosePseudoCountDefaults()
Model the mean-variance relationship across genes and choose highly variable genes (HVGs) based on the residuals of the fitted trend.
This calls modelGeneVariances on an assay of a SummarizedExperiment,
and then calls chooseHighlyVariableGenes on the residuals.
chooseRnaHvgs.se( x, block = NULL, num.threads = 1, more.var.args = list(), top = 4000, more.choose.args = list(), assay.type = "logcounts", output.prefix = NULL, include.per.block = FALSE ) formatModelGeneVariancesResult( model.res, choose.res = NULL, include.per.block = FALSE )chooseRnaHvgs.se( x, block = NULL, num.threads = 1, more.var.args = list(), top = 4000, more.choose.args = list(), assay.type = "logcounts", output.prefix = NULL, include.per.block = FALSE ) formatModelGeneVariancesResult( model.res, choose.res = NULL, include.per.block = FALSE )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells. |
block |
Block assignment for each cell, to pass to |
num.threads |
Number of threads, to pass to |
more.var.args |
Named list of arguments to pass to |
top |
Number of HVGs to choose, to pass to |
more.choose.args |
Named list of arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.prefix |
String containing a prefix to add to the names of the |
include.per.block |
Logical scalar indicating whether the per-block statistics should be stored in the output |
model.res |
List returned by |
choose.res |
Integer vector returned by |
For chooseRnaHvgs.se, x is returned with the per-gene variance modelling statistics added to its rowData.
The hvg column in the rowData indicates whether a gene was chosen as a HVG.
If include.per.block=TRUE and block is specified, the per-block statistics are stored as a nested DataFrame in the per.block column.
For formatModelGeneVariancesResult, a DataFrame is returned with the per-gene variance modelling statistics.
If choose.res is provided, a hvg column is also stored that indicates whether a gene was chosen as a HVG.
Aaron Lun
chooseRnaHvgsWithSpikeIns.se, to choose HVGs with technical noise estimates from spike-in data.
library(SingleCellExperiment) sce <- getTestRnaData.se("norm") sce <- chooseRnaHvgs.se(sce) summary(rowData(sce)$hvg) plot(rowData(sce)$means, rowData(sce)$variances, col=factor(rowData(sce)$hvg)) curve(approxfun(rowData(sce)$means, rowData(sce)$fitted)(x), col="dodgerblue", add=TRUE)library(SingleCellExperiment) sce <- getTestRnaData.se("norm") sce <- chooseRnaHvgs.se(sce) summary(rowData(sce)$hvg) plot(rowData(sce)$means, rowData(sce)$variances, col=factor(rowData(sce)$hvg)) curve(approxfun(rowData(sce)$means, rowData(sce)$fitted)(x), col="dodgerblue", add=TRUE)
Fit a mean-variance trend to spike-in transcripts, and use this trend to estimate the technical noise for endogenous genes of similar abundance. The biological component of variation are defined as the residuals from the trend and used to select highly variable genes.
chooseRnaHvgsWithSpikeIns.se( x, spike.altexp, block = NULL, num.threads = 1, more.endogenous.var.args = list(), more.spike.var.args = list(use.min.width = FALSE), top = 4000, more.choose.args = list(), assay.type = "logcounts", output.prefix = NULL, include.per.block = FALSE )chooseRnaHvgsWithSpikeIns.se( x, spike.altexp, block = NULL, num.threads = 1, more.endogenous.var.args = list(), more.spike.var.args = list(use.min.width = FALSE), top = 4000, more.choose.args = list(), assay.type = "logcounts", output.prefix = NULL, include.per.block = FALSE )
x |
A SingleCellExperiment or one of its subclasses.
The main experiment should contain count data for endogenous genes, where rows correspond to genes and columns correspond to cells.
There should be one or more alternative experiments containing spike-in data, see |
spike.altexp |
Integer or string specifying the name/index of the alternative experiment containing the spike-in data.
The assay to use is determined by Alternatively, |
block |
Block assignment for each cell, to pass to |
num.threads |
Number of threads, to pass to |
more.endogenous.var.args |
Named list of additional arguments to pass to |
more.spike.var.args |
Named list of additional arguments to pass to |
top |
Number of HVGs to choose, to pass to |
more.choose.args |
Named list of arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.prefix |
String containing a prefix to add to the names of the |
include.per.block |
Logical scalar indicating whether the per-block statistics should be stored in the output |
It is generally expected that normalization has been performed with normalizeRnaCountsWithSpikeIns.se.
This ensures that the means are comparable between endogenous genes and spike-in transcripts.
We set use.min.width=FALSE by default for the spike-in trend fit,
as (i) there are typically too few spike-in transcripts for the default choice of min.window.count,
and (ii) spike-in sets typically have regularly-spaced abundances that are amenable to regular LOWESS.
x is returned with the per-gene variance modelling statistics for endogenous genes added to its rowData.
This is equivalent to the columns returned by modelGeneVariances, except that:
fitted now contains the interpolated values from the mean-variance trend fitted to the spike-in transcripts.
This represents the technical component of variation for each gene.
residuals now contains the difference of each gene's variance from the trend.
This represents the biological component of variation for each gene.
The hvg column in the rowData indicates whether a gene was chosen as a HVG based on the largest residuals.
If include.per.block=TRUE and block is specified, the per-block statistics are stored as a nested DataFrame in the per.block column.
Per-transcript modelling statistics are also added to the rowData of the alternative experiment containing the spike-in data.
This has all of the same columns returned by modelGeneVariances.
Aaron Lun
chooseRnaHvgs.se, for a simpler function when spike-ins are not available.
library(SingleCellExperiment) sce <- getTestRnaData.se("qc") sce <- normalizeRnaCountsWithSpikeIns.se(sce, "ERCC") sce <- chooseRnaHvgsWithSpikeIns.se(sce, "ERCC") summary(rowData(sce)$hvg) plot(rowData(sce)$means, rowData(sce)$variances, col=factor(rowData(sce)$hvg)) spike.rd <- rowData(altExp(sce, "ERCC")) points(spike.rd$means, spike.rd$variances, col="dodgerblue", pch=4) curve(approxfun(spike.rd$means, spike.rd$fitted)(x), col="dodgerblue", add=TRUE)library(SingleCellExperiment) sce <- getTestRnaData.se("qc") sce <- normalizeRnaCountsWithSpikeIns.se(sce, "ERCC") sce <- chooseRnaHvgsWithSpikeIns.se(sce, "ERCC") summary(rowData(sce)$hvg) plot(rowData(sce)$means, rowData(sce)$variances, col=factor(rowData(sce)$hvg)) spike.rd <- rowData(altExp(sce, "ERCC")) points(spike.rd$means, spike.rd$variances, col="dodgerblue", pch=4) curve(approxfun(spike.rd$means, spike.rd$fitted)(x), col="dodgerblue", add=TRUE)
Identify clusters by applying community detection algorithms to a graph. This assumes that that the nodes on the graph represent cells and weighted edges are formed between related cells.
clusterGraph( x, method = c("multilevel", "leiden", "walktrap"), seed = NULL, multilevel.resolution = NULL, multilevel.seed = seed, leiden.resolution = NULL, leiden.seed = seed, leiden.objective = "modularity", walktrap.steps = NULL )clusterGraph( x, method = c("multilevel", "leiden", "walktrap"), seed = NULL, multilevel.resolution = NULL, multilevel.seed = seed, leiden.resolution = NULL, leiden.seed = seed, leiden.objective = "modularity", walktrap.steps = NULL )
x |
List containing graph information or an external pointer to a graph, as returned by |
method |
String specifying the algorithm to use.
|
seed |
Integer specifying the random seed for some of community detection algorithms. |
multilevel.resolution |
Number specifying the resolution when If |
multilevel.seed |
Integer specifying the random seed to use for If |
leiden.resolution |
Number specifying the resolution when If |
leiden.seed |
Integer specifying the random seed to use for If |
leiden.objective |
String specifying the objective function when If |
walktrap.steps |
Integer specifying the number of steps to use when If |
A list containing membership, a factor containing the cluster assignment for each cell.
Additional fields may be present depending on the method:
For method="multilevel", the levels list contains the clustering result at each level of the algorithm.
A modularity numeric vector also contains the modularity at each level, the highest of which corresponds to the reported membership.
For method="leiden", the quality number contains the quality of the partitioning.
For method="walktrap", the merges matrix specifies the pair of cells or clusters that were merged at each step of the algorithm.
The modularity number also contains the modularity of the final partitioning.
Aaron Lun
The various cluster_* functions in https://libscran.github.io/scran_graph_cluster/.
clusterGraph.se, which performs clustering on graph constructed from a SingleCellExperiment.
data <- matrix(rnorm(10000), ncol=1000) gout <- buildSnnGraph(data) str(gout) str(clusterGraph(gout)) str(clusterGraph(gout, method="leiden")) str(clusterGraph(gout, method="walktrap"))data <- matrix(rnorm(10000), ncol=1000) gout <- buildSnnGraph(data) str(gout) str(clusterGraph(gout)) str(clusterGraph(gout, method="leiden")) str(clusterGraph(gout, method="walktrap"))
Construct a shared-nearest neighbor (SNN) graph from an existing low-dimensional embedding
by calling buildSnnGraph on a reduced dimension entry in a SingleCellExperiment.
Then, apply community detection algorithms to obtain clusters of cells with clusterGraph.
clusterGraph.se( x, num.neighbors = 10, num.threads = 1, more.build.args = list(), method = "multilevel", resolution = NULL, more.cluster.args = list(), reddim.type = "PCA", output.name = "clusters", meta.name = NULL, graph.name = NULL )clusterGraph.se( x, num.neighbors = 10, num.threads = 1, more.build.args = list(), method = "multilevel", resolution = NULL, more.cluster.args = list(), reddim.type = "PCA", output.name = "clusters", meta.name = NULL, graph.name = NULL )
x |
A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
num.neighbors |
Number of neighbors for constructing the graph, passed to |
num.threads |
Number of threads for graph construction, passed to |
more.build.args |
Named list of further arguments to be passed to |
method |
Clustering method to use, passed to |
resolution |
Resolution for the community detection method in |
more.cluster.args |
Named list of further arguments to be passed to |
reddim.type |
Integer or string specifying the existing embedding in the |
output.name |
String containing the name of the column of the |
meta.name |
String containing the name of the |
graph.name |
String containing the name of the |
x is returned with the cluster assignment for each cell stored in the colData.
Additional clustering output is stored in the metadata.
Aaron Lun
sce <- getTestRnaData.se("pca") sce <- clusterGraph.se(sce) table(sce$clusters)sce <- getTestRnaData.se("pca") sce <- clusterGraph.se(sce) table(sce$clusters)
clusterGraph
Default parameters from the underlying C++ library.
Some of these may be overridden by defaults in the clusterGraph function signature.
clusterGraphDefaults()clusterGraphDefaults()
Named list containing default values for various function arguments.
Aaron Lun
clusterGraphDefaults()clusterGraphDefaults()
Perform k-means clustering with a variety of different initialization and refinement algorithms.
clusterKmeans( x, k, init.method = c("var-part", "kmeans++", "random"), refine.method = c("hartigan-wong", "lloyd"), warn = TRUE, seed = NULL, random.seed = seed, kmeanspp.seed = seed, var.part.optimize.partition = NULL, var.part.size.adjustment = NULL, lloyd.iterations = NULL, hartigan.wong.iterations = NULL, hartigan.wong.quick.transfer.iterations = NULL, hartigan.wong.quit.quick.transfer.failure = NULL, num.threads = NULL )clusterKmeans( x, k, init.method = c("var-part", "kmeans++", "random"), refine.method = c("hartigan-wong", "lloyd"), warn = TRUE, seed = NULL, random.seed = seed, kmeanspp.seed = seed, var.part.optimize.partition = NULL, var.part.size.adjustment = NULL, lloyd.iterations = NULL, hartigan.wong.iterations = NULL, hartigan.wong.quick.transfer.iterations = NULL, hartigan.wong.quit.quick.transfer.failure = NULL, num.threads = NULL )
x |
Matrix-like object where rows are dimensions and columns are cells.
This is typically a dense double-precision matrix containing a low-dimensional representation from, e.g., |
k |
Integer specifying the number of clusters. |
init.method |
String specifying the initialization method for the centers:
|
refine.method |
String specifying the refinement method.
|
warn |
Boolean specifying whether a warning should be emitted if the k-means algorithm failed to converge. |
seed |
Integer specifying the seed for random number generation in some of the initialization methods. |
random.seed |
Integer specifying the seed for random number generation when If |
kmeanspp.seed |
Integer specifying the seed for random number generation when If |
var.part.optimize.partition |
Boolean indicating whether each partition boundary should be optimized in If |
var.part.size.adjustment |
Number specifying the cluster size adjustment when selecting the next cluster to partition in If |
lloyd.iterations |
Integer specifying the maximum number of iterations for If |
hartigan.wong.iterations |
Integer specifying the maximum number of iterations for If |
hartigan.wong.quick.transfer.iterations |
Integer specifying the maximum number of quick transfer iterations for If |
hartigan.wong.quit.quick.transfer.failure |
Boolean indicating whether to quit kon convergence failure during quick transfer iterations in If |
num.threads |
Integer specifying the number of threads to use. If |
List containing:
clusters, a factor containing the cluster assignment for each cell.
The number of levels is no greater than k, where each level is an integer that refer to a column of centers.
centers, a numeric matrix with the coordinates of the cluster centroids (dimensions in rows, centers in columns).
The number of columns is no greater than k.
Empty clusters are automatically removed.
iterations, an integer specifying the number of refinement iterations that were performed.
status, an integer specifying the completion status of the algorithm.
A value of zero indicates success while the meaning of any non-zero value depends on the choice of refine.method:
For Lloyd, a value of 2 indicates convergence failure.
For Hartigan-Wong, a value of 2 indicates convergence failure in the optimal transfer iterations.
A value of 4 indicates convergence failure in the quick transfer iterations when hartigan.wong.quit.quick.transfer.failure = TRUE.
Aaron Lun
Hartigan JA. and Wong MA (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.
Arthur D and Vassilvitskii S (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms 1027-1035.
Su T and Dy JG (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering. Intelligent Data Analysis 11, 319-338.
https://libscran.github.io/kmeans/, which describes the various initialization and refinement algorithms in more detail.
clusterKmeans.se, for k-means clustering on a SingleCellExperiment.
x <- t(as.matrix(iris[,1:4])) clustering <- clusterKmeans(x, k=3) table(clustering$clusters, iris[,"Species"])x <- t(as.matrix(iris[,1:4])) clustering <- clusterKmeans(x, k=3) table(clustering$clusters, iris[,"Species"])
Perform k-means clustering on an existing low-dimensional embedding
by calling clusterKmeans on a reduced dimension entry in a SingleCellExperiment.
clusterKmeans.se( x, k, num.threads = 1, more.kmeans.args = list(), reddim.type = "PCA", output.name = "clusters", meta.name = NULL )clusterKmeans.se( x, k, num.threads = 1, more.kmeans.args = list(), reddim.type = "PCA", output.name = "clusters", meta.name = NULL )
x |
A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
k |
Number of clusters, passed to |
num.threads |
Number of threads, passed to |
more.kmeans.args |
Named list of further arguments to be passed to |
reddim.type |
Integer or string specifying the existing embedding in the |
output.name |
String containing the name of the |
meta.name |
String containing the name of the |
x is returned with the cluster assignment for each cell stored in the colData.
Additional clustering output is stored in the metadata.
Aaron Lun
sce <- getTestRnaData.se("pca") sce <- clusterKmeans.se(sce, k=10) table(sce$clusters)sce <- getTestRnaData.se("pca") sce <- clusterKmeans.se(sce, k=10) table(sce$clusters)
clusterKmeans
Default parameters from the underlying C++ library.
These may be overridden by defaults in the clusterKmeans function signature.
clusterKmeansDefaults()clusterKmeansDefaults()
Named list containing default values for various function arguments.
Aaron Lun
clusterKmeansDefaults()clusterKmeansDefaults()
Combine multiple factors into a single factor where each level of the latter is a unique combination of levels from the former.
combineFactors(factors, keep.unused = FALSE)combineFactors(factors, keep.unused = FALSE)
factors |
An ordinary list or List of vectors or factors of the same length. Corresponding elements across all vectors/factors represent the combination of levels for a single observation. For factors, any existing levels are respected. For other vectors, the sorted and unique values are used as levels. Alternatively, a data frame or DataFrame where each column is a vector or factor and each row corresponds to an observation. |
keep.unused |
Boolean indicating whether to report unused combinations of levels. |
List containing:
levels, a DataFrame containing the sorted and unique combinations of levels from factors.
Each column corresponds to a factor in factors while each row corresponds to a unqiue combination.
index, an integer vector specifying the index into levels for each observation.
For observation i and factor j, levels[[[j]][index[i]] will recover factors[[j]][i].
Aaron Lun
The combine_to_factor function in https://libscran.github.io/factorize/.
combineFactors(list( sample(LETTERS[1:5], 100, replace=TRUE), sample(3, 100, replace=TRUE) )) combineFactors(list( factor(sample(LETTERS[1:5], 10, replace=TRUE), LETTERS[1:5]), factor(sample(5, 10, replace=TRUE), 1:5) ), keep.unused=TRUE)combineFactors(list( sample(LETTERS[1:5], 100, replace=TRUE), sample(3, 100, replace=TRUE) )) combineFactors(list( factor(sample(LETTERS[1:5], 10, replace=TRUE), LETTERS[1:5]), factor(sample(5, 10, replace=TRUE), 1:5) ), keep.unused=TRUE)
Compute a weight for each block based on the number of cells in each block. This is typically used to aggregate statistics across blocks, e.g., with weighted sums/averages.
computeBlockWeights( sizes, block.weight.policy = c("variable", "equal", "size", "none"), variable.block.weight = NULL )computeBlockWeights( sizes, block.weight.policy = c("variable", "equal", "size", "none"), variable.block.weight = NULL )
sizes |
Numeric vector containing the size of (i.e., number of cells in) each block. |
block.weight.policy |
String specifying the policy to use for weighting different blocks. This should be one of:
|
variable.block.weight |
Numeric vector of length 2, specifying the parameters for variable block weighting. The first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. If This argument is only used if |
Numeric vector containing the relative block weights.
Aaron Lun
The compute_weights function from https://libscran.github.io/scran_blocks/.
computeBlockWeights(c(1, 10, 100, 1000, 10000)) computeBlockWeights(c(1, 10, 100, 1000, 10000), block.weight.policy="equal") computeBlockWeights(c(1, 10, 100, 1000, 10000), variable.block.weight=c(50, 5000))computeBlockWeights(c(1, 10, 100, 1000, 10000)) computeBlockWeights(c(1, 10, 100, 1000, 10000), block.weight.policy="equal") computeBlockWeights(c(1, 10, 100, 1000, 10000), variable.block.weight=c(50, 5000))
computeBlockWeights
Default parameters from the underlying C++ library.
These may be overridden by defaults in the computeBlockWeights function signature.
computeBlockWeightsDefaults()computeBlockWeightsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
computeBlockWeightsDefaults()computeBlockWeightsDefaults()
Compute size factors from an ADT count matrix using the CLRm1 method. This is a variant of the centered log-ratio (CLR) method, where the size factors are defined from the geometric mean of counts within each cell.
computeClrm1Factors(x, num.threads = NULL)computeClrm1Factors(x, num.threads = NULL)
x |
A matrix-like object containing ADT count data. Rows correspond to tags and columns correspond to cells. |
num.threads |
Number of threads to use. If |
Numeric vector containing the CLRm1 size factor for each cell.
Note that these size factors are not centered and should be passed through, e.g., centerSizeFactors before normalization.
Aaron Lun
https://github.com/libscran/clrm1, for a description of the CLRm1 method.
normalizeAdtCounts.se, which computes CLRm1 factors prior to normalization.
library(Matrix) x <- abs(rsparsematrix(1000, 100, 0.1) * 10) head(computeClrm1Factors(x))library(Matrix) x <- abs(rsparsematrix(1000, 100, 0.1) * 10) head(computeClrm1Factors(x))
computeClrm1Factors
Default parameters from the underlying C++ library.
These may be overridden by defaults in the computeClrm1Factors function signature.
computeClrm1FactorsDefaults()computeClrm1FactorsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
computeClrm1FactorsDefaults()computeClrm1FactorsDefaults()
Apply mutual nearest neighbor (MNN) correction to remove batch effects from a low-dimensional embedding.
correctMnn( x, block, num.neighbors = NULL, num.steps = NULL, merge.policy = NULL, num.mads = NULL, robust.iterations = NULL, robust.trim = NULL, mass.cap = NULL, order = NULL, reference.policy = NULL, BNPARAM = AnnoyParam(), num.threads = NULL )correctMnn( x, block, num.neighbors = NULL, num.steps = NULL, merge.policy = NULL, num.mads = NULL, robust.iterations = NULL, robust.trim = NULL, mass.cap = NULL, order = NULL, reference.policy = NULL, BNPARAM = AnnoyParam(), num.threads = NULL )
x |
Numeric matrix where rows are dimensions and columns are cells,
typically containing coordinates in a low-dimensional embedding (e.g., from |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in |
num.neighbors |
Integer specifying the number of neighbors in the various search steps. Larger values improve the stability of the correction by increasing the number of MNN pairs and including more observations in each center of mass. However, this comes at the cost of reduced resolution when matching subpopulations across batches. If |
num.steps |
Integer specifying the number of steps for the recursive neighbor search to compute the center of mass. Larger values mitigate the kissing effect but increase the risk of including inappropriately distant subpopulations into the center of mass. If |
merge.policy |
String specifying the policy to use to choose the order of batches to merge.
If |
num.mads |
Deprecated and ignored. |
robust.iterations |
Deprecated and ignored. |
robust.trim |
Deprecated and ignored. |
mass.cap |
Deprecated and ignored. |
order |
Deprecated and ignored, the merge order is now always automatically determined. |
reference.policy |
Deprecated, use |
BNPARAM |
A BiocNeighborParam object specifying the nearest-neighbor algorithm to use. |
num.threads |
Integer specifying the number of threads to use. If |
List containing:
corrected, a numeric matrix of the same dimensions as x.
This contains the MNN-corrected coordinates for each cell (column) across dimensions (rows).
Aaron Lun
Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421-427
The compute function in https://libscran.github.io/mnncorrect/.
correctMnn.se, to perform MNN correction on a SingleCellExperiment.
# Mocking up a dataset with multiple batches. x <- matrix(rnorm(10000), nrow=10) b <- sample(3, ncol(x), replace=TRUE) x[,b==2] <- x[,b==2] + 3 x[,b==3] <- x[,b==3] + 5 lapply(split(colMeans(x), b), mean) # indeed the means differ... corrected <- correctMnn(x, b) str(corrected) lapply(split(colMeans(corrected$corrected), b), mean) # now merged.# Mocking up a dataset with multiple batches. x <- matrix(rnorm(10000), nrow=10) b <- sample(3, ncol(x), replace=TRUE) x[,b==2] <- x[,b==2] + 3 x[,b==3] <- x[,b==3] + 5 lapply(split(colMeans(x), b), mean) # indeed the means differ... corrected <- correctMnn(x, b) str(corrected) lapply(split(colMeans(corrected$corrected), b), mean) # now merged.
Correct batch effects from an existing embedding with mutual nearest neighbors (MNNs),
by calling correctMnn on a reduced dimension entry of a SingleCellExperiment.
correctMnn.se( x, block, BNPARAM = AnnoyParam(), num.threads = 1, more.mnn.args = list(), reddim.type = "PCA", output.name = "MNN", delayed.transpose = FALSE )correctMnn.se( x, block, BNPARAM = AnnoyParam(), num.threads = 1, more.mnn.args = list(), reddim.type = "PCA", output.name = "MNN", delayed.transpose = FALSE )
x |
A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
block |
Block assignment for each cell, passed to |
BNPARAM |
Algorithm for the nearest neighbor search, passed to |
num.threads |
Number of threads, passed to |
more.mnn.args |
Named list of additional arguments to pass to |
reddim.type |
String or integer specifying the |
output.name |
String containing the name of the |
delayed.transpose |
Logical scalar indicating whether to delay the transposition when storing coordinates in the |
x is returned with the corrected embedding stored as a reducedDim entry.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("pca") # Treating the tissue of origin as the batch. sce <- correctMnn.se(sce, sce$tissue) reducedDimNames(sce)library(SingleCellExperiment) sce <- getTestRnaData.se("pca") # Treating the tissue of origin as the batch. sce <- correctMnn.se(sce, sce$tissue) reducedDimNames(sce)
correctMnn
Default parameters from the underlying C++ library.
These may be overridden by defaults in the correctMnn function signature.
correctMnnDefaults()correctMnnDefaults()
Named list containing default values for various function arguments.
Aaron Lun
correctMnnDefaults()correctMnnDefaults()
Tabulate the frequency of cells in each combination of groups and blocks. This is typically used to examine the distribution of cells across batches for each cluster - the presence of a batch-specific cluster may be indicative of a batch effect.
countGroupsByBlock( groups, block, normalize.block = FALSE, normalize.groups = FALSE )countGroupsByBlock( groups, block, normalize.block = FALSE, normalize.groups = FALSE )
groups |
Factor specifying the group to which each cell was assigned. This is typically used for clusters. |
block |
Factor specifying the block to which each cell was assigned. This is typically used for batches or samples. |
normalize.block |
Boolean indicating whether to normalize the number of cells across blocks.
If |
normalize.groups |
Boolean indicating whether to normalize the number of cells across groups.
If |
Matrix of (normalized) frequencies. Each row corresponds to a group and each column corresponds to a block.
Aaron Lun
table, which is used internally by this function.
groups <- sample(10, 100, replace=TRUE) block <- sample(LETTERS[1:6], 100, replace=TRUE) countGroupsByBlock(groups, block) countGroupsByBlock(groups, block, normalize.block=TRUE) countGroupsByBlock(groups, block, normalize.groups=TRUE) countGroupsByBlock(groups, block, normalize.block=TRUE, normalize.groups=TRUE)groups <- sample(10, 100, replace=TRUE) block <- sample(LETTERS[1:6], 100, replace=TRUE) countGroupsByBlock(groups, block) countGroupsByBlock(groups, block, normalize.block=TRUE) countGroupsByBlock(groups, block, normalize.groups=TRUE) countGroupsByBlock(groups, block, normalize.block=TRUE, normalize.groups=TRUE)
Compute per-cell QC metrics from an initialized matrix of CRISPR counts, and use the metrics to suggest filter thresholds to retain high-quality cells.
computeCrisprQcMetrics(x, num.threads = NULL) suggestCrisprQcThresholds( metrics, block = NULL, num.mads = NULL, max.value.num.mads = num.mads ) filterCrisprQcMetrics(thresholds, metrics, block = NULL)computeCrisprQcMetrics(x, num.threads = NULL) suggestCrisprQcThresholds( metrics, block = NULL, num.mads = NULL, max.value.num.mads = num.mads ) filterCrisprQcMetrics(thresholds, metrics, block = NULL)
x |
A matrix-like object where rows are CRISPRs and columns are cells. Values are expected to be counts. |
num.threads |
Integer scalar specifying the number of threads to use.
If |
metrics |
DataFrame of per-cell QC metrics.
This should have the same structure as the return value of |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in For |
num.mads |
Number of median from the median, to define the threshold for outliers in each metric. |
max.value.num.mads |
Number of median from the median, to define the threshold for the maximum count in each cell.
If |
thresholds |
List with the same structure as produced by |
In CRISPR data, a cell is considered to be of low quality if it has a low count for its most abundant guide. However, directly defining a MAD-based outlier threshold on the maximum count is somewhat tricky as unsuccessful transfection can be common. This often results in a large subpopulation with low maximum counts, inflating the MAD and compromising the threshold calculation. Instead, we use the following approach:
Compute the proportion of counts in the most abundant guide (i.e., the maximum proportion) in each cell. Cells that were successfully transfected should have high maximum proportions. In contrast, unsuccessfully transfected cells will be dominated by ambient contamination and have low proportions.
Subset the dataset to only retain those cells with maximum proportions above the median. This assumes that at least 50 Thus, we remove all of the unsucessful transfections and enrich for mostly-high-quality cells.
Define a MAD-based threshold for low outliers on the log-transformed maximum count within the subset (see 'choose_filter_thresholds()' for details). This is now possible as we can assume that most of the remaining cells are of high quality.
Note that the maximum proportion is only used to define the subset for threshold calculation. Once the maximum count threshold is computed, it is applied to all cells regardless of their maximum proportions. This ensures that we correctly remove cells with low coverage, even if the proportion is high. It also allows us to retain cells transfected with multiple guides, as long as the maximum is high enough - such cells are not necessarily uninteresting, e.g., for examining interaction effects, so we will err on the side of caution and leave them in.
For computeCrisprQcMetrics, a DataFrame is returned with one row per cell in x.
This contains the following columns:
sum, a numeric vector containing the total CRISPR count for each cell.
Low counts indicate that the cell was not successfully transfected with a construct or that library preparation and sequencing failed.
detected, an integer vector containing the number of detected guides per cell.
In theory, this should be 1, as each cell should express no more than one guide construct.
However, ambient contamination may introduce non-zero counts for multiple guides, without necessarily interfering with downstream analyses.
As such, this metric is less useful for guide data, though we compute it anyway.
max.value, a numeric vector containing the count for the most abundant guide in cell.
Low values indicate that the cell was not successfully transfected or that library preparation and sequencing failed.
max.index, an integer vector containing the row index of the most abundant guide in cell.
Each vector is of length equal to the number of cells.
For suggestCrisprQcThresholds, a named list is returned.
If block=NULL, the list contains:
max.value, a numeric scalar containing the lower bound on the maximum count.
This is defined as num.mads MADs below the median of the log-transformed metrics across cells with high maximum proportions (see Details).
Otherwise, if block is supplied, the list contains:
max.value, a numeric vector containing the lower bound on the maximum counts for each blocking level.
Here, the threshold is computed independently for each block, using the same method as the unblocked case.
block.ids, a vector containing the identities of the unique blocks.
Each vector is of length equal to the number of levels in block and is named accordingly.
For filterCrisprQcMetrics, a logical vector of length ncol(x) is returned indicating which cells are of high quality.
High-quality cells are defined as those with maximum counts above the max.value threshold.
Aaron Lun
The compute_crispr_qc_metrics, compute_crispr_qc_filters and compute_crispr_qc_filters_blocked functions in https://libscran.github.io/scran_qc/.
quickCrisprQc.se, to run all of the CRISPR-related QC functions on a SummarizedExperiment.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(100, 100, 0.1) * 100)) qc <- computeCrisprQcMetrics(x) qc filt <- suggestCrisprQcThresholds(qc) str(filt) keep <- filterCrisprQcMetrics(filt, qc) summary(keep)# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(100, 100, 0.1) * 100)) qc <- computeCrisprQcMetrics(x) qc filt <- suggestCrisprQcThresholds(qc) str(filt) keep <- filterCrisprQcMetrics(filt, qc) summary(keep)
Default parameters from the underlying C++ library. These may be overridden by defaults in each function's signature.
computeCrisprQcMetricsDefaults() suggestCrisprQcThresholdsDefaults()computeCrisprQcMetricsDefaults() suggestCrisprQcThresholdsDefaults()
Named list containing default values for the various function arguments.
Aaron Lun
computeCrisprQcMetricsDefaults() suggestCrisprQcThresholdsDefaults()computeCrisprQcMetricsDefaults() suggestCrisprQcThresholdsDefaults()
Fit a trend to the per-gene variances with respect to their means, typically from normalized and log-transformed expression values.
fitVarianceTrend( means, variances, mean.filter = NULL, min.mean = NULL, transform = NULL, span = NULL, use.min.width = NULL, min.width = NULL, min.window.count = NULL, num.threads = NULL )fitVarianceTrend( means, variances, mean.filter = NULL, min.mean = NULL, transform = NULL, span = NULL, use.min.width = NULL, min.width = NULL, min.window.count = NULL, num.threads = NULL )
means |
Numeric vector containing the mean (log-)expression for each gene. |
variances |
Numeric vector containing the variance in the (log-)expression for each gene. |
mean.filter |
Logical scalar indicating whether to filter on the means before trend fitting. The assumption is that there is a bulk of low-abundance genes that are uninteresting and should be removed to avoid skewing the windows of the LOWESS smoother. If |
min.mean |
Numeric scalar specifying the minimum mean of genes to use in trend fitting. Genes with lower means do not participate in the LOWESS fit, to ensure that windows are not skewed towards the majority of low-abundance genes. Instead, the fitted values for these genes are defined by extrapolating the left edge of the fitted trend is extrapolated to the origin. If This argument is ignored if |
transform |
Logical scalar indicating whether a quarter-root transformation should be applied before trend fitting.
This transformation is copied from If |
span |
Numeric scalar specifying the span of the LOWESS smoother, as a proportion of the total number of points. Larger values improve stability at the cost of sensitivity to changes in low-density regions. If This argument is ignored if |
use.min.width |
Logical scalar indicating whether a minimum width constraint should be applied to the LOWESS smoother. This replaces the proportion-based span for defining each window. Instead, the window for each point must be of a minimum width and is extended until it contains a minimum number of points. Setting this to 'TRUE' ensures that sensitivity is maintained in the trend fit at low-density regions for the distribution of means, e.g., at high abundances. It also avoids overfitting from very small windows in high-density intervals. If |
min.width |
Minimum width of the window to use when If |
min.window.count |
Minimum number of observations in each window. This ensures that each window contains at least a given number of observations for a stable fit. If the minimum width window contains fewer observations, it is extended using the standard LOWESS logic until the minimum number is achieved. If This argument is ignored if |
num.threads |
Number of threads to use. If |
List containing fitted, a numeric vector containing the fitted values of the trend for each gene;
and residuals, a numeric vector containing the residuals from the trend.
Aaron Lun
modelGeneVariances, to compute the means and variances on which the trend is fitted.
The fit_variance_trend function in https://libscran.github.io/scran_variances/.
# Setting up some single-cell-like data. mu <- 2^runif(1000, -10, 10) counts <- matrix(rpois(20 * length(mu), lambda=mu), ncol=20) sf <- centerSizeFactors(colSums(counts)) normalized <- normalizeCounts(counts, size.factors=sf) stats <- modelGeneVariances(normalized) out <- fitVarianceTrend(stats$statistics$means, stats$statistics$variances) plot(stats$statistics$means, stats$statistics$variances) curve(approxfun(stats$statistics$means, out$fitted)(x), col="red", add=TRUE)# Setting up some single-cell-like data. mu <- 2^runif(1000, -10, 10) counts <- matrix(rpois(20 * length(mu), lambda=mu), ncol=20) sf <- centerSizeFactors(colSums(counts)) normalized <- normalizeCounts(counts, size.factors=sf) stats <- modelGeneVariances(normalized) out <- fitVarianceTrend(stats$statistics$means, stats$statistics$variances) plot(stats$statistics$means, stats$statistics$variances) curve(approxfun(stats$statistics$means, out$fitted)(x), col="red", add=TRUE)
fitVarianceTrend
Default parameters from the underlying C++ library.
These may be overridden by defaults in the fitVarianceTrend function signature.
fitVarianceTrendDefaults()fitVarianceTrendDefaults()
Named list containing default values for the various function arguments.
Aaron Lun
fitVarianceTrendDefaults()fitVarianceTrendDefaults()
Get single-cell datasets from the scRNAseq package with varying levels of processing. This is primarily intended for testing other scrapper functions, e.g., in their Examples section.
getTestRnaData.se(at = c("start", "qc", "norm", "hvg", "pca", "cluster")) getTestAdtData.se(at = c("start", "qc", "norm", "hvg", "pca")) getTestCrisprData.se(at = c("start", "qc"))getTestRnaData.se(at = c("start", "qc", "norm", "hvg", "pca", "cluster")) getTestAdtData.se(at = c("start", "qc", "norm", "hvg", "pca")) getTestCrisprData.se(at = c("start", "qc"))
at |
String specifying the level of processing.
For |
For getTestRnaData, this is a scRNA-seq dataset of the mouse brain,
where the main experiment contains RNA counts and the alternative experiments contain ERCC and repeat element counts.
This is obtained with fetchDataset("zeisel-brain-2015", "2023-12-14").
For getTestAdtData, this is a CITE-seq dataset of human PBMCs,
where the main experiment contains RNA counts and the alternative experiment contains ADT counts.
This is obtained with fetchDataset("kotliarov-pbmc-2020", "2024-04-18").
Only the first 5000 cells are loaded for speed.
For getTestCrisprData, this is a Perturb-seq dataset of a pancreatic beta cell line,
where the main experiment contains RNA counts and the alternative experiment contains CRISPR guide counts.
This is obtained with fetchDataset("cao-pancreas-2025", "2025-10-10", "rqc").
Only the first 5000 cells are loaded for speed.
A SingleCellExperiment containing a dataset at the specified level of processing.
Aaron Lun
fetchDataset, used to obtain each dataset.
getTestRnaData.se() getTestAdtData.se() getTestCrisprData.se()getTestRnaData.se() getTestAdtData.se() getTestCrisprData.se()
Delayed calculation of log-normalized expression values, typically returned by normalizeCounts.
LogNormalizedMatrix(x, size.factors, pseudo.count = 1, log.base = 2) LogNormalizedMatrixSeed(x, size.factors, pseudo.count = 1, log.base = 2)LogNormalizedMatrix(x, size.factors, pseudo.count = 1, log.base = 2) LogNormalizedMatrixSeed(x, size.factors, pseudo.count = 1, log.base = 2)
x |
Count matrix to be normalized. |
size.factors |
Numeric vector of size factors, of length equal to the number of columns of |
pseudo.count |
Number specifying the pseudo-count to add prior to log-transformation. |
log.base |
Number specifying the base of the log-transformation. |
This is based on the DelayedArray framework and
An instance of a LogNormalizedMatrix(Seed).
Aaron Lun
mat <- matrix(rpois(1000, lambda=2), ncol=10) sf <- centerSizeFactors(colSums(mat)) norm <- LogNormalizedMatrix(mat, sf) norm # Also works with sparse matrices. library(Matrix) smat <- abs(rsparsematrix(50, 20, density=0.21)) ssf <- centerSizeFactors(colSums(smat)) snorm <- LogNormalizedMatrix(smat, ssf) snormmat <- matrix(rpois(1000, lambda=2), ncol=10) sf <- centerSizeFactors(colSums(mat)) norm <- LogNormalizedMatrix(mat, sf) norm # Also works with sparse matrices. library(Matrix) smat <- abs(rsparsematrix(50, 20, density=0.21)) ssf <- centerSizeFactors(colSums(smat)) snorm <- LogNormalizedMatrix(smat, ssf) snorm
Model the per-gene variances as a function of the mean in single-cell expression data. Highly variable genes can then be selected for downstream analyses.
modelGeneVariances( x, block = NULL, block.average.policy = NULL, block.weight.policy = NULL, variable.block.weight = NULL, block.quantile = NULL, fit.trend = NULL, mean.filter = NULL, min.mean = NULL, transform = NULL, span = NULL, use.min.width = NULL, min.width = NULL, min.window.count = NULL, num.threads = NULL )modelGeneVariances( x, block = NULL, block.average.policy = NULL, block.weight.policy = NULL, variable.block.weight = NULL, block.quantile = NULL, fit.trend = NULL, mean.filter = NULL, min.mean = NULL, transform = NULL, span = NULL, use.min.width = NULL, min.width = NULL, min.window.count = NULL, num.threads = NULL )
x |
A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells.
It is typically expected to contain log-expression values, e.g., from |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in |
block.average.policy |
String specifying the policy to use for average statistics across blocks.
This can either be If This argument is only used if |
block.weight.policy |
String specifying the policy to use for weighting different blocks when computing the average for each statistic.
See the argument of the same name in If This argument is only used if |
variable.block.weight |
Numeric vector of length 2, specifying the parameters for variable block weighting.
See the argument of the same name in If This argument is only used if |
block.quantile |
Number specifying the probability of the quantile of statistics across blocks. If This argument is only used if |
fit.trend |
Boolean indicating whether a mean-variance trend should be fitted.
If If |
mean.filter |
Logical scalar indicating whether to filter on the means before trend fitting.
See the argument of the same name in If This argument is only used if |
min.mean |
Number specifying the minimum mean of genes to use in trend fitting.
See the argument of the same name in This argument is only used if |
transform |
Logical scalar indicating whether a quarter-root transformation should be applied before trend fitting.
See the argument of the same name in If This argument is only used if |
span |
Numeric scalar specifying the span of the LOWESS smoother, as a proportion of the total number of points.
See the argument of the same name in If This argument is only used if |
use.min.width |
Logical scalar indicating whether a minimum width constraint should be applied to the LOWESS smoother.
See the argument of the same name in If This argument is only used if |
min.width |
Minimum width of the window to use when If This argument is only used if |
min.window.count |
Minimum number of observations in each window.
See the argument of the same name in If This argument is only used if |
num.threads |
Integer scalar specifying the number of threads to use. If |
We compute the mean and variance for each gene and fit a trend to the variances with respect to the means using fitVarianceTrend.
We assume that most genes at any given abundance are not highly variable, such that the fitted value of the trend is interpreted as the “uninteresting” variance -
this is mostly attributed to technical variation like sequencing noise, but can also represent constitutive biological noise like transcriptional bursting.
Under this assumption, the residual can be treated as a measure of biologically interesting variation.
Genes with large residuals can then be selected for downstream analyses, e.g., with chooseHighlyVariableGenes.
A list containing statistics, a DataFrame with number of rows equal to the number of genes.
This contains the columns means, variances, fitted and residuals,
each of which is a numeric vector containing the statistic of the same name across all genes.
If block is supplied, each of the column vectors described above contains the average across all blocks.
The list will also contain per.block, a list of DataFrames containing the equivalent statistics for each block;
and block.ids, a vector containing the identities of the unique blocks in the same order as per.block.
If block is supplied and block.average.policy = "none", the statistics DataFrame will have no columns.
If fit.trend = FALSE, the fitted and residuals columns will be omitted from all DataFrames.
Aaron Lun
The model_gene_variances function in https://libscran.github.io/scran_variances/.
chooseRnaHvgs.se, which computes the variances and trend from a SummarizedExperiment.
library(Matrix) x <- abs(rsparsematrix(1000, 100, 0.1) * 10) out <- modelGeneVariances(x) out # Throwing in some blocking. block <- sample(letters[1:4], ncol(x), replace=TRUE) out <- modelGeneVariances(x, block=block) outlibrary(Matrix) x <- abs(rsparsematrix(1000, 100, 0.1) * 10) out <- modelGeneVariances(x) out # Throwing in some blocking. block <- sample(letters[1:4], ncol(x), replace=TRUE) out <- modelGeneVariances(x, block=block) out
modelGeneVariances
Default parameters from the underlying C++ library.
These may be overridden by defaults in the modelGeneVariances function signature.
modelGeneVariancesDefaults()modelGeneVariancesDefaults()
Named list of default values for various function arguments.
Aaron Lun
modelGeneVariancesDefaults()modelGeneVariancesDefaults()
Compute (log-)normalized expression values after performing scaling normalization of an ADT count matrix.
This calls computeClrm1Factors on an assay of a SingleCellExperiment,
centering the subsequent size factors with centerSizeFactors,
and then computing normalized log-expression values with normalizeCounts.
normalizeAdtCounts.se( x, size.factors = NULL, num.threads = 1, center = TRUE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor" )normalizeAdtCounts.se( x, size.factors = NULL, num.threads = 1, center = TRUE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor" )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells. |
size.factors |
Numeric vector of length equal to the number of columns of |
num.threads |
Number of threads, passed to |
center |
Logical scalar indicating whether to center the |
block |
Block assignments for each cell, passed to |
mode |
How to center size factors in different blocks, see |
log |
Whether to log-transform the normalized expression values, see |
pseudo.count |
The pseudo-count for log-transformation, see |
more.norm.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.name |
String containing the name of the assay to store the normalized matrix. |
factor.name |
String containing the name of the |
x is returned with a new assay containing the (log-)normalized matrix.
Size factors are also stored in the colData.
Aaron Lun
library(SingleCellExperiment) sce <- altExp(getTestAdtData.se("qc"), "ADT") sce <- normalizeAdtCounts.se(sce) assayNames(sce) summary(sizeFactors(sce))library(SingleCellExperiment) sce <- altExp(getTestAdtData.se("qc"), "ADT") sce <- normalizeAdtCounts.se(sce) assayNames(sce) summary(sizeFactors(sce))
Apply scaling normalization and log-transformation to a count matrix. This yields normalized expression values that can be used in downstream procedures like PCA.
normalizeCounts( x, size.factors, log = NULL, pseudo.count = NULL, log.base = NULL, preserve.sparsity = NULL, delayed = TRUE )normalizeCounts( x, size.factors, log = NULL, pseudo.count = NULL, log.base = NULL, preserve.sparsity = NULL, delayed = TRUE )
x |
A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells.
Values are expected to be non-negative counts.
Alternatively, an external pointer created by |
size.factors |
A numeric vector of length equal to the number of cells in |
log |
Logical scalar indicating whether log-transformation should be performed. This ensures that downstream analyses like t-tests and distance calculations focus on relative fold-changes rather than absolute differences. The log-transformation also provides some measure of variance stabilization so that the downstream analyses are not dominated by sampling noise at large counts. If |
pseudo.count |
Numeric scalar specifying the positive pseudo-count to add before any log-transformation.
Larger values shrink the differences between cells towards zero, reducing spurious differences (but also signal) at low counts - see If This argument is ignored if |
log.base |
Numeric scalar specifying the base of the log-transformation. This is typically set to 2 or 10 for convenient interpretation of the log-values. If This argument is ignored if |
preserve.sparsity |
Logical scalar indicating whether to preserve sparsity for If This argument is ignored if |
delayed |
Logical scalar indicating whether operations on a matrix-like |
A normalized expression matrix in varying forms:
If x is a matrix-like object, a matrix-like object is returned containing the (log-transformed) normalized expression matrix.
If delayed=TRUE, the returned object is a DelayedArray, otherwise the return type depends on the type of x and the operations involved.
If x is an external pointer produced by initializeCpp, a new external pointer is returned containing the normalized expression matrix.
Aaron Lun
The normalize_counts function in https://libscran.github.io/scran_norm/.
normalizeRnaCounts.se and related functions, which compute normalized expression values from a SummarizedExperiment.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) sf <- centerSizeFactors(colSums(x)) normed <- normalizeCounts(x, size.factors=sf) normed # Passing a pointer. ptr <- beachmat::initializeCpp(x) optr <- normalizeCounts(ptr, sf) optr# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) sf <- centerSizeFactors(colSums(x)) normed <- normalizeCounts(x, size.factors=sf) normed # Passing a pointer. ptr <- beachmat::initializeCpp(x) optr <- normalizeCounts(ptr, sf) optr
normalizeCounts
Default parameters from the underlying C++ library.
These may be overridden by defaults in the normalizeCounts function signature.
normalizeCountsDefaults()normalizeCountsDefaults()
Named list of default values for various function arguments.
Aaron Lun
normalizeCountsDefaults()normalizeCountsDefaults()
Compute (log-)normalized expression values after performing scaling normalization of an CRISPR count matrix.
This calls normalizeCounts on an assay of a SummarizedExperiment,
after centering the size factors with centerSizeFactors.
normalizeCrisprCounts.se( x, size.factors = NULL, center = TRUE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor", num.threads = 1 )normalizeCrisprCounts.se( x, size.factors = NULL, center = TRUE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor", num.threads = 1 )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to CRISPR guides and columns correspond to cells. |
size.factors |
Numeric vector of length equal to the number of columns of |
center |
Logical scalar indicating whether to center the |
block |
Block assignments for each cell, passed to |
mode |
How to center size factors in different blocks, see |
log |
Whether to log-transform the normalized expression values, see |
pseudo.count |
The pseudo-count for log-transformation, see |
more.norm.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.name |
String containing the name of the assay to store the normalized matrix. |
factor.name |
String containing the name of the |
num.threads |
Integer specifying the number of threads for computing column sums.
Only used if |
x is returned with a new assay containing the (log-)normalized matrix.
Size factors are also stored in the colData.
Aaron Lun
library(SingleCellExperiment) sce <- altExp(getTestCrisprData.se("qc"), "CRISPR Guide Capture") sce <- normalizeCrisprCounts.se(sce, size.factors=sce$sum) assayNames(sce) summary(sizeFactors(sce))library(SingleCellExperiment) sce <- altExp(getTestCrisprData.se("qc"), "CRISPR Guide Capture") sce <- normalizeCrisprCounts.se(sce, size.factors=sce$sum) assayNames(sce) summary(sizeFactors(sce))
Compute (log-)normalized expression values for a RNA count matrix after scaling normalization.
This calls normalizeCounts on an assay of a SummarizedExperiment,
after centering the size factors with centerSizeFactors.
normalizeRnaCounts.se( x, size.factors = NULL, center = TRUE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor", num.threads = 1 )normalizeRnaCounts.se( x, size.factors = NULL, center = TRUE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor", num.threads = 1 )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells. |
size.factors |
Numeric vector of length equal to the number of columns of |
center |
Logical scalar indicating whether to center the |
block |
Block assignments for each cell, passed to |
mode |
How to center size factors in different blocks, see |
log |
Whether to log-transform the normalized expression values, see |
pseudo.count |
The pseudo-count for log-transformation, see |
more.norm.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.name |
String containing the name of the assay to store the normalized matrix. |
factor.name |
String containing the name of the |
num.threads |
Integer specifying the number of threads for computing column sums.
Only used if |
x is returned with a new assay containing the (log-)normalized matrix.
Size factors are also stored in the colData.
Aaron Lun
normalizeRnaCountsWithSpikeIns.se, to normalize RNA data with spike-in transcripts.
library(SingleCellExperiment) sce <- getTestRnaData.se("qc") sce <- normalizeRnaCounts.se(sce, size.factors=sce$sum) assayNames(sce) summary(sizeFactors(sce))library(SingleCellExperiment) sce <- getTestRnaData.se("qc") sce <- normalizeRnaCounts.se(sce, size.factors=sce$sum) assayNames(sce) summary(sizeFactors(sce))
Compute (log-)normalized expression values for endogenous genes and spike-in transcripts after scaling normalization.
All sets of size factors are centered with centerSpikeInFactors prior to calling normalizeCounts on the relevant assays.
normalizeRnaCountsWithSpikeIns.se( x, spike.altexps, endogenous.factors = NULL, spike.factors = NULL, use.spike.ins.for.endogenous = FALSE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor", num.threads = 1 )normalizeRnaCountsWithSpikeIns.se( x, spike.altexps, endogenous.factors = NULL, spike.factors = NULL, use.spike.ins.for.endogenous = FALSE, block = NULL, mode = "lowest", log = TRUE, pseudo.count = 1, more.norm.args = list(), assay.type = "counts", output.name = "logcounts", factor.name = "sizeFactor", num.threads = 1 )
x |
A SingleCellExperiment or one of its subclasses.
The main experiment should contain count data for endogenous genes, where rows correspond to genes and columns correspond to cells.
There should also be one or more alternative experiments containing spike-in data, see |
spike.altexps |
Alternative experiments containing spike-in data.
This should be an unnamed integer or character vector containing the names/indices of the alternative experiments of interest.
The assay to use from each alternative experiment is determined by Alternatively, |
endogenous.factors |
Numeric vector of length equal to |
spike.factors |
Named list of numeric vectors containing the size factors for each spike-in set.
Each entry corresponds to a spike-in set ad be named after its alternative experiment in |
use.spike.ins.for.endogenous |
Boolean indicating whether to use spike-in factors for normalization of endogenous genes. By default, each set of spike-in factors is only used on its corresponding count matrix, while the endogenous factors are used on the endogenous count matrix. If this is set to Alternatively, this may be a string specifying the name of the spike-in set from which to obtain size factors for normalization of the endogenous counts.
This should refer to one of the alternative experiments in |
block |
Block assignments for each cell, passed to |
mode |
How to center size factors in different blocks, see |
log |
Whether to log-transform the normalized expression values, see |
pseudo.count |
The pseudo-count for log-transformation, see |
more.norm.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.name |
String containing the name of the assay to store the normalized matrix. |
factor.name |
String containing the name of the |
num.threads |
Integer specifying the number of threads for computing column sums.
Only used if precomputed factors are not available in |
x is returned with new assays containing (log-)normalized matrices for endogenous genes and spike-in transcripts.
Size factors are also stored in the colData for each experiment.
Aaron Lun
normalizeRnaCounts.se, for simpler normalization when spike-ins are not present.
library(SingleCellExperiment) sce <- getTestRnaData.se("qc") sce <- normalizeRnaCountsWithSpikeIns.se(sce, "ERCC") summary(sce$sizeFactor) assayNames(sce) summary(altExp(sce, "ERCC")$sizeFactor) assayNames(altExp(sce, "ERCC"))library(SingleCellExperiment) sce <- getTestRnaData.se("qc") sce <- normalizeRnaCountsWithSpikeIns.se(sce, "ERCC") summary(sce$sizeFactor) assayNames(sce) summary(altExp(sce, "ERCC")$sizeFactor) assayNames(altExp(sce, "ERCC"))
Quickly compute quality control (QC) metrics, thresholds and filters from ADT data in a SummarizedExperiment.
This calls computeAdtQcMetrics on an assay in a SummarizedExperiment,
followed by suggestAdtQcThresholds and filterAdtQcMetrics to identify high-quality cells.
quickAdtQc.se( x, subsets, num.threads = 1, thresholds = NULL, block = NULL, more.suggest.args = list(), assay.type = "counts", output.prefix = NULL, meta.name = "qc", flatten = TRUE ) formatComputeAdtQcMetricsResult(compute.res, flatten = TRUE)quickAdtQc.se( x, subsets, num.threads = 1, thresholds = NULL, block = NULL, more.suggest.args = list(), assay.type = "counts", output.prefix = NULL, meta.name = "qc", flatten = TRUE ) formatComputeAdtQcMetricsResult(compute.res, flatten = TRUE)
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells. |
subsets |
List of subsets of control tags, see |
num.threads |
Number of threads, to pass to |
thresholds |
List containing pre-defined thresholds for each QC metric,
see the return value of |
block |
Block assignment for each cell, to pass to |
more.suggest.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.prefix |
String containing a prefix to add to the names of the |
meta.name |
String containing the name of the |
flatten |
Logical scalar indicating whether to flatten the subset proportions into separate columns of the |
compute.res |
DataFrame returned by |
For quickAdtQc.se, x is returned with additional columns added to its colData.
Each column contains per-cell values for one of the QC metrics, see computeAdtQcMetrics for details.
The suggested thresholds are stored as a list in metadata.
The colData also contains a keep column, specifying which cells are to be retained.
For formatComputeAdtQcMetricsResult, a DataFrame is returned with the per-cell QC metrics.
Aaron Lun
library(SingleCellExperiment) sce <- altExp(getTestAdtData.se(), "ADT") sce <- quickAdtQc.se(sce, subsets=list(igg=grepl("IgG", rownames(sce)))) colData(sce)[,c("sum", "detected", "subset.sum.igg")] metadata(sce)$qc$thresholds summary(sce$keep)library(SingleCellExperiment) sce <- altExp(getTestAdtData.se(), "ADT") sce <- quickAdtQc.se(sce, subsets=list(igg=grepl("IgG", rownames(sce)))) colData(sce)[,c("sum", "detected", "subset.sum.igg")] metadata(sce)$qc$thresholds summary(sce$keep)
Quickly compute quality control (QC) metrics, thresholds and filters from CRISPR data in a SummarizedExperiment.
This calls computeCrisprQcMetrics on an assay in a SummarizedExperiment,
followed by suggestCrisprQcThresholds and filterCrisprQcMetrics to identify high-quality cells.
quickCrisprQc.se( x, num.threads = 1, thresholds = NULL, block = NULL, more.suggest.args = list(), assay.type = "counts", output.prefix = NULL, meta.name = "qc" ) formatComputeCrisprQcMetricsResult(compute.res)quickCrisprQc.se( x, num.threads = 1, thresholds = NULL, block = NULL, more.suggest.args = list(), assay.type = "counts", output.prefix = NULL, meta.name = "qc" ) formatComputeCrisprQcMetricsResult(compute.res)
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to CRISPR guides and columns correspond to cells. |
num.threads |
Number of threads, to pass to |
thresholds |
List containing pre-defined thresholds for each QC metric,
see the return value of |
block |
Block assignment for each cell, to pass to |
more.suggest.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.prefix |
String containing a prefix to add to the names of the |
meta.name |
String containing the name of the |
compute.res |
DataFrame returned by |
For quickCrisprQc.se, x is returned with additional columns added to its colData.
Each column contains per-cell values for one of the QC metrics, see computeCrisprQcMetrics for details.
The suggested thresholds are stored as a list in metadata.
The colData also contains a keep column, specifying which cells are to be retained.
For formatComputeCrisprQcMetricsResult, a DataFrame is returned with the per-cell QC metrics.
Aaron Lun
library(SingleCellExperiment) sce <- altExp(getTestCrisprData.se(), "CRISPR Guide Capture") sce <- quickCrisprQc.se(sce) colData(sce)[,c("sum", "detected", "max.value", "max.index")] metadata(sce)$qc$thresholds summary(sce$keep)library(SingleCellExperiment) sce <- altExp(getTestCrisprData.se(), "CRISPR Guide Capture") sce <- quickCrisprQc.se(sce) colData(sce)[,c("sum", "detected", "max.value", "max.index")] metadata(sce)$qc$thresholds summary(sce$keep)
Quickly compute quality control (QC) metrics, thresholds and filters from RNA data in a SummarizedExperiment.
This calls computeRnaQcMetrics on an assay in a SummarizedExperiment,
followed by suggestRnaQcThresholds and filterRnaQcMetrics to identify high-quality cells.
quickRnaQc.se( x, subsets, num.threads = 1, thresholds = NULL, block = NULL, more.suggest.args = list(), altexp.proportions = NULL, assay.type = "counts", output.prefix = NULL, meta.name = "qc", flatten = TRUE ) computeRnaQcMetricsWithAltExps( x, subsets, altexp.proportions, num.threads = 1, assay.type = "counts" ) formatComputeRnaQcMetricsResult(compute.res, flatten = TRUE)quickRnaQc.se( x, subsets, num.threads = 1, thresholds = NULL, block = NULL, more.suggest.args = list(), altexp.proportions = NULL, assay.type = "counts", output.prefix = NULL, meta.name = "qc", flatten = TRUE ) computeRnaQcMetricsWithAltExps( x, subsets, altexp.proportions, num.threads = 1, assay.type = "counts" ) formatComputeRnaQcMetricsResult(compute.res, flatten = TRUE)
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells. |
subsets |
List of subsets of control genes, see |
num.threads |
Number of threads, to pass to |
thresholds |
List containing pre-defined thresholds for each QC metric,
see the return value of |
block |
Block assignment for each cell, to pass to |
more.suggest.args |
Named list of additional arguments to pass to |
altexp.proportions |
Alternative experiments for which to compute QC metrics.
This is typically used to refer to alternative experiments holding spike-in data.
For each alternative experiment, the proportion is defined as More specifically, Alternatively, Only relevant if |
assay.type |
Integer or string specifying the assay of |
output.prefix |
String containing a prefix to add to the names of the |
meta.name |
String containing the name of the |
flatten |
Logical scalar indicating whether to flatten the subset proportions into separate columns of the |
compute.res |
DataFrame returned by |
For quickRnaQc.se, x is returned with additional columns added to its colData.
Each column contains per-cell values for one of the QC metrics, see computeRnaQcMetrics for details.
The suggested thresholds are stored as a list in metadata.
The colData also contains a keep column, specifying which cells are to be retained.
If altexp.proportions is provided, QC metrics are added to the colData of the specified alternative experiments in the output object.
For computeRnaQcMetricsWithAltExps, a list is returned containing:
main, the result of calling computeRnaQcMetrics on the RNA count matrix in x.
The proportion of counts in each alternative experiment is added to the subsets.
altexp, a named list of length equal to altexp.proportions.
Each inner list is the result of calling computeRnaQcMetrics on the RNA count matrix of the corresponding alternative experiment of x.
For formatComputeRnaQcMetricsResult, a DataFrame is returned containing the per-cell QC metrics.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se() sce <- quickRnaQc.se(sce, subsets=list(mito=grepl("^mt", rownames(sce)))) colData(sce)[,c("sum", "detected", "subset.proportion.mito")] metadata(sce)$qc$thresholds summary(sce$keep) # Computing spike-in proportions, if available. sce <- getTestRnaData.se() sce <- quickRnaQc.se( sce, subsets=list(mito=grepl("^mt", rownames(sce))), altexp.proportions="ERCC" ) colData(sce)[,c("sum", "detected", "subset.proportion.mito", "subset.proportion.ERCC")] colData(altExp(sce, "ERCC"))[,c("sum", "detected")]library(SingleCellExperiment) sce <- getTestRnaData.se() sce <- quickRnaQc.se(sce, subsets=list(mito=grepl("^mt", rownames(sce)))) colData(sce)[,c("sum", "detected", "subset.proportion.mito")] metadata(sce)$qc$thresholds summary(sce$keep) # Computing spike-in proportions, if available. sce <- getTestRnaData.se() sce <- quickRnaQc.se( sce, subsets=list(mito=grepl("^mt", rownames(sce))), altexp.proportions="ERCC" ) colData(sce)[,c("sum", "detected", "subset.proportion.mito", "subset.proportion.ERCC")] colData(altExp(sce, "ERCC"))[,c("sum", "detected")]
Combine all marker statistics for a single group into a data frame for easy inspection. Users can pick one of the columns for sorting potential marker genes.
reportGroupMarkerStatistics( results, group, effect.sizes = NULL, summaries = NULL, include.mean = TRUE, include.detected = TRUE )reportGroupMarkerStatistics( results, group, effect.sizes = NULL, summaries = NULL, include.mean = TRUE, include.detected = TRUE )
results |
Named list of marker statistics, typically generated by |
group |
String or integer specifying the group of interest. |
effect.sizes |
Character vector specifying the effect sizes of interest.
If |
summaries |
Character vector specifying the summary statistics of interest.
If |
include.mean |
Boolean indicating whether the mean expression should be reported in the returned data frame. |
include.detected |
Boolean indicating whether the proportion of detected cells should be reported in the returned data frame. |
Data frame where each row corresponds to a gene.
Each column contains the requested statistics for group.
Effect size summary columns are named as <EFFECT>.<SUMMARY>.
Aaron Lun
scoreMarkers, to generate results.
summarizeEffects, for the trade-offs between effect size summaries.
Compute per-cell QC metrics from an initialized matrix of RNA counts, and use the metrics to suggest filter thresholds to retain high-quality cells.
computeRnaQcMetrics(x, subsets, num.threads = NULL) suggestRnaQcThresholds( metrics, block = NULL, num.mads = NULL, sum.num.mads = num.mads, detected.num.mads = num.mads, subset.proportion.num.mads = num.mads ) filterRnaQcMetrics(thresholds, metrics, block = NULL)computeRnaQcMetrics(x, subsets, num.threads = NULL) suggestRnaQcThresholds( metrics, block = NULL, num.mads = NULL, sum.num.mads = num.mads, detected.num.mads = num.mads, subset.proportion.num.mads = num.mads ) filterRnaQcMetrics(thresholds, metrics, block = NULL)
x |
A matrix-like object where rows are genes and columns are cells. Values are expected to be counts. |
subsets |
Named list of vectors specifying gene subsets of interest, typically for control-like features like mitochondrial genes or spike-in transcripts.
Each vector may be logical (whether to keep each row), integer (row indices) or character (row names).
For character vectors, strings not present in |
num.threads |
Integer scalar specifying the number of threads to use. If |
metrics |
DataFrame of per-cell QC metrics.
This should have the same structure as the return value of |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in For |
num.mads |
Number of median absolute deviations (MADs) from the median to define the threshold for outliers in each metric. |
sum.num.mads |
Number of MADS from the median to define the threshold for outliers in the total sum of counts. If |
detected.num.mads |
Number of MADS from the median to define the threshold for outliers in the number of detected genes. If |
subset.proportion.num.mads |
Number of MADS from the median to define the threshold for outliers in the subset proportions. If |
thresholds |
List with the same structure as produced by |
For computeRnaQcMetrics, a DataFrame is returned with one row per cell in x.
This contains the following columns:
sum, a numeric vector containing the total RNA count for each cell.
This represents the efficiency of library preparation and sequencing.
Low totals indicate that the library was not successfully captured.
detected, an integer vector containing the number of detected genes per cell.
This also quantifies library preparation efficiency but with greater focus on capturing transcriptional complexity.
subsets, a nested DataFrame where each column corresponds to a feature subset and is a numeric vector containing the proportion of counts in that subset.
The exact interpretation of which depends on the nature of the subset.
For example, if one subset contains all genes on the mitochondrial chromosome, higher proportions are representative of cell damage;
the assumption is that cytoplasmic transcripts leak through tears in the cell membrane while the mitochondria are still trapped inside.
The proportion of spike-in transcripts can be interpreted in a similar manner, where the loss of endogenous transcripts results in higher spike-in proportions.
Each vector is of length equal to the number of cells.
For suggestRnaQcThresholds, a named list is returned.
If block=NULL, the list contains:
sum, a numeric scalar containing the lower bound on the sum.
This is defined as num.mads MADs below the median of the log-transformed metrics across all cells.
detected, a numeric scalar containing the lower bound on the number of detected genes.
This is defined as num.mads MADs below the median of the log-transformed metrics across all cells.
subsets, a numeric vector containing the upper bound on the sum of counts in each feature subset.
This is defined as num.mads MADs above the median across all cells.
Otherwise, if block is supplied, the list contains:
sum, a numeric vector containing the lower bound on the sum for each blocking level.
Here, the threshold is computed independently for each block, using the same method as the unblocked case.
detected, a numeric vector containing the lower bound on the number of detected genes for each blocking level.
Here, the threshold is computed independently for each block, using the same method as the unblocked case.
subsets, a list of numeric vectors containing the upper bound on the sum of counts in each feature subset for each blocking level.
Here, the threshold is computed independently for each block, using the same method as the unblocked case.
block.ids, a vector containing the identities of the unique blocks.
Each vector is of length equal to the number of levels in block and is named accordingly.
For filterRnaQcMetrics, a logical vector of length ncol(x) is returned indicating which cells are of high quality.
High-quality cells are defined as those with sums and detected genes above their respective thresholds and subset proportions below the subsets threshold.
Aaron Lun
The compute_rna_qc_metrics, compute_rna_qc_filters and compute_rna_qc_filters_blocked functions in https://libscran.github.io/scran_qc/.
quickRnaQc.se, to run all of the RNA-related QC functions on a SummarizedExperiment.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Mocking up a control set. sub <- list(mito=rbinom(nrow(x), 1, 0.1) > 0) qc <- computeRnaQcMetrics(x, sub) qc filt <- suggestRnaQcThresholds(qc) str(filt) keep <- filterRnaQcMetrics(filt, qc) summary(keep)# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) # Mocking up a control set. sub <- list(mito=rbinom(nrow(x), 1, 0.1) > 0) qc <- computeRnaQcMetrics(x, sub) qc filt <- suggestRnaQcThresholds(qc) str(filt) keep <- filterRnaQcMetrics(filt, qc) summary(keep)
Default parameters from the underlying C++ library. These may be overridden by defaults in each function's signature.
computeRnaQcMetricsDefaults() suggestRnaQcThresholdsDefaults()computeRnaQcMetricsDefaults() suggestRnaQcThresholdsDefaults()
Named list containing default values for the various function arguments.
Aaron Lun
computeRnaQcMetricsDefaults() suggestRnaQcThresholdsDefaults()computeRnaQcMetricsDefaults() suggestRnaQcThresholdsDefaults()
Run all steps that require a nearest-neighbor search.
This includs runUmap, runTsne and buildSnnGraph with clusterGraph.
The idea is to build the index once, perform the neighbor search, and run each task in parallel to save time.
runAllNeighborSteps( x, runUmap.args = list(), runTsne.args = list(), buildSnnGraph.args = list(), clusterGraph.args = list(), BNPARAM = AnnoyParam(), return.graph = FALSE, collapse.search = TRUE, num.threads = 3 )runAllNeighborSteps( x, runUmap.args = list(), runTsne.args = list(), buildSnnGraph.args = list(), clusterGraph.args = list(), BNPARAM = AnnoyParam(), return.graph = FALSE, collapse.search = TRUE, num.threads = 3 )
x |
Numeric matrix where rows are dimensions and columns are cells,
typically containing a low-dimensional representation from, e.g., Alternatively, an index constructed by |
runUmap.args |
Named list of further arguments to pass to |
runTsne.args |
Named list of further arguments to pass to |
buildSnnGraph.args |
Named list of further arguments to pass to |
clusterGraph.args |
Named list of further arguments to pass to |
BNPARAM |
A BiocNeighborParam instance specifying the nearest-neighbor search algorithm to use. |
return.graph |
Logical scalar indicating whether to return the output of |
collapse.search |
Logical scalar indicating whether to collapse the nearest-neighbor search for each step into a single search.
Steps that need fewer neighbors will take a subset of the neighbors from the collapsed search.
Setting this to |
num.threads |
Integer scalar specifying the number of threads to use. At least one thread should be available for each step. |
A named list containing the results of each step. See each individual function for the format of the results.
Aaron Lun
runAllNeighborSteps.se, to run each neighbor-related step on a SingleCellExperiment.
x <- t(as.matrix(iris[,1:4])) # (Turning down the number of threads so that R CMD check is happy.) res <- runAllNeighborSteps(x, num.threads=2) str(res)x <- t(as.matrix(iris[,1:4])) # (Turning down the number of threads so that R CMD check is happy.) res <- runAllNeighborSteps(x, num.threads=2) str(res)
Concurrently run all steps involving a nearest-neighbor search (t-SNE, UMAP and graph-based clustering) using the same nearest-neighbor index,
by calling runAllNeighborSteps on a reduced dimension entry of a SingleCellExperiment.
runAllNeighborSteps.se( x, umap.output.name = "UMAP", umap.dim.prefix = "UMAP", more.umap.args = list(), tsne.output.name = "TSNE", tsne.dim.prefix = "TSNE", more.tsne.args = list(), build.graph.name = NULL, more.build.graph.args = list(), cluster.output.name = "clusters", cluster.meta.name = NULL, more.cluster.graph.args = list(), BNPARAM = AnnoyParam(), num.threads = 3, more.neighbor.args = list(), reddim.type = "PCA" )runAllNeighborSteps.se( x, umap.output.name = "UMAP", umap.dim.prefix = "UMAP", more.umap.args = list(), tsne.output.name = "TSNE", tsne.dim.prefix = "TSNE", more.tsne.args = list(), build.graph.name = NULL, more.build.graph.args = list(), cluster.output.name = "clusters", cluster.meta.name = NULL, more.cluster.graph.args = list(), BNPARAM = AnnoyParam(), num.threads = 3, more.neighbor.args = list(), reddim.type = "PCA" )
x |
A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
umap.output.name |
String containing the name of the |
umap.dim.prefix |
String containing a prefix for the column names of the UMAP coordinate matrix,
see the |
more.umap.args |
Named list of additional arguments to pass to |
tsne.output.name |
String containing the name of the |
tsne.dim.prefix |
String containing a prefix for the column names of the t-SNE coordinate matrix,
see the |
more.tsne.args |
Named list of additional arguments to pass to |
build.graph.name |
String containing the name of the |
more.build.graph.args |
Named list of additional arguments to pass to |
cluster.output.name |
String containing the name of the |
cluster.meta.name |
String containing the name of the |
more.cluster.graph.args |
Named list of additional arguments to pass to |
BNPARAM, num.threads
|
Arguments to pass to |
more.neighbor.args |
Named list of additional arguments to pass to |
reddim.type |
String or integer specifying the |
x is returned with additional coordinates stored in its reducedDims
and clustering output in its colData.
Additional information may also be stored in its metadata.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("pca") sce <- runAllNeighborSteps.se( sce, more.tsne.args=list(max.iterations=50), more.umap.args=list(num.epochs=50), num.threads=2 # to keep R CMD check happy ) reducedDimNames(sce) table(sce$clusters)library(SingleCellExperiment) sce <- getTestRnaData.se("pca") sce <- runAllNeighborSteps.se( sce, more.tsne.args=list(max.iterations=50), more.umap.args=list(num.epochs=50), num.threads=2 # to keep R CMD check happy ) reducedDimNames(sce) table(sce$clusters)
Run a PCA on the gene-by-cell log-expression matrix and extract the top principal components (PCs). This yields a low-dimensional representation that reduces noise and compute time in downstream analyses. For efficiency, the PCA itself is approximated using IRLBA.
runPca( x, number = NULL, scale = NULL, block = NULL, block.weight.policy = NULL, variable.block.weight = NULL, components.from.residuals = NULL, center.scores.by.block = components.from.residuals, subset = NULL, extra.work = NULL, iterations = NULL, tolerance = NULL, seed = NULL, realized = NULL, warn = TRUE, num.threads = NULL )runPca( x, number = NULL, scale = NULL, block = NULL, block.weight.policy = NULL, variable.block.weight = NULL, components.from.residuals = NULL, center.scores.by.block = components.from.residuals, subset = NULL, extra.work = NULL, iterations = NULL, tolerance = NULL, seed = NULL, realized = NULL, warn = TRUE, num.threads = NULL )
x |
A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells.
Typically, the matrix is expected to contain log-expression values (see |
number |
Integer scalar specifying the number of top PCs to retain.
More PCs will capture more biological signal at the cost of increasing noise and compute time.
If this is greater than the maximum number of PCs (i.e., the smaller dimension of If |
scale |
Logical scalar indicating whether to scale all genes to have the same variance.
This ensures that each gene contributes equally to the PCA, favoring consistent variation across many genes rather than large variation in a few genes.
If If |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in |
block.weight.policy |
String specifying the policy to use for weighting the contribution of different blocks to the PCA.
See the argument of the same name in If This argument is only used if |
variable.block.weight |
Numeric vector of length 2, specifying the parameters for variable block weighting.
See the argument of the same name in If This argument is only used if |
components.from.residuals |
Deprecated, use |
center.scores.by.block |
Boolean indicating whether to center the PC scores within each block, see Details. If This argument is only used if |
subset |
Integer, logical or character vector specifying the rows of |
extra.work |
Integer scalar specifying the number of extra dimensions for the IRLBA workspace. Larger values improve accuracy at the cost of compute time. If |
iterations |
Integer scalar specifying the maximum number of restart iterations for IRLBA. Larger values improve accuracy at the cost of compute time. If |
tolerance |
Number specifying the tolerance on the approximation error of the singular triplets, to determine IRLBA convergence. Lower values improve accuracy at the cost of compute time. If |
seed |
Integer scalar specifying the seed for the initial random vector in IRLBA. If |
realized |
Logical scalar indicating whether to realize If |
warn |
Boolean specifying whether a warning should be emitted if IRLBA failed to converge. |
num.threads |
Number of threads to use. If |
When block is specified, the interpretation of the reported PC scores in components depends on the choice of center.scores.by.block:
If FALSE (the default), the centroid of all cells is shifted to the origin.
This does not attempt to explicitly remove any differences between blocks.
Any differences in expression that are not orthogonal to the rotation vectors will still manifest in the PC scores.
In this mode, blocking only reduces the impact of inter-block differences on the identification of the rotation vectors.
If TRUE, the scores are centered at the origin within each block.
This yields a low-dimensional space where inter-block differences have been removed,
assuming that all blocks have the same subpopulation composition and the inter-block differences are consistent for all cell subpopulations.
Under these assumptions, we could use these components for downstream analysis without any concern for block-wise effects.
In complex datasets, the assumptions mentioned for TRUE not hold and more sophisticated batch correction methods like MNN correction are required.
Functions like correctMnn will accept a low-dimensional embedding of cells that can be created as described above with FALSE.
List containing:
components, a matrix of PC scores.
Rows are dimensions (i.e., PCs) and columns are cells.
rotation, the rotation matrix.
Rows are genes and columns are dimensions.
variance.explained, the vector of variances explained by each PC.
total.variance, the total variance in the dataset.
This can be used to divide variance.explained to obtain the proportion of variance explained by each PC.
center, a numeric vector containing the mean for each gene.
If block is provided, this is instead a matrix containing the mean for each gene (column) in each block (row).
block.ids, a vector containing the identities of the unique blocks in the same order as the rows of center.
Only reported if block is provided.
scale, a numeric vector containing the scaling for each gene.
Only reported if scale=TRUE.
metrics, a named list containing IRLBA metrics.
This includes converged, whether the algorithm converged successfully;
iterations, the number of IRLBA restart iterations;
and multiplications, the number of matrix multiplications.
Aaron Lun
The simple_pca and blocked_pca functions for https://libscran.github.io/scran_pca/.
runPca.se, to run a PCA on a SummarizedExperiment.
library(Matrix) x <- abs(rsparsematrix(1000, 100, 0.1) * 10) y <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) # A simple PCA: out <- runPca(y) str(out) # Blocking on uninteresting factors: block <- sample(LETTERS[1:3], ncol(y), replace=TRUE) bout <- runPca(y, block=block) str(bout)library(Matrix) x <- abs(rsparsematrix(1000, 100, 0.1) * 10) y <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) # A simple PCA: out <- runPca(y) str(out) # Blocking on uninteresting factors: block <- sample(LETTERS[1:3], ncol(y), replace=TRUE) bout <- runPca(y, block=block) str(bout)
Compact and denoise the dataset by performing PCA on the (log-)normalized expression matrix,
by calling runPca on an assay of a SummarizedExperiment.
runPca.se( x, features, number = 25, block = NULL, num.threads = 1, more.pca.args = list(), assay.type = "logcounts", output.name = "PCA", meta.name = "PCA", dim.prefix = "PC", delayed.transpose = FALSE )runPca.se( x, features, number = 25, block = NULL, num.threads = 1, more.pca.args = list(), assay.type = "logcounts", output.name = "PCA", meta.name = "PCA", dim.prefix = "PC", delayed.transpose = FALSE )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
features |
Integer, logical or character vector containing the features of interest to use in the PCA.
For RNA data, this is typically the |
number |
Number of PCs to retain, passed to |
block |
Block assignment for each cell, passed to |
num.threads |
Number of threads for the PCA, passed to |
more.pca.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay of |
output.name |
String containing the name of the |
meta.name |
String containing the name of the |
dim.prefix |
String containing a prefix for the column names of the score and rotation matrices.
Each column is named as |
delayed.transpose |
Logical scalar indicating whether to delay the transposition when storing coordinates in the |
x is returned with the principal component scores in the reducedDim.
(This is converted to a SingleCellExperiment if it wasn't one already.)
Additional outputs (e.g., rotation matrix, variance explained) are stored in the metadata.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("hvg") sce <- runPca.se(sce, rowData(sce)$hvg) dim(reducedDim(sce, "PCA")) plot(metadata(sce)$PCA$variance.explained / metadata(sce)$PCA$total.variance)library(SingleCellExperiment) sce <- getTestRnaData.se("hvg") sce <- runPca.se(sce, rowData(sce)$hvg) dim(reducedDim(sce, "PCA")) plot(metadata(sce)$PCA$variance.explained / metadata(sce)$PCA$total.variance)
runPca
Default parameters from the underlying C++ library.
These may be overridden by defaults in the runPca function signature.
runPcaDefaults(block = NULL, subset = NULL)runPcaDefaults(block = NULL, subset = NULL)
block |
See the argument of the same name in |
subset |
See the argument of the same name in |
Named list containing default values for various function arguments.
These defaults may change depending on the values for subset and block.
Aaron Lun
runPcaDefaults()runPcaDefaults()
Compute t-SNE coordinates to visualize similarities between cells.
runTsne( x, perplexity = NULL, num.neighbors = NULL, theta = NULL, early.exaggeration.iterations = NULL, exaggeration.factor = NULL, momentum.switch.iterations = NULL, start.momentum = NULL, final.momentum = NULL, eta = NULL, max.depth = 7, leaf.approximation = NULL, max.iterations = NULL, seed = 42, num.threads = NULL, BNPARAM = AnnoyParam() ) tsnePerplexityToNeighbors(perplexity)runTsne( x, perplexity = NULL, num.neighbors = NULL, theta = NULL, early.exaggeration.iterations = NULL, exaggeration.factor = NULL, momentum.switch.iterations = NULL, start.momentum = NULL, final.momentum = NULL, eta = NULL, max.depth = 7, leaf.approximation = NULL, max.iterations = NULL, seed = 42, num.threads = NULL, BNPARAM = AnnoyParam() ) tsnePerplexityToNeighbors(perplexity)
x |
Numeric matrix where rows are dimensions and columns are cells,
typically containing a low-dimensional representation from, e.g., Alternatively, a named list of nearest-neighbor search results like that returned by Alternatively, an index constructed by |
perplexity |
Numeric scalar specifying the perplexity to use in the t-SNE algorithm. Higher perplexities will focus on global structure, at the cost of increased runtime and decreased local resolution. If |
num.neighbors |
Deprecated and ignored.
The number of neighbors is now computed from |
theta |
Number specifying the approximation level for the Barnes-Hut calculation of repulsive forces. Lower values increase accuracy at the cost of increased compute time. Any value should be non-negative. If |
early.exaggeration.iterations |
Integer scalar specifying the number of iterations of the early exaggeration phase,
where clusters are artificially compacted to leave more empty space so that cells can easily relocate to find a good global organization.
Larger values improve convergence within this phase at the cost of reducing the remaining iterations in If |
exaggeration.factor |
Numeric scalar containing the exaggeration factor for the early exaggeration phase (see If |
momentum.switch.iterations |
Integer scalar specifying the number of iterations to perform before switching from the starting momentum to the final momentum. Higher momentums can improve convergence by increasing the step size and smoothing over local oscillations, at the risk of potentially skipping over relevant minima. If |
start.momentum |
Numeric scalar containing the starting momentum, to be used in the iterations before the momentum switch at If |
final.momentum |
Numeric scalar containing the final momentum, to be used in the iterations after the momentum switch at If |
eta |
Numeric scalar containing the learning rate, used to scale the updates for each cell. Larger values can speed up convergence at the cost of skipping over local minima. If |
max.depth |
Integer scalar specifying the maximum depth of the Barnes-Hut quadtree. If neighboring cells cannot be separated before the maximum depth is reached, they will be assigned to the same leaf node of the quadtree. Smaller values (7-10) improve speed by bounding the recursion depth at the cost of accuracy. If |
leaf.approximation |
Logical scalar indicating whether to use the “leaf approximation”.
If If |
max.iterations |
Integer scalar specifying the maximum number of iterations to perform. Larger values improve convergence at the cost of compute time. If |
seed |
Integer scalar specifying the seed to use for generating the initial coordinates. |
num.threads |
Integer scalar specifying the number of threads to use. If |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use.
Only used if |
For runTsne, a numeric matrix where rows are cells and columns are the two dimensions of the embedding.
For tsnePerplexityToNeighbors, an integer scalar specifying the number of neighbors to use for a given perplexity.
Aaron Lun
van der Maaten LJP and Hinton GE (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research_ 9, 2579-2605.
van der Maaten LJP (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15, 3221-3245.
https://libscran.github.io/qdtsne/, for an explanation of the approximations.
runTsne.se, to run t-SNE on a SingleCellExperiment.
x <- t(as.matrix(iris[,1:4])) embedding <- runTsne(x) plot(embedding[,1], embedding[,2], col=iris[,5])x <- t(as.matrix(iris[,1:4])) embedding <- runTsne(x) plot(embedding[,1], embedding[,2], col=iris[,5])
Generate a t-SNE visualization from an existing embedding,
by calling runUmap on a reduced dimension entry in SingleCellExperiment.
runTsne.se( x, perplexity = 30, num.threads = 1, more.tsne.args = list(), reddim.type = "PCA", output.name = "TSNE", dim.prefix = "TSNE" )runTsne.se( x, perplexity = 30, num.threads = 1, more.tsne.args = list(), reddim.type = "PCA", output.name = "TSNE", dim.prefix = "TSNE" )
x |
A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
perplexity |
Perplexity to use in the t-SNE algorithm, passed to |
num.threads |
Number of threads for the neighbor search and optimization, passed to |
more.tsne.args |
Named list of further arguments to pass to |
reddim.type |
Integer or string specifying the existing embedding in the |
output.name |
String containing the name of the output |
dim.prefix |
String containing a prefix for the column names of the t-SNE coordinate matrix.
Each column is named as |
x is returned with the t-SNE coordinates stored in the reducedDim.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("pca") # Using fewer iterations for a faster-running example. sce <- runTsne.se(sce, more.tsne.args=list(max.iterations=50)) head(reducedDim(sce, "TSNE"))library(SingleCellExperiment) sce <- getTestRnaData.se("pca") # Using fewer iterations for a faster-running example. sce <- runTsne.se(sce, more.tsne.args=list(max.iterations=50)) head(reducedDim(sce, "TSNE"))
runTsne
Default parameters from the underlying C++ library.
These may be overridden by defaults in the runTsne function signature.
runTsneDefaults()runTsneDefaults()
Named list containing default values for various function arguments.
Aaron Lun
runTsneDefaults()runTsneDefaults()
Compute UMAP coordinates to visualize similarities between cells.
runUmap( x, num.dim = 2, num.neighbors = NULL, local.connectivity = NULL, bandwidth = NULL, mix.ratio = NULL, spread = NULL, min.dist = NULL, a = NULL, b = NULL, repulsion.strength = NULL, initialize.method = NULL, initial.coordinates = NULL, initialize.random.on.spectral.fail = NULL, initialize.spectral.scale = NULL, initialize.spectral.jitter = NULL, initialize.spectral.jitter.sd = NULL, initialize.random.scale = NULL, initialize.seed = NULL, num.epochs = NULL, learning.rate = NULL, negative.sample.rate = NULL, optimize.seed = NULL, num.threads = NULL, num.threads.optimize = NULL, parallel.optimization = FALSE, BNPARAM = AnnoyParam() )runUmap( x, num.dim = 2, num.neighbors = NULL, local.connectivity = NULL, bandwidth = NULL, mix.ratio = NULL, spread = NULL, min.dist = NULL, a = NULL, b = NULL, repulsion.strength = NULL, initialize.method = NULL, initial.coordinates = NULL, initialize.random.on.spectral.fail = NULL, initialize.spectral.scale = NULL, initialize.spectral.jitter = NULL, initialize.spectral.jitter.sd = NULL, initialize.random.scale = NULL, initialize.seed = NULL, num.epochs = NULL, learning.rate = NULL, negative.sample.rate = NULL, optimize.seed = NULL, num.threads = NULL, num.threads.optimize = NULL, parallel.optimization = FALSE, BNPARAM = AnnoyParam() )
x |
Numeric matrix where rows are dimensions and columns are cells,
typically containing a low-dimensional representation from, e.g., Alternatively, a named list of nearest-neighbor search results like that returned by Alternatively, an index constructed by |
num.dim |
Integer scalar specifying the number of dimensions of the output embedding. |
num.neighbors |
Integer scalar specifying the number of neighbors to use to define the fuzzy sets. Larger values improve connectivity and favor preservation of global structure, at the cost of increased compute time. If This argument is ignored if |
local.connectivity |
Numeric scalar specifying the number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure. This may be a fractional number of neighbors, in which case interpolation is performed when computing the membership confidence. If |
bandwidth |
Numeric scalar specifying the effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure. If |
mix.ratio |
Numeric scalar between 0 and 1 specifying the mixing ratio when combining fuzzy sets. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values favor connectivity and more global structure. If |
spread |
Numeric scalar specifying the scale of the coordinates of the final low-dimensional embedding. If This argument is ignored if |
min.dist |
Numeric scalar specifying the minimum distance between observations in the final low-dimensional embedding.
Smaller values will increase local clustering while larger values favor a more even distribution of observations throughout the low-dimensional space.
This is interpreted relative to If This argument is ignored if |
a |
Numeric scalar specifying the If this or |
b |
Numeric scalar specifying the If this or |
repulsion.strength |
Numeric scalar specifying the modifier for the repulsive force. Larger values increase repulsion and favor local structure. If |
initialize.method |
String specifying how to initialize the embedding. This should be one of:
If |
initial.coordinates |
Numeric matrix of initial coordinates, with number of rows equal to the number of observations and number of columns equal to This argument is required if |
initialize.random.on.spectral.fail |
Logical scalar indicating whether to fall back to random sampling (i.e., same as If This argument is only used if |
initialize.spectral.scale |
Numeric scalar specifying the maximum absolute magnitude of the coordinates after spectral initialization.
All initial coordinates are scaled such that the maximum of the absolute values is equal to If This argument is only used if |
initialize.spectral.jitter |
Logical scalar indicating whether to jitter coordinates after spectral initialization to separate duplicate observations (e.g., to avoid overplotting).
This is done using normally-distributed noise of mean zero and standard deviation of If This argument is only used if |
initialize.spectral.jitter.sd |
Numeric scalar specifying the standard deviation of the jitter to apply after spectral initialization.
Only relevant if |
initialize.random.scale |
Numeric scalar specifying the scale of the randomly generated initial coordinates.
Coordinates are sampled from a uniform distribution from If This argument is only used if |
initialize.seed |
Numeric scalar specifying the seed for the random number generation during initialization. If This argument is only used if |
num.epochs |
Integer scalar specifying the number of epochs for the gradient descent, i.e., optimization iterations. Larger values improve accuracy at the cost of increased compute time. If
|
learning.rate |
Numeric scalar specifying the initial learning rate used in the gradient descent. Larger values can accelerate convergence but at the risk of skipping over suitable local optima. If |
negative.sample.rate |
Numeric scalar specifying the rate of sampling negative observations to compute repulsive forces. Greater values will improve accuracy but increase compute time. If |
optimize.seed |
Numeric scalar specifying the seed to use for the optimization epochs. If |
num.threads |
Integer scalar specifying the number of threads to use for most UMAP steps. If |
num.threads.optimize |
Integer scalar specifying the number of threads to use during UMAP layout optimization. Parallel optimization uses spinlocks so the number of threads should not be greater than the number of available CPUs. If |
parallel.optimization |
Deprecated, use |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use.
Only used if |
A numeric matrix where rows are cells and columns are the two dimensions of the embedding.
Aaron Lun
McInnes L, Healy J, Melville J (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv, https://arxiv.org/abs/1802.03426
https://libscran.github.io/umappp/, for details on the underlying implementation.
runUmap.se, to run UMAP on a SingleCellExperiment.
x <- t(as.matrix(iris[,1:4])) embedding <- runUmap(x) plot(embedding[,1], embedding[,2], col=iris[,5])x <- t(as.matrix(iris[,1:4])) embedding <- runUmap(x) plot(embedding[,1], embedding[,2], col=iris[,5])
Generate a UMAP visualization from an existing embedding,
by calling runUmap on a reduced dimension entry in SingleCellExperiment.
runUmap.se( x, num.dim = 2, min.dist = 0.1, num.neighbors = 15, num.threads = 1, more.umap.args = list(), reddim.type = "PCA", output.name = "UMAP", dim.prefix = "UMAP" )runUmap.se( x, num.dim = 2, min.dist = 0.1, num.neighbors = 15, num.threads = 1, more.umap.args = list(), reddim.type = "PCA", output.name = "UMAP", dim.prefix = "UMAP" )
x |
A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
num.dim |
Number of dimensions in the output embedding, passed to |
min.dist |
Minimum distance between observations, passed to |
num.neighbors |
Number of neighbors for constructing the fuzzy sets, passed to |
num.threads |
Number of threads for the UMAP, passed to |
more.umap.args |
Named list of further arguments to pass to |
reddim.type |
Integer or string specifying the existing embedding in the |
output.name |
String containing the name of the output |
dim.prefix |
String containing a prefix for the column names of the t-SNE coordinate matrix.
Each column is named as |
x is returned with the UMAP coordinates stored in the reducedDim.
Aaron Lun
library(SingleCellExperiment) sce <- getTestRnaData.se("pca") # Using fewer epochs for a faster-running example. sce <- runUmap.se(sce, more.umap.args=list(num.epochs=50)) head(reducedDim(sce, "UMAP"))library(SingleCellExperiment) sce <- getTestRnaData.se("pca") # Using fewer epochs for a faster-running example. sce <- runUmap.se(sce, more.umap.args=list(num.epochs=50)) head(reducedDim(sce, "UMAP"))
runUmap
Default parameters from the underlying C++ library.
These may be overridden by defaults in the runUmap function signature.
runUmapDefaults()runUmapDefaults()
Named list containing default values for various function arguments.
Aaron Lun
runUmapDefaults()runUmapDefaults()
Replace invalid size factors, i.e., zero, negative, infinite or NaN values.
Such size factors can occasionally arise if, e.g., insufficient quality control was performed upstream.
Removing them ensures that the normalized values from normalizeCounts remain finite for sensible downstream processing.
sanitizeSizeFactors( size.factors, replace.zero = NULL, replace.negative = NULL, replace.infinite = NULL, replace.nan = NULL, handle.zero = NULL, handle.negative = NULL, handle.infinite = NULL, handle.nan = NULL )sanitizeSizeFactors( size.factors, replace.zero = NULL, replace.negative = NULL, replace.infinite = NULL, replace.nan = NULL, handle.zero = NULL, handle.negative = NULL, handle.infinite = NULL, handle.nan = NULL )
size.factors |
Numeric vector of size factors across cells. |
replace.zero |
Deprecated, use |
replace.negative |
Deprecated, use |
replace.infinite |
Deprecated, use |
replace.nan |
Deprecated, use |
handle.zero |
String specifying how to handle replace size factors of zero.
If If |
handle.negative |
String specifying how to handle negative size factors.
If If |
handle.infinite |
String specifying how to handle infinite size factors.
If If |
handle.nan |
String specifying how to handle NaN size factors.
If If |
Numeric vector of length equal to size.factors, containing the sanitized size factors.
Aaron Lun
The sanitize_size_factors function in https://libscran.github.io/scran_norm/.
sf <- 2^rnorm(100) sf[1] <- 0 sf[2] <- -1 sf[3] <- Inf sf[4] <- NaN sanitizeSizeFactors(sf)sf <- 2^rnorm(100) sf[1] <- 0 sf[2] <- -1 sf[3] <- Inf sf[4] <- NaN sanitizeSizeFactors(sf)
sanitizeSizeFactors
Default parameters from the underlying C++ library.
These may be overridden by defaults in the sanitizeSizeFactors function signature.
sanitizeSizeFactorsDefaults()sanitizeSizeFactorsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
sanitizeSizeFactorsDefaults()sanitizeSizeFactorsDefaults()
Scale multiple embeddings (usually derived from different modalities for the same cells) so that their within-population variances are comparable, and then combine them into a single embedding matrix for further analyses like clustering, t-SNE, etc. The aim is to equalize uninteresting variance across modalities so that high technical variance in one modality does not drown out interesting biology in another modality.
scaleByNeighbors( x, num.neighbors = NULL, block = NULL, block.weight.policy = NULL, variable.block.weight = NULL, num.threads = NULL, weights = NULL, BNPARAM = AnnoyParam() )scaleByNeighbors( x, num.neighbors = NULL, block = NULL, block.weight.policy = NULL, variable.block.weight = NULL, num.threads = NULL, weights = NULL, BNPARAM = AnnoyParam() )
x |
List of numeric matrices of principal components or other embeddings, one for each modality. For each entry, rows are dimensions and columns are cells. All entries should have the same number of columns but may have different numbers of rows. |
num.neighbors |
Integer scalar specifying the number of neighbors to use to define the scaling factor. If |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in |
block.weight.policy |
String specifying the policy to use for weighting different blocks when computing the average scaling factor.
See the argument of the same name in If This argument is only used if |
variable.block.weight |
Numeric vector of length 2, specifying the parameters for variable block weighting.
See the argument of the same name in If This argument is only used if |
num.threads |
Integer scalar specifying the number of threads to use. If |
weights |
Numeric vector of length equal to that of |
BNPARAM |
A BiocNeighborParam object specifying how to perform the neighbor search. |
List containing scaling, a vector of scaling factors to be aplied to each embedding;
and combined, a numeric matrix creating by scaling each entry of x by scaling and then rbinding them together.
Aaron Lun
https://libscran.github.io/mumosa/, for the basis and caveats of this approach.
scaleByNeighbors.se, to combine embeddings in a SingleCellExperiment.
pcs <- list( gene = matrix(rnorm(10000), ncol=200), protein = matrix(rnorm(1000, sd=3), ncol=200), guide = matrix(rnorm(2000, sd=5), ncol=200) ) out <- scaleByNeighbors(pcs) out$scaling dim(out$combined)pcs <- list( gene = matrix(rnorm(10000), ncol=200), protein = matrix(rnorm(1000, sd=3), ncol=200), guide = matrix(rnorm(2000, sd=5), ncol=200) ) out <- scaleByNeighbors(pcs) out$scaling dim(out$combined)
Scale embeddings for different modalities to equalize their intra-population variance, and combine them into a single embedding for downstream analysis.
This calls scaleByNeighbors on the reduced dimensions of the main/alternative experiments in a SingleCellExperiment.
scaleByNeighbors.se( x, altexp.reddims, main.reddims = "PCA", num.neighbors = 20, block = NULL, BNPARAM = AnnoyParam(), num.threads = 1, more.scale.args = list(), output.name = "combined", meta.name = "combined", delayed.transpose = FALSE )scaleByNeighbors.se( x, altexp.reddims, main.reddims = "PCA", num.neighbors = 20, block = NULL, BNPARAM = AnnoyParam(), num.threads = 1, more.scale.args = list(), output.name = "combined", meta.name = "combined", delayed.transpose = FALSE )
x |
A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells. |
altexp.reddims |
Named list of character or integer vectors.
Each entry is named after an alternative experiment.
Each vector contains the names/indices of the |
main.reddims |
Character or integer vector specifying the names/indices of the |
num.neighbors |
Number of neighbors used to define the scaling factor, passed to |
block |
Block assignment for each cell, passed to |
BNPARAM |
Algorithm for the nearest neighbor search, passed to |
num.threads |
Number of threads for the neighbor search, passed to |
more.scale.args |
Named list of additional arguments to pass to |
output.name |
String containing the name of the |
meta.name |
String containing the name of the |
delayed.transpose |
Logical scalar indicating whether to delay the transposition when storing coordinates in the |
x is returned with the combined embeddings stored in its reducedDims.
The scaling factors for all embeddings are stored in the metadata.
Aaron Lun
library(SingleCellExperiment) sce <- getTestAdtData.se("pca") sce <- scaleByNeighbors.se(sce, altexp.reddims=list(ADT="PCA")) reducedDimNames(sce) metadata(sce)$combinedlibrary(SingleCellExperiment) sce <- getTestAdtData.se("pca") sce <- scaleByNeighbors.se(sce, altexp.reddims=list(ADT="PCA")) reducedDimNames(sce) metadata(sce)$combined
scaleByNeighbors
Default parameters from the underlying C++ library.
These may be overridden by defaults in the scaleByNeighbors function signature.
scaleByNeighborsDefaults(block = NULL)scaleByNeighborsDefaults(block = NULL)
block |
See the argument of the same name in |
Named list containing default values for various function arguments.
These values may change depending on whether block is supplied.
Aaron Lun
scaleByNeighborsDefaults()scaleByNeighborsDefaults()
Compute per-cell scores for a gene set, defined as the column sums of a rank-1 approximation to the submatrix for the gene set. This uses the same approach as the GSDecon package by Jason Hackney, adapted to use an approximate PCA (via IRLBA) and to support blocking.
scoreGeneSet( x, set, rank = NULL, scale = NULL, block = NULL, block.weight.policy = NULL, variable.block.weight = NULL, extra.work = NULL, iterations = NULL, tolerance = NULL, seed = NULL, realized = NULL, warn = TRUE, num.threads = NULL )scoreGeneSet( x, set, rank = NULL, scale = NULL, block = NULL, block.weight.policy = NULL, variable.block.weight = NULL, extra.work = NULL, iterations = NULL, tolerance = NULL, seed = NULL, realized = NULL, warn = TRUE, num.threads = NULL )
x |
A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Typically, the matrix is expected to contain log-expression values. |
set |
Vector specifying the rows of |
rank |
Integer scalar specifying the rank of the approximation. The default value of 1 assumes that each gene set only describes a single coordinated biological function. If |
scale |
Logical scalar indicating whether to scale all genes to have the same variance.
This ensures that each gene contributes equally to the PCA, favoring consistent variation across many genes rather than large variation in a few genes.
If If |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in |
block.weight.policy |
String specifying the policy to use for weighting the contribution of different blocks to the PCA.
See the argument of the same name in If This argument is only used if |
variable.block.weight |
Numeric vector of length 2, specifying the parameters for variable block weighting.
See the argument of the same name in If This argument is only used if |
extra.work |
Integer scalar specifying the number of extra dimensions for the IRLBA workspace. Larger values improve accuracy at the cost of compute time. If |
iterations |
Integer scalar specifying the maximum number of restart iterations for IRLBA. Larger values improve accuracy at the cost of compute time. If |
tolerance |
Number specifying the tolerance on the approximation error of the singular triplets, to determine IRLBA convergence. Lower values improve accuracy at the cost of compute time. If |
seed |
Integer scalar specifying the seed for the initial random vector in IRLBA. If |
realized |
Logical scalar indicating whether to realize If |
warn |
Boolean specifying whether a warning should be emitted if IRLBA failed to converge. |
num.threads |
Number of threads to use. If |
List containing:
scores, a numeric vector of per-cell scores for each column in x.
weights, a DataFrame containing row, an integer vector of ordered and unique row indices corresponding to the genes in set;
and weight, a numeric vector of per-gene weights for each gene in row.
Aaron Lun
The compute and compute_blocked functions in https://libscran.github.io/gsdecon/.
scoreGeneSet.se, to compute gene set scores from a SummarizedExperiment.
library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) scoreGeneSet(normed, set=c(1,3,5,10,20,100))library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) scoreGeneSet(normed, set=c(1,3,5,10,20,100))
Compute a gene set activity score for each cell based on the expression values of the genes in the set,
by calling scoreGeneSet on an assay of a SummarizedExperiment.
scoreGeneSet.se( x, set, block = NULL, num.threads = 1, more.score.args = list(), assay.type = "logcounts" )scoreGeneSet.se( x, set, block = NULL, num.threads = 1, more.score.args = list(), assay.type = "logcounts" )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells. |
set |
Vector containing the gene set, see |
block |
Block assignment for each cell, passed to |
num.threads |
Number of threads for |
more.score.args |
Named list of further arguments to pass to |
assay.type |
Integer or string specifying the relevant assay in |
List containing scores, a numeric vector of the gene set scores across all cells in x;
and weights, a numeric vector of weights for all genes in set.
Aaron Lun
# Defining a gene set of oligodendrocyte genes. library(org.Mm.eg.db) oligo.set <- select(org.Mm.eg.db, keytype="GO", keys="GO:0048709", columns="SYMBOL") oligo.set <- unique(oligo.set$SYMBOL) sce <- getTestRnaData.se("norm") oligo.scores <- scoreGeneSet.se(sce, oligo.set) summary(oligo.scores$scores)# Defining a gene set of oligodendrocyte genes. library(org.Mm.eg.db) oligo.set <- select(org.Mm.eg.db, keytype="GO", keys="GO:0048709", columns="SYMBOL") oligo.set <- unique(oligo.set$SYMBOL) sce <- getTestRnaData.se("norm") oligo.scores <- scoreGeneSet.se(sce, oligo.set) summary(oligo.scores$scores)
scoreGeneSet
Default parameters from the underlying C++ library.
These may be overridden by defaults in the scoreGeneSet function signature.
scoreGeneSetDefaults()scoreGeneSetDefaults()
Named list containing default values for various function arguments.
Aaron Lun
scoreGeneSetDefaults()scoreGeneSetDefaults()
Score marker genes for each group using a variety of effect sizes from pairwise comparisons between groups. This includes Cohen's d, the area under the curve (AUC), the difference in the means (delta-mean) and the difference in the proportion of detected cells (delta-detected). For each group, the strongest markers are those genes with the largest effect sizes (i.e., upregulated) when compared to all other groups.
scoreMarkers( x, groups, block = NULL, block.average.policy = NULL, block.weight.policy = NULL, variable.block.weight = NULL, block.quantile = NULL, compute.group.mean = NULL, compute.group.detected = NULL, compute.delta.mean = NULL, compute.delta.detected = NULL, compute.cohens.d = NULL, compute.auc = NULL, compute.summary.min = NULL, compute.summary.mean = NULL, compute.summary.median = NULL, compute.summary.max = NULL, compute.summary.quantiles = NULL, compute.summary.min.rank = NULL, threshold = NULL, all.pairwise = FALSE, top.index.only = FALSE, min.rank.limit = NULL, num.threads = NULL )scoreMarkers( x, groups, block = NULL, block.average.policy = NULL, block.weight.policy = NULL, variable.block.weight = NULL, block.quantile = NULL, compute.group.mean = NULL, compute.group.detected = NULL, compute.delta.mean = NULL, compute.delta.detected = NULL, compute.cohens.d = NULL, compute.auc = NULL, compute.summary.min = NULL, compute.summary.mean = NULL, compute.summary.median = NULL, compute.summary.max = NULL, compute.summary.quantiles = NULL, compute.summary.min.rank = NULL, threshold = NULL, all.pairwise = FALSE, top.index.only = FALSE, min.rank.limit = NULL, num.threads = NULL )
x |
A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells.
It is typically expected to contain log-expression values, e.g., from |
groups |
A vector specifying the group assignment for each cell in |
block |
Factor specifying the block of origin (e.g., batch, sample) for each cell in |
block.average.policy |
String specifying the policy to use for average statistics across blocks.
This can either be a (weighted) If This argument is only used if |
block.weight.policy |
String specifying the policy to use for weighting different blocks when computing the average for each statistic.
See the argument of the same name in If This argument is only used if |
variable.block.weight |
Numeric vector of length 2, specifying the parameters for variable block weighting.
See the argument of the same name in If THis arugment is only used if |
block.quantile |
Number specifying the probability of the quantile of statistics across blocks. Defaults to 0.5, i.e., the median of per-block statistics. If This argument is only used if |
compute.group.mean |
Boolean indicating whether to compute the group-wise mean expression for each gene. If |
compute.group.detected |
Boolean indicating whether to compute the group-wise proportion of detected cells for each gene. If |
compute.delta.mean |
Boolean indicating whether to compute the delta-means, i.e., the log-fold change when If |
compute.delta.detected |
Boolean indicating whether to compute the delta-detected, i.e., differences in the proportion of cells with detected expression. If |
compute.cohens.d |
Boolean indicating whether to compute Cohen's d. If |
compute.auc |
Boolean indicating whether to compute the AUC.
Setting this to If |
compute.summary.min |
Boolean specifying whether to compute the minimum as a summary statistic for each effect size. If This argument is only used if |
compute.summary.mean |
Boolean specifying whether to compute the mean as a summary statistic for each effect size. If This argument is only used if |
compute.summary.median |
Boolean specifying whether to compute the median as a summary statistic for each effect size. If This argument is only used if |
compute.summary.max |
Boolean specifying whether to compute the maximum as a summary statistic for each effect size. If This argument is only used if |
compute.summary.quantiles |
Numeric vector containing the probabilities of quantiles to compute as summary statistics for each effect size. If This argument is only used if |
compute.summary.min.rank |
Boolean specifying whether to compute the mininum rank as a summary statistic for each effect size. If This argument is only used if |
threshold |
Non-negative number specifying the minimum threshold on the differences in means (i.e., the log-fold change, if If |
all.pairwise |
Boolean indicating whether to report the effect sizes for every pairwise comparison between groups.
Alternatively, an integer indicating the number of top markers to report from each pairwise comparison between groups.
If |
top.index.only |
Boolean indicating whether to only report the indices of the top genes when This argument is ignored for all other values of |
min.rank.limit |
Integer specifying the maximum value of the min-rank to report. Lower values improve memory efficiency at the cost of discarding information about lower-ranked genes. If This argument is only used if |
num.threads |
Integer specifying the number of threads to use. If |
A named list containing:
nrow, integer specifying the number of rows in x.
row.names, character vector or NULL containing the row names of x.
group.ids, vector contaning the identities of the unique groups.
mean, a numeric matrix containing the mean expression for each group.
Each row is a gene and each column is a group in group.ids.
Omitted if compute.group.mean=FALSE.
detected, a numeric matrix containing the proportion of detected cells in each group.
Each row is a gene and each column is a group in group.ids.
Omitted if compute.group.detected=FALSE.
If all.pairwise=FALSE, the list also contains:
cohens.d, a list of DataFrames where each DataFrame corresponds to a group in group.ids.
Each row of a DataFrame represents a gene, while each column contains a summary of Cohen's d from pairwise comparisons to all other groups.
This includes min, mean, median, max, quantile.* and min.rank - check out ?summarizeEffects for details.
Omitted if compute.cohens.d=FALSE.
auc, a list like cohens.d but containing the summaries of the AUCs from each pairwise comparison.
Omitted if compute.auc=FALSE.
delta.mean, a list like cohens.d but containing the summaries of the delta-mean from each pairwise comparison.
Omitted if compute.delta.mean=FALSE.
delta.detected, a list like cohens.d but containing the summaries of the delta-detected from each pairwise comparison.
Omitted if compute.delta.detected=FALSE.
If all.pairwise=TRUE, the list also contains:
cohens.d, a 3-dimensional numeric array containing the Cohen's d from each pairwise comparison between groups.
The extents of the first two dimensions are equal to the number of groups in group.ids, while the extent of the final dimension is equal to the number of genes.
The entry cohens.d[i, j, k] represents Cohen's d from the comparison of group group.ids[j] over group group.ids[i] for gene k.
Omitted if compute.cohens.d=FALSE.
auc, an array like cohens.d but containing the AUCs from each pairwise comparison.
Omitted if compute.auc=FALSE.
delta.mean, an array like cohens.d but containing the delta-mean from each pairwise comparison.
Omitted if compute.delta.mean=FALSE.
delta.detected, an array like cohens.d but containing the delta-detected from each pairwise comparison.
Omitted if compute.delta.detected=FALSE.
If all.pairwise is an integer and top.index.only=FALSE, the list also contains:
cohens.d, a list of list of DataFrames containing the top genes with the largest Cohen's d for each pairwise comparison.
Specifically, cohens.d[[i]][[j]] is a DataFrame that contains the top all.pairwise genes from the comparison of group group.ids[i] over group group.ids[j].
Each DataFrame contains an index column, the row index of the gene; and an effect column, the Cohen's d for that gene.
Omitted if compute.cohens.d=FALSE.
auc, a list of list of DataFrames like cohens.d but containing the AUCs from each pairwise comparison.
Omitted if compute.auc=FALSE.
delta.mean, a list of list of DataFrames like cohens.d but containing the delta-mean from each pairwise comparison.
Omitted if compute.delta.mean=FALSE.
delta.detected, a list of list of DataFrames like cohens.d but containing the delta-detected from each pairwise comparison.
Omitted if compute.delta.detected=FALSE.
If all.pairwise is an integer and top.index.only=TRUE, the list also contains:
cohens.d, a list of list of integer vectors containing the row indices of the top genes with the largest Cohen's d for each pairwise comparison.
Specifically, cohens.d[[i]][[j]] is a vector that contains the top all.pairwise genes from the comparison of group group.ids[i] over group group.ids[j].
Omitted if compute.cohens.d=FALSE.
auc, a list of list of DataFrames like cohens.d but containing the AUCs from each pairwise comparison.
Omitted if compute.auc=FALSE.
delta.mean, a list of list of DataFrames like cohens.d but containing the delta-mean from each pairwise comparison.
Omitted if compute.delta.mean=FALSE.
delta.detected, a list of list of DataFrames like cohens.d but containing the delta-detected from each pairwise comparison.
Omitted if compute.delta.detected=FALSE.
The delta-mean is the difference in the mean expression between groups. This is fairly straightforward to interpret - a positive delta-mean corresponds to increased expression in the first group compared to the second. The delta-mean can also be treated as the log-fold change if the input matrix contains log-transformed normalized expression values.
The delta-detected is the difference in the proportion of cells with detected expression between groups. This lies between 1 and -1, with the extremes occurring when a gene is silent in one group and detected in all cells of the other group. For this interpretation, we assume that the input matrix contains non-negative expression values, where a value of zero corresponds to lack of detectable expression.
Cohen's d is the standardized difference between two groups. This is defined as the difference in the mean for each group scaled by the average standard deviation across the two groups. (Technically, we should use the pooled variance; however, this introduces some unintuitive asymmetry depending on the variance of the larger group, so we take a simple average instead.) A positive value indicates that the gene has increased expression in the first group compared to the second. Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups. We can also interpret Cohen's d as the number of standard deviations between the two group means.
The area under the curve (AUC) is the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group. Values greater than 0.5 indicate that a gene is upregulated in the first group. The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test. The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group, e.g., if two distributions exhibit no overlap, the AUC is the same regardless of the variance of each distribution. This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a fine-grained ranking.
Setting a minimum change threshold (i.e., threshold) prioritizes genes with large shifts in expression instead of those with low variances.
Currently, only positive thresholds are supported, which focuses on genes that are upregulated in the first group compared to the second.
The effect size definitions are generalized when testing against a non-zero threshold:
Cohen's d is redefined as the standardized difference between the difference in means and the specified threshold, analogous to the TREAT method from the limma package. Large positive values are only obtained when the observed difference in means is significantly greater than the threshold. For example, if we had a threshold of 2 and we obtained a Cohen's d of 3, this means that the observed difference in means was 3 standard deviations greater than 2. Note that a negative Cohen's d cannot be intepreted as downregulation, as the difference in means may still be positive but less than the threshold.
The AUC is generalized to the probability of obtaining a random observation in one group that is greater than a random observation plus the threshold in the other group. For example, if we had a threshold of 2 and we obtained an AUC of 0.8, this means that, 80 the random observation from the first group would be greater than a random observation from the second group by 2 or more. Again, AUCs below 0.5 cannot be interpreted as downregulation, as it may be caused by a positive shift that is less than the threshold.
The score_markers_summary, score_markers_pairwise and score_markers_best functions in https://libscran.github.io/scran_markers/.
See their blocked equivalents (e.g., score_markers_summary_blocked) when block is specified.
summarizeEffects, to summarize the pairwise effects returned when all.pairwise=TRUE.
reportGroupMarkerStatistics, to consolidate the statistics for a single group into its own data frame.
scoreMarkers.se, to score markers from a SummarizedExperiment.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) # Compute marker summaries for each group: g <- sample(letters[1:4], ncol(x), replace=TRUE) scores <- scoreMarkers(normed, g) names(scores) head(scores$mean) head(scores$cohens.d[["a"]]) # Report marker statistics for a single group: reportGroupMarkerStatistics(scores, "b")# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) # Compute marker summaries for each group: g <- sample(letters[1:4], ncol(x), replace=TRUE) scores <- scoreMarkers(normed, g) names(scores) head(scores$mean) head(scores$cohens.d[["a"]]) # Report marker statistics for a single group: reportGroupMarkerStatistics(scores, "b")
Identify candidate marker genes based on effect sizes from pairwise comparisons between groups of cells,
by calling scoreMarkers on an assay of a SummarizedExperiment.
scoreMarkers.se( x, groups, block = NULL, num.threads = 1, more.marker.args = list(), assay.type = "logcounts", extra.columns = NULL, order.by = TRUE ) formatScoreMarkersResult(marker.res, extra.columns = NULL, order.by = TRUE) previewMarkers( marker.df, columns = c("mean", "detected", lfc = "delta.mean.mean"), pre.columns = NULL, post.columns = NULL, rows = 10, order.by = NULL, include.order.by = !is.null(order.by) )scoreMarkers.se( x, groups, block = NULL, num.threads = 1, more.marker.args = list(), assay.type = "logcounts", extra.columns = NULL, order.by = TRUE ) formatScoreMarkersResult(marker.res, extra.columns = NULL, order.by = TRUE) previewMarkers( marker.df, columns = c("mean", "detected", lfc = "delta.mean.mean"), pre.columns = NULL, post.columns = NULL, rows = 10, order.by = NULL, include.order.by = !is.null(order.by) )
x |
A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells. |
groups |
Group assignment for each cell, passed to |
block |
Block assignment for each cell, passed to |
num.threads |
Number of threads for marker scoring, passed to |
more.marker.args |
Named list of additional arguments to pass to |
assay.type |
Integer or string specifying the assay to use for differential comparisons, usually containing log-normalized expression values. |
extra.columns |
DataFrame containing extra columns to add each DataFrame.
This should have the same number of rows as |
order.by |
String specifying the column to order each DataFrame by.
Alternatively |
marker.res |
List containing the result of |
marker.df |
DataFrame containing the marker statistics for a single group. |
columns |
Character vector of the names of columns to retain in the preview. This may be named, in which the names are used as the column names. |
pre.columns, post.columns
|
Character vector of the names of additional columns to retain in the preview.
These are added before or after the columns in |
rows |
Integer specifying the number of rows to show.
If |
include.order.by |
Boolean indicating whether the column specified by |
For scoreMarkers.se and formatScoreMarkersResult, a List of DataFrames is returned.
Each DataFrame corresponds to a unique group in groups.
Each row contains statistics for a gene in x, with the following columns:
mean, the mean expression in the current group.
detected, the proportion of cells with detected expression in the current group.
<effect>.<summary>, a summary statistic for an effect size,
e.g., cohens.d.mean contains the mean Cohen's d across comparisons involving the current group.
For previewMarkers, a DataFrame is returned containing the specified columns and rows.
Aaron Lun
sce <- getTestRnaData.se("cluster") markers <- scoreMarkers.se(sce, sce$clusters) previewMarkers(markers[["1"]])sce <- getTestRnaData.se("cluster") markers <- scoreMarkers.se(sce, sce$clusters) previewMarkers(markers[["1"]])
scoreMarkers
Default parameters from the underlying C++ library.
These may be overridden by defaults in the scoreMarkers function signature.
scoreMarkersDefaults(all.pairwise = FALSE)scoreMarkersDefaults(all.pairwise = FALSE)
all.pairwise |
See the argument of the same name in |
Named list containing default values for various function arguments.
These defaults may change depending on the values for all.pairwise.
scoreMarkersDefaults()scoreMarkersDefaults()
Subsample a dataset by selecting cells to represent all of their nearest neighbors. The aim is to preserve the relative density of the original dataset while guaranteeing representation of low-frequency subpopulations.
subsampleByNeighbors( x, num.neighbors = NULL, min.remaining = NULL, num.threads = NULL, BNPARAM = AnnoyParam() )subsampleByNeighbors( x, num.neighbors = NULL, min.remaining = NULL, num.threads = NULL, BNPARAM = AnnoyParam() )
x |
A numeric matrix where rows are dimensions and columns are cells,
typically containing a low-dimensional representation from, e.g., Alternatively, an index constructed by Alternatively, a list containing existing nearest-neighbor search results. This should contain:
The number of neighbors should be equal to |
num.neighbors |
Integer scalar specifying the number of neighbors to use. Larger values result in stronger downsampling. If This argument is ignored if |
min.remaining |
Integer specifying the minimum number of remaining neighbors that a cell must have in order to be considered for selection.
This should be less than or equal to If |
num.threads |
Integer scalar specifying the number of threads to use for the nearest-neighbor search. If This argument is only used if |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use. This argument is only used if |
Starting from the densest region in the high-dimensional space, we select an observation for inclusion into the subsampled dataset. Every time we select an observation, we remove it and all of its nearest neighbors from the dataset. We then select the next observation with the most remaining neighbors, with ties broken by density; this is repeated until there are no more observations.
The premise is that each selected observation serves as a representative for its nearest neighbors. This ensures that the subsampled points are well-distributed across the original dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. We also preserve the relative density of the original dataset as more representatives will be generated from high-density regions.
Integer vector with the indices of the selected cells in the subsample.
Aaron Lun
https://libscran.github.io/nenesub/, for more details on the underlying algorithm.
x <- matrix(rnorm(10000), nrow=2) keep <- subsampleByNeighbors(x, 10) plot(x[1,], x[2,]) points(x[1,keep], x[2,keep], col="red") legend('topright', col=c('black', 'red'), legend=c('all', 'subsample'), pch=1)x <- matrix(rnorm(10000), nrow=2) keep <- subsampleByNeighbors(x, 10) plot(x[1,], x[2,]) points(x[1,keep], x[2,keep], col="red") legend('topright', col=c('black', 'red'), legend=c('all', 'subsample'), pch=1)
subsampleByNeighbors
Default parameters from the underlying C++ library.
These may be overridden by defaults in the subsampleByNeighbors function signature.
subsampleByNeighborsDefaults()subsampleByNeighborsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
subsampleByNeighborsDefaults()subsampleByNeighborsDefaults()
Subsample from a partitioning of cells, preserving the frequency of cells across partitions. The subsample is typically used as a smaller representative dataset that still exhibits the same distribution of cells as the full dataset.
subsampleByPartition(partitions, number, seed = NULL, force.non.empty = NULL)subsampleByPartition(partitions, number, seed = NULL, force.non.empty = NULL)
partitions |
Factor, integer vector or character vector of length equal to the number of cells, containing the assignment of each cell to a partition (e.g., clustering). |
number |
Integer specifying the number of cells to retain in the subsample.
If greater than the length of |
seed |
Integer specifying the seed for the random number generator. If |
force.non.empty |
Boolean indicating whether each partition should have at least one cell in the subsample.
If If |
Integer vector containing the indices of the cells retained in the subsample.
Aaron Lun
subsampleByNeighbors, for an alternative downsampling strategy.
x <- matrix(rnorm(10000), nrow=5) partitions <- clusterKmeans(x, k=20) keep <- subsampleByPartition(partitions$clusters, 100) table(partitions$clusters[keep]) sub.x <- x[,keep] # subsample to be used for expensive downstream steps. # Rare partitions are retained by default: grouping <- c(LETTERS[1:10], sample(LETTERS[11:26], 990, replace=TRUE)) keep <- subsampleByPartition(grouping, 100) table(grouping[keep]) keep <- subsampleByPartition(grouping, 100, force.non.empty=FALSE) table(grouping[keep])x <- matrix(rnorm(10000), nrow=5) partitions <- clusterKmeans(x, k=20) keep <- subsampleByPartition(partitions$clusters, 100) table(partitions$clusters[keep]) sub.x <- x[,keep] # subsample to be used for expensive downstream steps. # Rare partitions are retained by default: grouping <- c(LETTERS[1:10], sample(LETTERS[11:26], 990, replace=TRUE)) keep <- subsampleByPartition(grouping, 100) table(grouping[keep]) keep <- subsampleByPartition(grouping, 100, force.non.empty=FALSE) table(grouping[keep])
subsampleByPartition
Default parameters from the underlying C++ library.
These may be overridden by defaults in the subsampleByPartition function signature.
subsampleByPartitionDefaults()subsampleByPartitionDefaults()
Named list containing default values for various function arguments.
Aaron Lun
subsampleByPartitionDefaults()subsampleByPartitionDefaults()
For each group, summarize the effect sizes for all pairwise comparisons to other groups. This yields a set of summary statistics that can be used to rank marker genes for each group.
summarizeEffects( effects, compute.summary.min = NULL, compute.summary.mean = NULL, compute.summary.median = NULL, compute.summary.max = NULL, compute.summary.quantiles = NULL, compute.summary.min.rank = NULL, num.threads = NULL )summarizeEffects( effects, compute.summary.min = NULL, compute.summary.mean = NULL, compute.summary.median = NULL, compute.summary.max = NULL, compute.summary.quantiles = NULL, compute.summary.min.rank = NULL, num.threads = NULL )
effects |
A 3-dimensional numeric containing the effect sizes from each pairwise comparison between groups.
The extents of the first two dimensions are equal to the number of groups, while the extent of the final dimension is equal to the number of genes.
The entry |
compute.summary.min |
Boolean specifying whether to compute the minimum as a summary statistic. If |
compute.summary.mean |
Boolean specifying whether to compute the mean as a summary statistic. If |
compute.summary.median |
Boolean specifying whether to compute the median as a summary statistic. If |
compute.summary.max |
Boolean specifying whether to compute the maximum as a summary statistic. If |
compute.summary.quantiles |
Numeric scalars containing the probabilities of quantiles to compute as summary statistics. If |
compute.summary.min.rank |
Boolean specifying whether to compute the mininum rank as a summary statistic. If |
num.threads |
Integer scalar specifying the number of threads to use. If |
Each summary statistic can be used to prioritize different sets of marker genes for the group of interest, by ranking them in decreasing order according to said statistic:
min contains the minimum effect size across all comparisons involving the group of interest.
Genes with large values are upregulated in all comparisons.
As such, it is the most stringent summary as markers will only have large values if they are uniquely upregulated in the group of interest compared to every other group.
mean contains the mean effect size across all comparisons involving the group of interest.
Genes with large values are upregulated on average compared to the other groups.
This is a good general-purpose summary statistic.
median contains the median effect size across all comparisons involving the group of interest.
Genes with large values are upregulated compared to most (i.e., at least 50
Compared to the mean, this is more robust to outlier effects but less sensitive to strong effects in a minority of comparisons.
max contains the maximum effect size across all comparisons involving the group of interest.
Using this to define markers will focus on genes that are upregulated in at least one comparison.
As such, it is the least stringent summary as markers can achieve large values if they are upregulated in the group of interest compared to any one other group.
quantile[[P]] contains the quantile P across all comparisons involving the group of interest.
This is a generalization of the minimum, median and maximum for arbitrary quantile probabilities.
For example, a large quantile[["20"]] would mean that the gene is upregulated in the group of interest compared to 80
The exact definition of “large” depends on the choice of effect size.
For signed effects like Cohen's d, delta-mean and delta-detected, the value must be positive to be considered “large”.
For the AUC, a value greater than 0.5 is considered “large”.
This interpretation is also affected by the choice of threshold used to compute each effect size in scoreMarkers,
e.g., a negative Cohen's d cannot be interpreted as downregulation when the threshold is positive.
The min.rank is a more exotic summary statistic, containing the minimum rank for each gene across all comparisons involving the group of interest.
This is defined by ranking the effect sizes across genes within each comparison, and then taking the minimum of these ranks across comparisons.
Taking all genes with min.rank <= T will yield a set containing the top T genes from each comparison.
The idea is to ensure that there are at least T genes that can distinguish the group of interest from any other group.
NaN effect sizes are allowed, e.g., if two groups do not exist in the same block for a blocked analysis in scoreMarkers with block.
This function will ignore NaN values when computing each summary.
If all effects are NaN for a particular group, the summary statistic will also be NaN.
List of DataFrames containing summary statistics for the effect sizes.
Each DataFrame corresponds to a group, each row corresponds to a gene, and each column contains a summary statistic.
If compute.summary.quantiles is provided, the "quantile" column is a nested DataFrame where each column coresponds to a probability in compute.summary.quantiles.
Aaron Lun
The summarize_effects function in https://libscran.github.io/scran_markers/.
scoreMarkers, to compute the pairwise effects in the first place.
# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) g <- sample(letters[1:4], ncol(x), replace=TRUE) effects <- scoreMarkers(normed, g, all.pairwise=TRUE) summarized <- summarizeEffects(effects$cohens.d) str(summarized)# Mocking a matrix: library(Matrix) x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100)) normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x))) g <- sample(letters[1:4], ncol(x), replace=TRUE) effects <- scoreMarkers(normed, g, all.pairwise=TRUE) summarized <- summarizeEffects(effects$cohens.d) str(summarized)
summarizeEffectsDefaults
Default parameters from the underlying C++ library.
These may be overridden by defaults in the summarizeEffects function signature.
summarizeEffectsDefaults()summarizeEffectsDefaults()
Named list containing default values for various function arguments.
Aaron Lun
summarizeEffectsDefaults()summarizeEffectsDefaults()
Perform a hypergeometric test for enrichment of gene sets in a list of interesting genes (e.g., markers).
testEnrichment(x, sets, universe = NULL, log = NULL, num.threads = 1)testEnrichment(x, sets, universe = NULL, log = NULL, num.threads = 1)
x |
Vector of identifiers for some interesting genes, e.g., symbols or Ensembl IDs.
This is usually derived from a selection of top markers, e.g., from |
sets |
List of vectors of identifiers for the pre-defined gene sets.
Each inner vector corresponds to a gene set and should contain the same type of identifiers as |
universe |
Vector of identifiers for the universe of genes in the dataset.
Alternatively, an integer scalar specifying the number of genes in the universe.
This is assumed to be greater than or equal to the number of unique genes in |
log |
Logical scalar indicating whether to report log-transformed p-values. This may be desirable to avoid underflow at near-zero p-values. If |
num.threads |
Integer scalar specifying the number of threads to use. |
DataFrame with one row per gene set and the following columns:
overlap, the overlap between x and each entry of sets, i.e., the number of genes in the intersection.
size, the set of each entry of sets.
p.value, the (possibly log-transformed) p-value for overrepresentation of the gene set in x.
Aaron Lun
phyper and https://libscran.github.io/phyper/,
which is the basis for the underlying calculation.
testEnrichment( x=LETTERS[1:5], sets=list( first=LETTERS[1:10], second=LETTERS[1:5 * 2], third=LETTERS[10:20] ), universe=LETTERS )testEnrichment( x=LETTERS[1:5], sets=list( first=LETTERS[1:10], second=LETTERS[1:5 * 2], third=LETTERS[10:20] ), universe=LETTERS )
testEnrichment
Default parameters from the underlying C++ library.
These may be overridden by defaults in the testEnrichment function signature.
testEnrichmentDefaults()testEnrichmentDefaults()
Named list containing default values for various function arguments.
Aaron Lun
testEnrichmentDefaults()testEnrichmentDefaults()