Title: | Methods for Single-Cell RNA-Seq Data Analysis |
---|---|
Description: | Implements miscellaneous functions for interpretation of single-cell RNA-seq data. Methods are provided for assignment of cell cycle phase, detection of highly variable and significantly correlated genes, identification of marker genes, and other common tasks in routine single-cell analysis workflows. |
Authors: | Aaron Lun [aut, cre], Karsten Bach [aut], Jong Kyoung Kim [ctb], Antonio Scialdone [ctb] |
Maintainer: | Aaron Lun <[email protected]> |
License: | GPL-3 |
Version: | 1.35.0 |
Built: | 2024-12-21 06:15:45 UTC |
Source: | https://github.com/bioc/scran |
Perform a Benjamini-Hochberg correction on log-transformed p-values to get log-adjusted p-values, without the loss of precision from undoing and redoing the log-transformations.
.logBH(log.p.val)
log.p.val |
Numeric vector of log-transformed p-values. |
A numeric vector of the same length as log.p.val
containing log-transformed BH-corrected p-values.
Aaron Lun
log.p.values <- log(runif(1000))
obs <- .logBH(log.p.values)
head(obs)

ref <- log(p.adjust(exp(log.p.values), method="BH"))
head(ref)
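For intuition, the same computation can be reproduced in base R: sort the log-p-values, add log(n) - log(rank), and take a running minimum from the largest value downwards (the step-up part of BH), all without leaving log-space. The helper name log_bh below is hypothetical, not the package internal:

```r
# A base-R sketch of Benjamini-Hochberg correction performed entirely in
# log-space, mirroring what .logBH computes (log_bh is a hypothetical name,
# not the package internal).
log_bh <- function(log.p.val) {
    n <- length(log.p.val)
    o <- order(log.p.val)
    # log(p * n / rank) = log(p) + log(n) - log(rank), computed on sorted values.
    penalized <- log.p.val[o] + log(n) - log(seq_len(n))
    # BH step-up: running minimum from the largest p-value downwards,
    # capped at log(1) = 0.
    adjusted <- pmin(rev(cummin(rev(penalized))), 0)
    out <- numeric(n)
    out[o] <- adjusted  # restore the original order
    out
}

lp <- log(runif(1000))
all.equal(exp(log_bh(lp)), p.adjust(exp(lp), method="BH"))  # TRUE
```

Keeping everything in log-space avoids the underflow that exp() would introduce for very small p-values.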
SingleCellExperiment-friendly wrapper around the makeSNNGraph
and makeKNNGraph
functions for creating nearest-neighbor graphs.
buildSNNGraph(x, ...)

## S4 method for signature 'ANY'
buildSNNGraph(
  x,
  ...,
  d = 50,
  transposed = FALSE,
  subset.row = NULL,
  BSPARAM = bsparam(),
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
buildSNNGraph(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
buildSNNGraph(x, ..., use.dimred = NULL)

buildKNNGraph(x, ...)

## S4 method for signature 'ANY'
buildKNNGraph(
  x,
  ...,
  d = 50,
  transposed = FALSE,
  subset.row = NULL,
  BSPARAM = bsparam(),
  BPPARAM = SerialParam()
)

## S4 method for signature 'SingleCellExperiment'
buildKNNGraph(x, ..., use.dimred = NULL)
x |
A matrix-like object containing expression values for each gene (row) in each cell (column).
These dimensions can be transposed if transposed=TRUE. Alternatively, a SummarizedExperiment or SingleCellExperiment containing such an expression matrix. If x is a SingleCellExperiment, existing dimensionality reduction results can be used instead via use.dimred. |
... |
For the generics, additional arguments to pass to the specific methods. For the ANY methods, additional arguments to pass to makeSNNGraph or makeKNNGraph. For the SummarizedExperiment methods, additional arguments to pass to the corresponding ANY method. For the SingleCellExperiment methods, additional arguments to pass to the corresponding SummarizedExperiment method. |
d |
An integer scalar specifying the number of dimensions to use for a PCA on the expression matrix prior to the nearest neighbor search.
Ignored for the ANY method if |
transposed |
A logical scalar indicating whether x is transposed (i.e., rows are cells). |
subset.row |
See |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for the PCA, if d is not NA. |
BPPARAM |
A BiocParallelParam object to use for parallel processing. |
assay.type |
A string specifying which assay values to use. |
use.dimred |
A string specifying whether existing values in reducedDims(x) should be used. |
A graph where nodes are cells and edges represent connections between nearest neighbors,
see ?makeSNNGraph
for more details.
Aaron Lun
makeSNNGraph
and makeKNNGraph
, for the underlying functions that do the work.
See cluster_walktrap
and related functions in igraph for clustering based on the produced graph.
clusterCells
, for a more succinct way of performing graph-based clustering.
library(scuttle)
sce <- mockSCE(ncells=500)
sce <- logNormCounts(sce)

g <- buildSNNGraph(sce)
clusters <- igraph::cluster_fast_greedy(g)$membership
table(clusters)

# Any clustering method from igraph can be used:
clusters <- igraph::cluster_walktrap(g)$membership
table(clusters)

# Smaller 'k' usually yields finer clusters:
g <- buildSNNGraph(sce, k=5)
clusters <- igraph::cluster_walktrap(g)$membership
table(clusters)

# Graph can be built off existing reducedDims results:
sce <- scater::runPCA(sce)
g <- buildSNNGraph(sce, use.dimred="PCA")
clusters <- igraph::cluster_fast_greedy(g)$membership
table(clusters)
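To see what a shared nearest-neighbor graph encodes, here is a deliberately simplified base-R sketch that weights each pair of cells by the number of neighbors shared between their k-nearest-neighbor sets. makeSNNGraph defaults to a rank-based weighting scheme, so treat this as an illustration of the concept rather than the actual algorithm; snn_weights is a hypothetical name:

```r
# Simplified SNN sketch: the weight between two cells is the number of
# neighbors shared between their k-nearest-neighbor sets (the actual
# makeSNNGraph default uses a rank-based weighting scheme instead).
snn_weights <- function(coords, k = 3) {
    d <- as.matrix(dist(coords))
    n <- nrow(d)
    # Indices of each cell's k nearest neighbors, excluding the cell itself.
    nns <- lapply(seq_len(n), function(i) order(d[i, ])[2:(k + 1)])
    w <- matrix(0, n, n)
    for (i in seq_len(n - 1)) {
        for (j in (i + 1):n) {
            w[i, j] <- w[j, i] <- length(intersect(nns[[i]], nns[[j]]))
        }
    }
    w
}

set.seed(1)
coords <- matrix(rnorm(20), ncol = 2)  # 10 cells in 2 dimensions
w <- snn_weights(coords, k = 3)
range(w)  # weights lie between 0 and k
```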
A SingleCellExperiment-compatible wrapper around clusterRows
from the bluster package.
clusterCells(
  x,
  assay.type = NULL,
  use.dimred = NULL,
  BLUSPARAM = NNGraphParam(),
  ...
)
x |
A SummarizedExperiment or SingleCellExperiment object containing cells in the columns. |
assay.type |
Integer or string specifying the assay values to use for clustering, typically log-normalized expression. |
use.dimred |
Integer or string specifying the reduced dimensions to use for clustering, typically PC scores.
Only used when assay.type=NULL. |
BLUSPARAM |
A BlusterParam object specifying the clustering algorithm to use, defaults to a graph-based method. |
... |
Further arguments to pass to clusterRows. |
This is largely a convenience wrapper to avoid the need to manually extract the relevant assays or reduced dimensions from x
.
Altering BLUSPARAM
can easily change the parameters or algorithm used for clustering -
see ?"BlusterParam-class"
for more details.
A factor of cluster identities for each cell in x
,
or a list containing such a factor - see the return value of ?clusterRows
.
Aaron Lun
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# From log-expression values:
clusters <- clusterCells(sce, assay.type="logcounts")

# From PCs:
sce <- scater::runPCA(sce)
clusters2 <- clusterCells(sce, use.dimred="PCA")

# With different parameters:
library(bluster)
clusters3 <- clusterCells(sce, use.dimred="PCA", BLUSPARAM=NNGraphParam(k=5))

# With different algorithms:
clusters4 <- clusterCells(sce, use.dimred="PCA", BLUSPARAM=KmeansParam(centers=10))
Combine DataFrames of statistics computed separately for each block. This usually refers to feature-level statistics and sample-level blocks.
combineBlocks(
  blocks,
  ave.fields,
  pval.field,
  method,
  geometric,
  equiweight,
  weights,
  valid
)
blocks |
A list of DataFrames containing blockwise statistics. These should have the same number of rows and the same set of columns. |
ave.fields |
Character vector specifying the columns of each entry of blocks to be averaged and stored in the output DataFrame. |
pval.field |
String specifying the column of each entry of blocks containing the p-values to be combined. |
method |
String specifying how p-values should be combined, see combineParallelPValues. |
geometric |
Logical scalar indicating whether the geometric mean should be computed when averaging the ave.fields statistics. |
equiweight |
Logical scalar indicating whether each block should be given equal weight. |
weights |
Numeric vector of length equal to blocks, containing the weight to be applied to each block. Only used if equiweight=FALSE. |
valid |
Logical vector indicating whether each block is valid.
Invalid blocks are still stored in the per.block output but are not used to compute the combined statistics. |
A DataFrame containing all fields in ave.fields
and the p-values,
where each column is created by combining the corresponding block-specific columns.
A per.block
column is also reported, containing a DataFrame of the DataFrames of blockwise statistics.
Aaron Lun
This function is used in modelGeneVar
and friends, combineVar
and testLinearModel
.
library(scuttle)
sce <- mockSCE()

y1 <- sce[,1:100]
y1 <- logNormCounts(y1) # normalize separately after subsetting.
results1 <- modelGeneVar(y1)

y2 <- sce[,1:100 + 100]
y2 <- logNormCounts(y2) # normalize separately after subsetting.
results2 <- modelGeneVar(y2)

# A manual implementation of combineVar:
combineBlocks(list(results1, results2),
    ave.fields=c("mean", "total", "bio", "tech"),
    pval.field='p.value',
    method='fisher',
    geometric=FALSE,
    equiweight=TRUE,
    weights=NULL,
    valid=c(TRUE, TRUE))
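The core of the equiweighted combination can be sketched in base R: average a chosen field across blocks and combine the p-values (here with Fisher's method). This ignores weighting, geometric means and invalid blocks, which combineBlocks handles properly; combine_blocks_sketch is a hypothetical helper:

```r
# A minimal sketch of equiweighted blockwise combination: average a chosen
# field across blocks and combine p-values with Fisher's method.
# combine_blocks_sketch is a hypothetical helper, not the scran internal.
combine_blocks_sketch <- function(blocks, ave.field = "mean", pval.field = "p.value") {
    ave <- rowMeans(sapply(blocks, function(b) b[[ave.field]]))
    logp <- sapply(blocks, function(b) log(b[[pval.field]]))
    # Fisher's method: -2 * sum(log(p)) ~ chi-squared with 2k d.f.
    p.value <- pchisq(-2 * rowSums(logp), df = 2 * ncol(logp), lower.tail = FALSE)
    data.frame(ave = ave, p.value = p.value)
}

set.seed(2)
b1 <- data.frame(mean = rnorm(5, 1), p.value = runif(5))
b2 <- data.frame(mean = rnorm(5, 1), p.value = runif(5))
combine_blocks_sketch(list(b1, b2))
```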
Combine multiple pairwise differential expression comparisons between groups or clusters into a single ranked list of markers for each cluster.
combineMarkers(
  de.lists,
  pairs,
  pval.field = "p.value",
  effect.field = "logFC",
  pval.type = c("any", "some", "all"),
  min.prop = NULL,
  log.p.in = FALSE,
  log.p.out = log.p.in,
  output.field = NULL,
  full.stats = FALSE,
  sorted = TRUE,
  flatten = TRUE,
  BPPARAM = SerialParam()
)
de.lists |
A list-like object where each element is a data.frame or DataFrame. Each element should represent the results of a pairwise comparison between two groups/clusters, in which each row should contain the statistics for a single gene/feature. Rows should be named by the feature name in the same order for all elements. |
pairs |
A matrix, data.frame or DataFrame with two columns and number of rows equal to the length of de.lists, where each row specifies the pair of groups/clusters being compared for the corresponding element of de.lists. |
pval.field |
A string specifying the column name of each element of de.lists that contains the p-value. |
effect.field |
A string specifying the column name of each element of de.lists that contains the effect size. If NULL, effect sizes are not reported in the output. |
pval.type |
A string specifying how p-values are to be combined across pairwise comparisons for a given group/cluster. |
min.prop |
Numeric scalar specifying the minimum proportion of significant comparisons per gene.
Defaults to 0.5 when pval.type="some", otherwise defaults to zero. |
log.p.in |
A logical scalar indicating if the p-values in de.lists were log-transformed. |
log.p.out |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
output.field |
A string specifying the prefix of the field names containing the effect sizes.
Defaults to "stats" if full.stats=TRUE, otherwise it is set to effect.field. |
full.stats |
A logical scalar indicating whether all statistics in de.lists should be stored in the output for each pairwise comparison. |
sorted |
Logical scalar indicating whether each output DataFrame should be sorted by a statistic relevant to pval.type. |
flatten |
Logical scalar indicating whether the individual effect sizes should be flattened in the output DataFrame.
If FALSE, the effect sizes are reported as a nested matrix for easier programmatic use. |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
An obvious strategy to characterizing differences between clusters is to look for genes that are differentially expressed (DE) between them.
However, this entails a number of comparisons between all pairs of clusters to comprehensively identify genes that define each cluster.
For all pairwise comparisons involving a single cluster, we would like to consolidate the DE results into a single list of candidate marker genes.
Doing so is the purpose of the combineMarkers
function.
DE statistics from any testing regime can be supplied to this function - see the Examples for how this is done with t-tests from pairwiseTTests
.
The effect size field in the output will vary according to the type of input statistics, for example:
logFC.Y
from pairwiseTTests
, containing log-fold changes in mean expression (usually in base 2).
AUC.Y
from pairwiseWilcox
, containing the area under the curve, i.e., the concordance probability.
logFC.Y
from pairwiseBinom
, containing log2-fold changes in the expressing proportion.
A named List of DataFrames where each DataFrame contains the consolidated marker statistics for each gene (row) for the cluster of the same name.
The DataFrame for cluster X contains the fields:
Top
:Integer, the minimum rank across all pairwise comparisons - see ?computeMinRank
for details.
This is only reported if pval.type="any"
.
p.value
:Numeric, the combined p-value across all comparisons if log.p.out=FALSE
.
FDR
:Numeric, the BH-adjusted p-value for each gene if log.p.out=FALSE
.
log.p.value
:Numeric, the (natural) log-transformed p-value.
Replaces the p.value
field if log.p.out=TRUE
.
log.FDR
:Numeric, the (natural) log-transformed adjusted p-value.
Replaces the FDR
field if log.p.out=TRUE
.
summary.<OUTPUT>
:Numeric, named by replacing <OUTPUT>
with output.field
.
This contains the summary effect size, obtained by combining effect sizes from all pairwise comparison into a single value.
Only reported when effect.field
is not NULL
.
<OUTPUT>.Y
:Comparison-specific statistics, named by replacing <OUTPUT>
with output.field
.
One of these fields is present for every other cluster Y in clusters and contains statistics for the comparison of X to Y.
If full.stats=FALSE, each field is numeric and contains the effect size of the comparison of X over Y.
Otherwise, each field is a nested DataFrame containing the full statistics for that comparison (i.e., the same as the corresponding entry of de.lists).
Only reported if flatten=FALSE
and (for full.stats=FALSE
) if effect.field
is not NULL
.
each.<OUTPUT>
:A nested DataFrame of comparison-specific statistics, named by replacing <OUTPUT>
with output.field
.
If full.stats=FALSE
, one column is present for every other cluster Y in clusters and contains the effect size of the comparison of X to Y.
Otherwise, each column contains another nested DataFrame containing the full set of statistics for that comparison.
Only reported if
flatten=FALSE
and (for full.stats=FALSE
) if effect.field
is not NULL
.
When pval.type="any"
, each DataFrame is sorted by the min-rank in the Top
column.
Taking all rows with Top
values less than or equal to T yields a marker set containing the top T
genes (ranked by significance) from each pairwise comparison.
This guarantees the inclusion of genes that can distinguish between any two clusters.
Also see
?computeMinRank
for more details on the rationale behind this metric.
For each gene and cluster, the summary effect size is defined as the effect size from the pairwise comparison with the lowest p-value. The combined p-value is computed by applying Simes' method to all p-values. Neither of these values is directly used for ranking; they are only reported for the sake of the user.
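Simes' method itself is compact enough to sketch in base R: sort the p-values, penalize each by n/rank, and take the minimum (illustrative only; the metapod implementation handles logged inputs and NA values):

```r
# Simes' combined p-value, as used for pval.type="any" (base-R sketch):
# sort the p-values and take the minimum of the rank-penalized values.
simes <- function(p) {
    p <- sort(p)
    min(length(p) * p / seq_along(p))
}

simes(c(0.01, 0.2, 0.9))  # penalized: 0.03, 0.30, 0.90 -> 0.03
```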
If pval.type="all"
, the null hypothesis is that the gene is not DE in all contrasts.
A combined p-value for each gene is computed using Berger's intersection union test (IUT).
Ranking based on the IUT p-value will focus on genes that are DE in that cluster compared to all other clusters.
This strategy is particularly effective when dealing with distinct clusters that have a unique expression profile.
In such cases, it yields a highly focused marker set that concisely captures the differences between clusters.
However, it can be too stringent if the cluster's separation is driven by combinations of gene expression.
For example, consider a situation involving four clusters expressing each combination of two marker genes A and B.
With pval.type="all"
, neither A nor B would be detected as a marker, as neither is uniquely DE in any one cluster.
This is especially detrimental with overclustering where an otherwise acceptable marker is discarded if it is not DE between two adjacent clusters.
For each gene and cluster, the summary effect size is defined as the effect size from the pairwise comparison with the largest p-value. This reflects the fact that, with this approach, a gene is only as significant as its weakest DE. Again, this value is not directly used for ranking and is only reported for the sake of the user.
The pval.type="some"
setting serves as a compromise between "all"
and "any"
.
A combined p-value is calculated by taking the middlemost value of the Holm-corrected p-values for each gene.
(By default, this is the median for odd numbers of contrasts and one-after-the-median for even numbers, but the exact proportion can be changed by setting min.prop
- see ?combineParallelPValues
.)
Here, the null hypothesis is that the gene is not DE in at least half of the contrasts.
Genes are then ranked by the combined p-value.
The aim is to provide a more focused marker set without being overly stringent, though obviously it loses the theoretical guarantees of the more extreme settings.
For example, there is no guarantee that the top set contains genes that can distinguish a cluster from any other cluster, which would have been possible with pval.type="any"
.
For each gene and cluster, the summary effect size is defined as the effect size from the pairwise comparison with the min.prop
-smallest p-value.
This mirrors the p-value calculation but, again, is reported only for the benefit of the user.
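The Holm-middle calculation can be sketched in base R using the built-in Holm correction; holm_middle is a hypothetical helper, and metapod's implementation additionally handles NA values and other edge cases:

```r
# The Holm-middle combination used for pval.type="some" (base-R sketch):
# Holm-adjust the p-values, then take the ceiling(min.prop * N)-th smallest.
holm_middle <- function(p, min.prop = 0.5) {
    adj <- p.adjust(p, method = "holm")
    sort(adj)[ceiling(min.prop * length(p))]
}

holm_middle(c(0.01, 0.02, 0.5, 0.8))  # Holm-adjusted: 0.04, 0.06, 1, 1 -> 0.06
```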
A slightly different flavor of the “some cluster” approach is achieved by setting pval.type="any"
with min.prop
set to some positive value in (0, 1).
A gene will only be high-ranked if it is among the top-ranked genes in at least min.prop
of the pairwise comparisons.
For example, if min.prop=0.3
, any gene with a value of Top
less than or equal to 5 will be in the top 5 DEGs of at least 30% of the comparisons.
This method increases the stringency of the "any"
setting in a safer manner than pval.type="some"
.
Specifically, we avoid comparing p-values across pairwise comparisons, which can be problematic if there are power differences across comparisons, e.g., due to differences in the number of cells across the other clusters.
Note that the value of min.prop
does not affect the combined p-value and summary effect size calculations for pval.type="any"
.
The BH method is then applied on the consolidated p-values across all genes to obtain the FDR
field.
The reported FDRs are intended only as a rough measure of significance.
Properly correcting for multiple testing is not generally possible when clusters
is determined from the same x
used for DE testing.
If log.p.out=TRUE
, log-transformed p-values and FDRs will be reported.
This may be useful in over-powered studies with many cells, where directly reporting the raw p-values would result in many zeroes due to the limits of machine precision.
Within each DataFrame, if sorted=TRUE
, genes are ranked by the Top
column if available and the p.value
(or log.p.value
) if not.
Otherwise, the input order of the genes is preserved.
For the DataFrame corresponding to cluster X, the <OUTPUT>.Y columns are sorted according to the order of cluster IDs in pairs[,2] for all rows where pairs[,1] is X.
In the output List, the DataFrames themselves are sorted according to the order of cluster IDs in pairs[,1]
.
Note that DataFrames are only created for clusters present in pairs[,1]
.
Clusters unique to pairs[,2]
will only be present within a DataFrame as the other cluster Y.
Aaron Lun
Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.
Berger RL and Hsu JC (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statist. Sci. 11, 283-319.
pairwiseTTests
and pairwiseWilcox
, for functions that can generate de.lists
and pairs
.
findMarkers
, which automatically performs combineMarkers
on the t-test or Wilcoxon test results.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=3)
clusters <- paste0("Cluster", kout$cluster)

out <- pairwiseTTests(logcounts(sce), groups=clusters)
comb <- combineMarkers(out$statistics, out$pairs)
comb[["Cluster1"]]

out <- pairwiseWilcox(logcounts(sce), groups=clusters)
comb <- combineMarkers(out$statistics, out$pairs, effect.field="AUC")
comb[["Cluster2"]]

out <- pairwiseBinom(logcounts(sce), groups=clusters)
comb <- combineMarkers(out$statistics, out$pairs)
comb[["Cluster3"]]
Combine p-values from independent or dependent hypothesis tests using a variety of meta-analysis methods.
This is deprecated in favor of combineParallelPValues
from the metapod package.
combinePValues(
  ...,
  method = c("fisher", "z", "simes", "berger", "holm-middle"),
  weights = NULL,
  log.p = FALSE,
  min.prop = 0.5
)
... |
Two or more numeric vectors of p-values of the same length. |
method |
A string specifying the combining strategy to use. |
weights |
A numeric vector of positive weights, with one value per vector in ... . Only used when method="z". |
log.p |
Logical scalar indicating whether the p-values in ... were log-transformed. |
min.prop |
Numeric scalar in [0, 1] specifying the minimum proportion of tests to reject for each set of p-values when method="holm-middle". |
This function will operate across elements of ...
in parallel to combine p-values.
That is, the set of first p-values from all vectors will be combined, followed by the second p-values and so on.
This is useful for combining p-values for each gene across different hypothesis tests.
Fisher's method, Stouffer's Z method and Simes' method test the global null hypothesis that all of the individual null hypotheses in the set are true. The global null is rejected if any of the individual nulls are rejected. However, each test has different characteristics:
Fisher's method requires independence of the test statistic. It is useful in asymmetric scenarios, i.e., when the null is only rejected in one of the tests in the set. Thus, a low p-value in any test is sufficient to obtain a low combined p-value.
Stouffer's Z method requires independence of the test statistic. It favours symmetric rejection and is less sensitive to a single low p-value, requiring more consistently low p-values to yield a low combined p-value. It can also accommodate weighting of the different p-values.
Simes' method technically requires independence but tends to be quite robust to dependencies between tests. See Sarkar and Chung (1997) for details, as well as work on the related Benjamini-Hochberg method. It favours asymmetric rejection and is less powerful than the other two methods under independence.
Berger's intersection-union test examines a different global null hypothesis - that at least one of the individual null hypotheses are true. Rejection in the IUT indicates that all of the individual nulls have been rejected. This is the statistically rigorous equivalent of a naive intersection operation.
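For reference, here are minimal base-R sketches of Fisher's method and the weighted Stouffer Z-method described above; the deprecated combinePValues and its metapod replacement additionally handle log-transformed inputs, NA values and other edge cases:

```r
# Base-R sketches of Fisher's method and the weighted Stouffer Z-method.
# Illustrative only: the real implementations also handle log-transformed
# inputs, NA values and other edge cases.
fisher_combine <- function(p) {
    # -2 * sum(log(p)) follows a chi-squared distribution with 2k d.f.
    pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}

stouffer_combine <- function(p, weights = rep(1, length(p))) {
    # Convert to Z-scores, take the weighted average, convert back.
    z <- qnorm(p, lower.tail = FALSE)
    pnorm(sum(weights * z) / sqrt(sum(weights^2)), lower.tail = FALSE)
}

fisher_combine(c(0.01, 0.5))   # dominated by the single low p-value
stouffer_combine(c(0.01, 0.5)) # demands more consistent evidence
```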
In the Holm-middle approach, the global null hypothesis is that more than 1 - min.prop
proportion of the individual nulls in the set are true.
We apply the Holm-Bonferroni correction to all p-values in the set and take the ceiling(min.prop * N)
-th smallest value where N
is the size of the set (excluding NA
values).
This method works correctly in the presence of correlations between p-values.
A numeric vector containing the combined p-values.
Aaron Lun
Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd (Edinburgh).
Whitlock MC (2005). Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18, 5:1368-73.
Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.
Berger RL and Hsu JC (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statist. Sci. 11, 283-319.
Sarkar SK and Chung CK (1997). The Simes method for multiple hypothesis testing with positively dependent test statistics. J. Am. Stat. Assoc. 92, 1601-1608.
p1 <- runif(10000)
p2 <- runif(10000)
p3 <- runif(10000)

fish <- combinePValues(p1, p2, p3)
hist(fish)

z <- combinePValues(p1, p2, p3, method="z", weights=1:3)
hist(z)

simes <- combinePValues(p1, p2, p3, method="simes")
hist(simes)

berger <- combinePValues(p1, p2, p3, method="berger")
hist(berger)
Combine the results of multiple variance decompositions, usually generated for the same genes across separate batches of cells.
combineVar(
  ...,
  method = "fisher",
  pval.field = "p.value",
  other.fields = NULL,
  equiweight = TRUE,
  ncells = NULL
)

combineCV2(
  ...,
  method = "fisher",
  pval.field = "p.value",
  other.fields = NULL,
  equiweight = TRUE,
  ncells = NULL
)
... |
Two or more DataFrames of variance modelling results.
Alternatively, one or more lists of DataFrames containing variance modelling results. Mixed inputs are also acceptable, e.g., lists of DataFrames alongside the DataFrames themselves. |
method |
String specifying how p-values are to be combined, see combineParallelPValues. |
pval.field |
A string specifying the column name of each element of ... that contains the p-value. |
other.fields |
A character vector specifying the fields containing other statistics to combine. |
equiweight |
Logical scalar indicating whether each result is to be given equal weight in the combined statistics. |
ncells |
Numeric vector containing the number of cells used to generate each element of ... . Only used if equiweight=FALSE. |
These functions are designed to merge results from separate calls to modelGeneVar
, modelGeneCV2
or related functions, where each result is usually computed for a different batch of cells.
Separate variance decompositions are necessary in cases where the mean-variance relationships vary across batches (e.g., different concentrations of spike-in have been added to the cells in each batch), which precludes the use of a common trend fit.
By combining these results into a single set of statistics, we can apply standard strategies for feature selection in multi-batch integrated analyses.
By default, statistics in other.fields
contain all common non-numeric fields that are not pval.field
or "FDR"
.
This usually includes "mean"
, "total"
, "bio"
(for combineVar
) or "ratio"
(for combineCV2
).
For combineVar
, statistics are combined by averaging them across all input DataFrames.
For combineCV2
, statistics are combined by taking the geometric mean across all inputs.
This difference between functions reflects the method by which the relevant measure of overdispersion is computed.
For example, "bio"
is computed by subtraction, so taking the average bio
remains consistent with subtraction of the total and technical averages.
Similarly, "ratio"
is computed by division, so the combined ratio
is consistent with division of the geometric means of the total and trend values.
If equiweight=FALSE
, each per-batch statistic is weighted by the number of cells used to compute it.
The number of cells can be explicitly set using ncells
, and is otherwise assumed to be equal for all batches.
No weighting is performed by default, which ensures that all batches contribute equally to the combined statistics and avoids situations where batches with many cells dominate the output.
The combineParallelPValues
function is used to combine p-values across batches.
The default is to use Fisher's method, which will achieve a low p-value if a gene is highly variable in any batch.
Only method="stouffer"
will perform any weighting of batches, and only if weights
is set.
A DataFrame with the same numeric fields as that produced by modelGeneVar
or modelGeneCV2
.
Each row corresponds to an input gene.
Each field contains the (weighted) arithmetic/geometric mean across all batches except for p.value
, which contains the combined p-value based on method
;
and FDR
, which contains the adjusted p-value using the BH method.
Aaron Lun
modelGeneVar
and modelGeneCV2
, for two possible inputs into this function.
combineParallelPValues
, for details on how the p-values are combined.
library(scuttle)
sce <- mockSCE()

y1 <- sce[,1:100]
y1 <- logNormCounts(y1) # normalize separately after subsetting.
results1 <- modelGeneVar(y1)

y2 <- sce[,1:100 + 100]
y2 <- logNormCounts(y2) # normalize separately after subsetting.
results2 <- modelGeneVar(y2)

head(combineVar(results1, results2))
head(combineVar(results1, results2, method="simes"))
head(combineVar(results1, results2, method="berger"))
Compute the minimum rank in a matrix of statistics, usually effect sizes from a set of differential comparisons.
computeMinRank(x, ties.method = "min", decreasing = TRUE)
x |
A matrix of statistics from multiple differential comparisons (columns) and genes (rows). |
ties.method |
String specifying how ties should be handled. |
decreasing |
Logical scalar indicating whether to obtain ranks for decreasing magnitude of values in x. |
For each gene, the minimum rank, a.k.a., “min-rank” is defined by ranking values within each column of x
, and then taking the minimum rank value across columns.
This is most useful when the columns of x
contain significance statistics or effect sizes from a single differential comparison, where larger values represent stronger differences.
In this setting, the min-rank represents the highest rank that each gene achieves in any comparison.
Taking all genes with min-ranks less than or equal to T yields the union of the top T
DE genes from all comparisons.
To illustrate, the set of genes with min-rank values of 1 will contain the top gene from each pairwise comparison to every other cluster.
If we instead take all genes with min-ranks less than or equal to, say, , the set will consist of the union of the top 5 genes from each pairwise comparison.
Multiple genes can have the same min-rank as different genes may have the same rank across different pairwise comparisons.
Conversely, the marker set may be smaller than the product of T and the number of other clusters, as the same gene may be shared across different comparisons.
In the context of marker detection with pairwise comparisons between groups of cells, sorting by the min-rank guarantees the inclusion of genes that can distinguish between any two groups. More specifically, this approach does not explicitly favour genes that are uniquely expressed in a cluster. Rather, it focuses on combinations of genes that - together - drive separation of a cluster from the others. This is more general and robust but tends to yield a less focused marker set compared to the other methods of ranking potential markers.
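The min-rank calculation is easy to sketch in base R. This is an illustrative reimplementation for exposition, not the package's actual code; `min_rank` is a hypothetical helper name.

```r
# Illustrative sketch of the min-rank: rank each column separately
# (largest value gets rank 1 when decreasing=TRUE), then take the
# row-wise minimum across comparisons.
min_rank <- function(x, decreasing = TRUE, ties.method = "min") {
  if (decreasing) x <- -x
  per.col <- apply(x, 2, rank, ties.method = ties.method)
  apply(per.col, 1, min)
}

stats <- matrix(c(3, 1, 2, 10, 30, 20), ncol = 2)
min_rank(stats)
```

Here the first gene is top-ranked in column 1 and the second gene is top-ranked in column 2, so both receive a min-rank of 1.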
A numeric vector containing the minimum (i.e., top) rank for each gene across all comparisons.
scoreMarkers
, where this function is used to compute one of the effect size summaries.
combineMarkers
, where the same principle is used for the Top
field.
# Get min-rank by log-FC:
lfcs <- matrix(rnorm(100), ncol=5)
computeMinRank(lfcs)

# Get min-rank by p-value:
pvals <- matrix(runif(100), ncol=5)
computeMinRank(pvals, decreasing=FALSE)
Scaling normalization of single-cell RNA-seq data by deconvolving size factors from cell pools. These functions have been moved to the scuttle package and are just retained here for compatibility.
computeSumFactors(...)

calculateSumFactors(...)
... |
Further arguments to pass to |
For calculateSumFactors
, a numeric vector of size factors returned by pooledSizeFactors
.
For computeSumFactors
, a SingleCellExperiment containing the size factors in its sizeFactors
,
as returned by computePooledFactors
.
Aaron Lun
Convert a SingleCellExperiment object into other classes for entry into other analysis pipelines.
convertTo(
    x,
    type = c("edgeR", "DESeq2", "monocle"),
    ...,
    assay.type = 1,
    subset.row = NULL
)
x |
A SingleCellExperiment object. |
type |
A string specifying the analysis for which the object should be prepared. |
... |
Other arguments to be passed to pipeline-specific constructors. |
assay.type |
A string specifying which assay of |
subset.row |
See |
This function converts a SingleCellExperiment object into various other classes in preparation for entry into other analysis pipelines, as specified by type
.
For type="edgeR"
, a DGEList object is returned containing the count matrix.
Size factors are converted to normalization factors.
Gene-specific rowData
is stored in the genes
element, and cell-specific colData
is stored in the samples
element.
For type="DESeq2"
, a DESeqDataSet object is returned containing the count matrix and size factors.
Additional gene- and cell-specific data is stored in the mcols
and colData
respectively.
Aaron Lun
DGEList
,
DESeqDataSetFromMatrix
for specific class constructors.
library(scuttle)
sce <- mockSCE()

# Adding some additional embellishments.
sizeFactors(sce) <- 2^rnorm(ncol(sce))
rowData(sce)$SYMBOL <- paste0("X", seq_len(nrow(sce)))
sce$other <- sample(LETTERS, ncol(sce), replace=TRUE)

# Converting to various objects.
convertTo(sce, type="edgeR")
convertTo(sce, type="DESeq2")
Compute per-gene correlation statistics by combining results from gene pair correlations.
correlateGenes(stats)
stats |
A DataFrame of pairwise correlation statistics, returned by |
For each gene, all of its pairs are identified and the corresponding p-values are combined using Simes' method. This tests whether the gene is involved in significant correlations to any other gene. Per-gene statistics are useful for identifying correlated genes without regard to what they are correlated with (e.g., during feature selection).
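Simes' method itself is simple to sketch in base R. This is an illustrative version assuming no NA p-values, not the package's implementation:

```r
# Simes' combination: sort the p-values and take the minimum of
# p_(i) * n / i over all i, which tests the global null that the
# gene is uncorrelated with every partner.
simes <- function(p) {
  n <- length(p)
  min(sort(p) * n / seq_len(n))
}

simes(c(0.01, 0.2, 0.5))
```

A single very small pairwise p-value is enough to drive the combined value down, which is why this usefully flags genes involved in any strong correlation.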
A DataFrame with one row per unique gene in stats
and containing the fields:
gene
:A field of the same type as stats$gene1
specifying the gene identity.
rho
:Numeric, the correlation with the largest magnitude across all gene pairs involving the corresponding gene.
p.value
:Numeric, the Simes p-value for this gene.
FDR
:Numeric, the adjusted p.value
across all rows.
Aaron Lun
Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.
correlatePairs
, to compute stats
.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

pairs <- correlatePairs(sce, iters=1e5, subset.row=1:100)
g.out <- correlateGenes(pairs)
head(g.out)
Build a distribution of correlations under the null hypothesis of independent expression between pairs of genes.
This is now deprecated as correlatePairs
uses an approximation instead.
correlateNull(
    ncells,
    iters = 1e+06,
    block = NULL,
    design = NULL,
    equiweight = TRUE,
    BPPARAM = SerialParam()
)
ncells |
An integer scalar indicating the number of cells in the data set. |
iters |
An integer scalar specifying the number of values in the null distribution. |
block |
A factor specifying the blocking level for each cell. |
design |
A numeric design matrix containing uninteresting factors to be ignored. |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight.
Otherwise, each block is weighted according to its number of cells.
Only used if |
BPPARAM |
A BiocParallelParam object that specifies the manner of parallel processing to use. |
The correlateNull
function constructs an empirical null distribution for Spearman's rank correlation when it is computed with ncells
cells.
This is done by shuffling the ranks, calculating the correlation and repeating until iters
values are obtained.
No consideration is given to tied ranks, which has implications for the accuracy of p-values in correlatePairs
.
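The core of the shuffling scheme can be sketched in base R; the real function adds blocking, design matrices and parallelization, but the underlying idea is just:

```r
# Build an empirical null for Spearman's rho with `ncells` observations
# by repeatedly correlating a shuffled ranking against a fixed one.
set.seed(0)
ncells <- 100
iters <- 1000
null.dist <- sort(replicate(
  iters,
  cor(sample(ncells), seq_len(ncells), method = "spearman")
))
summary(null.dist)
```

The resulting sorted vector mirrors the return value described below, centred near zero with spread shrinking as `ncells` grows.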
If block
is specified, a null correlation is created within each level of block
using the shuffled ranks.
The final correlation is then defined as the average of the per-level correlations,
weighted by the number of cells in that level if equiweight=FALSE
.
Levels with fewer than 3 cells are ignored, and if no level has 3 or more cells, all returned correlations will be NA
.
If design
is specified, the same process is performed on ranks derived from simulated residuals computed by fitting the linear model to a vector of normally distributed values.
If there are not at least 3 residual d.f., all returned correlations will be NA
.
The design
argument cannot be used at the same time as block
.
A numeric vector of length iters
is returned containing the sorted correlations under the null hypothesis of no correlations.
Aaron Lun
correlatePairs
, where the null distribution is used to compute p-values.
set.seed(0)
ncells <- 100

# Simplest case:
null.dist <- correlateNull(ncells, iters=10000)
hist(null.dist)

# With a blocking factor:
block <- sample(LETTERS[1:3], ncells, replace=TRUE)
null.dist <- correlateNull(block=block, iters=10000)
hist(null.dist)

# With a design matrix.
cov <- runif(ncells)
X <- model.matrix(~cov)
null.dist <- correlateNull(design=X, iters=10000)
hist(null.dist)
Identify pairs of genes that are significantly correlated in their expression profiles, based on Spearman's rank correlation.
correlatePairs(x, ...)

## S4 method for signature 'ANY'
correlatePairs(
    x,
    null.dist = NULL,
    ties.method = NULL,
    iters = NULL,
    block = NULL,
    design = NULL,
    equiweight = TRUE,
    use.names = TRUE,
    subset.row = NULL,
    pairings = NULL,
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
correlatePairs(x, ..., assay.type = "logcounts")
x |
A numeric matrix-like object of log-normalized expression values, where rows are genes and columns are cells. Alternatively, a SummarizedExperiment object containing such a matrix. |
... |
For the generic, additional arguments to pass to specific methods. For the SummarizedExperiment method, additional methods to pass to the ANY method. |
null.dist , ties.method , iters
|
Deprecated arguments, ignored. |
block |
A factor specifying the blocking level for each cell in |
design |
A numeric design matrix containing uninteresting factors to be ignored. |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight.
Otherwise, each block is weighted according to its number of cells.
Only used if |
use.names |
A logical scalar specifying whether the row names of |
subset.row |
See |
pairings |
A |
BPPARAM |
A BiocParallelParam object that specifies the manner of parallel processing to use. |
assay.type |
A string specifying which assay values to use. |
The correlatePairs
function identifies significant correlations between all pairs of genes in x
.
This allows prioritization of genes that are driving systematic substructure in the data set.
By definition, such genes should be correlated as they are behaving in the same manner across cells.
In contrast, genes driven by random noise should not exhibit any correlations with other genes.
We use Spearman's rho to quantify correlations robustly based on ranked gene expression.
To identify correlated gene pairs, the significance of non-zero correlations is assessed using rhoToPValue
.
The null hypothesis is that the ranking of normalized expression across cells should be independent between genes.
Correction for multiple testing is done using the BH method.
For the SummarizedExperiment method, normalized expression values should be specified by assay.type
.
A DataFrame is returned with one row per gene pair and the following fields:
gene1, gene2
:Character or integer fields specifying the genes in the pair.
If use.names=FALSE
, integers are returned representing row indices of x
, otherwise gene names are returned.
rho
:A numeric field containing the approximate Spearman's rho.
p.value, FDR
:Numeric fields containing the approximate p-value and its BH-corrected equivalent.
Rows are sorted by increasing p.value
and, if tied, decreasing absolute size of rho
.
The exception is if subset.row
is a matrix, in which case each row in the DataFrame corresponds to a row of subset.row
.
If the experiment has known (and uninteresting) factors of variation, these can be included in design
or block
.
correlatePairs
will then attempt to ensure that these factors do not drive strong correlations between genes.
Examples might be to block on batch effects or cell cycle phase, which may have substantial but uninteresting effects on expression.
The approach used to remove these factors depends on whether design
or block
is used.
If there is only one factor, e.g., for plate or animal of origin, block
should be used.
Each level of the factor is defined as a separate group of cells.
For each pair of genes, correlations are computed within each group, and a mean of the correlations is taken across all groups.
If equiweight=FALSE
, a weighted mean is computed based on the size of each group.
Similarly, parallelStouffer
is used to combine the (one-sided) p-values across all groups.
This is done for each direction and a final p-value is computed for each gene pair using the Bonferroni method.
The idea is to ensure that the final p-value is only low when correlations are in the same direction across groups.
If equiweight=FALSE
, each p-value is weighted by the size of the corresponding group.
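A simplified sketch of the blocked correlation in base R (ignoring the p-value machinery and the handling of blocks with fewer than 3 cells; `blocked_rho` is a hypothetical helper name):

```r
# Compute Spearman's rho within each block, then average across blocks;
# with equiweight=FALSE, blocks are weighted by their number of cells.
blocked_rho <- function(x, y, block, equiweight = TRUE) {
  groups <- split(seq_along(x), block)
  per.block <- vapply(groups, function(i)
    cor(x[i], y[i], method = "spearman"), numeric(1))
  w <- if (equiweight) rep(1, length(groups)) else lengths(groups)
  weighted.mean(per.block, w)
}

blocked_rho(1:6, c(2, 4, 6, 1, 3, 5), rep(c("a", "b"), each = 3))
```

Note how the two genes here are perfectly correlated within each block even though a naive correlation across all six cells would be weaker, which is exactly the batch effect that blocking is meant to absorb.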
For experiments containing multiple factors or covariates, a design matrix should be passed into design
.
The correlation between each pair of genes is then computed from the residuals of the fitted model.
However, we recommend using block
wherever possible as design
assumes normality of errors and deals poorly with ties.
Specifically, zero counts within or across groups may no longer be tied when converted to residuals, potentially resulting in spuriously large correlations.
If any level of block
has fewer than 3 cells, it is ignored.
If all levels of block
have fewer than 3 cells, all output statistics are set to NA
.
Similarly, if design
has fewer than 3 residual d.f., all output statistics are set to NA
.
The pairings
argument specifies the pairs of genes that should be used to compute correlations.
This can be:
NULL
, in which case correlations will be computed between all pairs of genes in x
.
Genes that occur earlier in x
are labelled as gene1
in the output DataFrame.
Redundant permutations are not reported.
A list of two vectors, where each list element defines a subset of genes in x
as an integer, character or logical vector.
In this case, correlations will be computed between one gene in the first vector and another gene in the second vector.
This improves efficiency if the only correlations of interest are those between two pre-defined sets of genes.
Genes in the first vector are always reported as gene1
.
An integer/character matrix of two columns.
In this case, each row is assumed to specify a gene pair based on the row indices (integer) or row names (character) of x
.
Correlations will then be computed for only those gene pairs, and the returned dataframe will not be sorted by p-value.
Genes in the first column of the matrix are always reported as gene1
.
If subset.row
is not NULL
, only the genes in the selected subset are used to compute correlations - see ?"scran-gene-selection"
.
This will interact properly with pairings
, such that genes in pairings
and not in subset.row
will be ignored.
We recommend setting subset.row
and/or pairings
to contain only the subset of genes of interest.
This reduces computational time and memory usage by only computing statistics for the gene pairs of interest.
For example, we could select only HVGs to focus on genes contributing to cell-to-cell heterogeneity (and thus more likely to be involved in driving substructure).
There is no need to account for HVG pre-selection in multiple testing, because rank correlations are unaffected by the variance.
Aaron Lun
Lun ATL (2019). Some thoughts on testing for correlations. https://ltla.github.io/SingleCellThoughts/software/correlations/corsim.html
Compare to cor
for the standard Spearman's calculation.
Use correlateGenes
to get per-gene correlation statistics.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Basic pairwise application.
out <- correlatePairs(sce, subset.row=1:100)
head(out)

# Computing between specific subsets of genes:
out <- correlatePairs(sce, pairings=list(1:10, 110:120))
head(out)

# Computing between specific pairs:
out <- correlatePairs(sce, pairings=rbind(c(1,10), c(2, 50)))
head(out)
Classify single cells into their cell cycle phases based on gene expression data.
cyclone(x, ...)

## S4 method for signature 'ANY'
cyclone(
    x,
    pairs,
    gene.names = rownames(x),
    iter = 1000,
    min.iter = 100,
    min.pairs = 50,
    BPPARAM = SerialParam(),
    verbose = FALSE,
    subset.row = NULL
)

## S4 method for signature 'SummarizedExperiment'
cyclone(x, ..., assay.type = "counts")
x |
A numeric matrix-like object of gene expression values where rows are genes and columns are cells. Alternatively, a SummarizedExperiment object containing such a matrix. |
... |
For the generic, additional arguments to pass to specific methods. For the SummarizedExperiment method, additional arguments to pass to the ANY method. |
pairs |
A list of data.frames produced by |
gene.names |
A character vector of gene names, with one value per row in |
iter |
An integer scalar specifying the number of iterations for random sampling to obtain a cycle score. |
min.iter |
An integer scalar specifying the minimum number of iterations for score estimation. |
min.pairs |
An integer scalar specifying the minimum number of pairs for cycle estimation. |
BPPARAM |
A BiocParallelParam object to use for parallel processing across cells. |
verbose |
A logical scalar specifying whether diagnostics should be printed to screen. |
subset.row |
See |
assay.type |
A string specifying which assay values to use, e.g., |
This function implements the classification step of the pair-based prediction method described by Scialdone et al. (2015).
To illustrate, consider classification of cells into G1 phase.
Pairs of marker genes are identified with sandbag
, where the expression of the first gene in the training data is greater than the second in G1 phase but less than the second in all other phases.
For each cell, cyclone
calculates the proportion of all marker pairs where the expression of the first gene is greater than the second in the new data x
(pairs with the same expression are ignored).
A high proportion suggests that the cell is likely to belong in G1 phase, as the expression ranking in the new data is consistent with that in the training data.
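The per-cell proportion can be sketched in base R with hypothetical marker pairs (`pair_proportion` and the gene names are illustrative, not part of a real classifier):

```r
# Proportion of marker pairs where the first gene's expression exceeds
# the second gene's in a single cell; tied pairs are ignored.
pair_proportion <- function(expr, first, second) {
  d <- expr[first] - expr[second]
  d <- d[d != 0]  # drop pairs with equal expression
  if (length(d) == 0) return(NA_real_)
  mean(d > 0)
}

cell <- c(GeneA = 5, GeneB = 1, GeneC = 3, GeneD = 3)
pair_proportion(cell, first = c("GeneA", "GeneC"), second = c("GeneB", "GeneD"))
```

Here the (GeneC, GeneD) pair is tied and dropped, so the proportion is computed from the single informative pair.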
Proportions are not directly comparable between phases due to the use of different sets of gene pairs for each phase.
Instead, proportions are converted into scores (see below) that account for the size and precision of the proportion estimate.
The same process is repeated for all phases, using the corresponding set of marker pairs in pairs
.
Cells with G1 or G2M scores above 0.5 are assigned to the G1 or G2M phases, respectively.
(If both are above 0.5, the higher score is used for assignment.)
Cells can be assigned to S phase based on the S score, but a more reliable approach is to define S phase cells as those with G1 and G2M scores below 0.5.
Pre-trained classifiers are provided for mouse and human datasets, see ?sandbag
for more details.
However, note that the classifier may not be accurate for data that are substantially different from those used in the training set, e.g., due to the use of a different protocol.
In such cases, users can construct a custom classifier from their own training data using the sandbag
function.
This is usually necessary for other model organisms where pre-trained classifiers are not available.
Users should not filter out low-abundance genes before applying cyclone
.
Even if a gene is not expressed in any cell, it may still be useful for classification if it is phase-specific.
Its lack of expression relative to other genes will still yield informative pairs, and filtering them out would reduce power.
A list is returned containing:
phases
:A character vector containing the predicted phase for each cell.
scores
:A data frame containing the numeric phase scores for each phase and cell (i.e., each row is a cell).
normalized.scores
:A data frame containing the row-normalized scores (i.e., where the row sum for each cell is equal to 1).
To make the proportions comparable between phases, a distribution of proportions is constructed by shuffling the expression values within each cell and recalculating the proportion. The phase score is defined as the lower tail probability at the observed proportion. High scores indicate that the proportion is greater than what is expected by chance if the expression of marker genes were independent (i.e., with no cycle-induced correlations between marker pairs within each cell).
By default, shuffling is performed iter
times to obtain the distribution from which the score is estimated.
However, some iterations may not be used if there are fewer than min.pairs
pairs with different expression, such that the proportion cannot be calculated precisely.
A score is only returned if the distribution is large enough for stable calculation of the tail probability, i.e., consists of results from at least min.iter
iterations.
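As a sketch, the score for one phase is just the empirical lower-tail probability of the observed proportion against the shuffled proportions (`phase_score` is a hypothetical helper for illustration):

```r
# Fraction of null (shuffled) proportions at or below the observed one;
# values near 1 indicate a proportion higher than expected by chance.
phase_score <- function(observed, null.props) {
  mean(null.props <= observed)
}

phase_score(0.8, c(0.40, 0.55, 0.60, 0.90))
```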
Note that the score calculation in cyclone
is slightly different from that described originally by Scialdone et al.
The original code shuffles all expression values within each cell, while in this implementation, only the expression values of genes in the marker pairs are shuffled.
This modification aims to use the most relevant expression values to build the null score distribution.
Antonio Scialdone, with modifications by Aaron Lun
Scialdone A, Natarajana KN, Saraiva LR et al. (2015). Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85:54–61
sandbag
, to generate the pairs from reference data.
set.seed(1000)
library(scuttle)
sce <- mockSCE(ncells=200, ngenes=1000)

# Constructing a classifier:
is.G1 <- which(sce$Cell_Cycle %in% c("G1", "G0"))
is.S <- which(sce$Cell_Cycle=="S")
is.G2M <- which(sce$Cell_Cycle=="G2M")
out <- sandbag(sce, list(G1=is.G1, S=is.S, G2M=is.G2M))

# Classifying a new dataset:
test <- mockSCE(ncells=50)
assignments <- cyclone(test, out)
head(assignments$scores)
table(assignments$phases)
Decide which tests (i.e., genes) are significant for differential expression between conditions in each label,
using the output of pseudoBulkDGE
.
This mimics the decideTests
functionality from limma.
decideTestsPerLabel(
    results,
    method = c("separate", "global"),
    threshold = 0.05,
    pval.field = NULL,
    lfc.field = "logFC"
)

summarizeTestsPerLabel(results, ...)
results |
A List containing the output of For |
method |
String specifying whether the Benjamini-Hochberg correction should be applied across all clusters or separately within each label. |
threshold |
Numeric scalar specifying the FDR threshold to consider genes as significant. |
pval.field |
String containing the name of the column containing the p-value in each entry of |
lfc.field |
String containing the name of the column containing the log-fold change.
Ignored if the column is not available. Defaults to |
... |
Further arguments to pass to |
If a log-fold change field is available and specified in lfc.field
, values of 1
, -1
and 0
indicate that the gene is significantly upregulated, downregulated or not significant, respectively.
Note, the interpretation of “up” and “down” depends on the design and contrast in pseudoBulkDGE
.
Otherwise, if no log-fold change is available or if lfc.field=NULL
,
values of 1
or 0
indicate that a gene is significantly DE or not, respectively.
NA
values indicate either that the relevant gene was low-abundance for a particular label and filtered out,
or that the DE comparison for that label was not possible (e.g., no residual d.f.).
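The coding rule can be sketched for a single gene and label (`code_gene` is a hypothetical helper, not the package's implementation):

```r
# Map an FDR and log-fold change to the -1/0/1/NA codes described above.
code_gene <- function(fdr, lfc, threshold = 0.05) {
  if (is.na(fdr)) return(NA_integer_)  # filtered out or untestable label
  if (fdr > threshold) return(0L)      # not significant
  if (lfc > 0) 1L else -1L             # direction of the significant change
}

code_gene(0.01, 2.5)  # significantly upregulated
```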
For decideTestsPerLabel
,
an integer matrix indicating whether each gene (row) is significantly DE between conditions for each label (column).
For summarizeTestsPerLabel
,
an integer matrix containing the number of genes of each DE status (column) in each label (row).
Aaron Lun
pseudoBulkDGE
, which generates the input to this function.
decideTests
, which inspired this function.
example(pseudoBulkDGE)
head(decideTestsPerLabel(out))
summarizeTestsPerLabel(out)
Functions that have passed on to the function afterlife. Their successors are also listed.
trendVar(...)
decomposeVar(...)
testVar(...)
improvedCV2(...)
technicalCV2(...)
makeTechTrend(...)
multiBlockVar(...)
multiBlockNorm(...)
overlapExprs(...)
parallelPCA(...)
bootstrapCluster(...)
clusterModularity(...)
clusterPurity(...)
clusterKNNGraph(...)
clusterSNNGraph(...)
coassignProb(...)
createClusterMST(...)
connectClusterMST(...)
orderClusterMST(...)
quickPseudotime(...)
testPseudotime(...)
doubletCells(...)
doubletCluster(...)
doubletRecovery(...)
... |
Ignored arguments. |
All functions error out with a defunct message pointing towards their successors (if available).
trendVar
, decomposeVar
and testVar
are succeeded by a suite of functions related to modelGeneVar
and fitTrendVar
.
improvedCV2
and technicalCV2
are succeeded by modelGeneCV2
and fitTrendCV2
.
makeTechTrend
is succeeded by modelGeneVarByPoisson
.
multiBlockVar
is succeeded by the block
argument in many of the modelling functions, and multiBlockNorm
is no longer necessary.
bootstrapCluster
has been moved over to the bluster package, as the bootstrapStability
function.
neighborsToSNNGraph
and neighborsToKNNGraph
have been moved over to the bluster package.
clusterModularity
has been moved over to the bluster package, as the pairwiseModularity
function.
clusterPurity
has been moved over to the bluster package, as the neighborPurity
function.
clusterSNNGraph
and clusterKNNGraph
have been replaced by clusterRows
with NNGraphParam or TwoStepParam from the bluster package.
coassignProb
and clusterRand
have been replaced by pairwiseRand
from the bluster package.
createClusterMST
, quickPseudotime
and testPseudotime
have been moved over to the TSCAN package.
connectClusterMST
has been moved over to the TSCAN package, as the reportEdges
function.
orderClusterMST
has been moved over to the TSCAN package, as the orderCells
function.
doubletCells
has been moved over to the scDblFinder package, as the computeDoubletDensity
function.
doubletCluster
has been moved over to the scDblFinder package, as the findDoubletClusters
function.
doubletRecovery
has been moved over to the scDblFinder package, as the recoverDoublets
function.
overlapExprs
is succeeded by findMarkers
with test.type="wilcox"
.
parallelPCA
has been moved over to the PCAtools package.
Aaron Lun
try(trendVar())
Denoise log-expression data by removing principal components corresponding to technical noise.
getDenoisedPCs(x, ...)

## S4 method for signature 'ANY'
getDenoisedPCs(
    x,
    technical,
    subset.row,
    min.rank = 5,
    max.rank = 50,
    fill.missing = FALSE,
    BSPARAM = bsparam(),
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
getDenoisedPCs(x, ..., assay.type = "logcounts")

denoisePCA(
    x,
    ...,
    value = c("pca", "lowrank"),
    preserve.shape = TRUE,
    assay.type = "logcounts",
    name = NULL
)

denoisePCANumber(var.exp, var.tech, var.total)
x |
For For |
... |
For the For the |
technical |
An object containing the technical components of variation for each gene in
|
subset.row |
A logical, character or integer vector specifying the rows of |
min.rank , max.rank
|
Integer scalars specifying the minimum and maximum number of PCs to retain. |
fill.missing |
Logical scalar indicating whether entries in the rotation matrix should be imputed for genes that were not used in the PCA.
Only relevant if |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA. |
BPPARAM |
A BiocParallelParam object to use for parallel processing. |
assay.type |
A string specifying which assay values to use. |
value |
String specifying the type of value to return.
|
preserve.shape |
Logical scalar indicating whether or not the output SingleCellExperiment should be subsetted to |
name |
String containing the name with which to store the results.
Defaults to |
var.exp |
A numeric vector of the variances explained by successive PCs, starting from the first (but not necessarily containing all PCs). |
var.tech |
A numeric scalar containing the variance attributable to technical noise. |
var.total |
A numeric scalar containing the total variance in the data. |
This function performs a principal components analysis to eliminate random technical noise in the data. Random noise is uncorrelated across genes and should be captured by later PCs, as the variance in the data explained by any single gene is low. In contrast, biological processes should be captured by earlier PCs as more variance can be explained by the correlated behavior of sets of genes in a particular pathway. The idea is to discard later PCs to remove noise and improve resolution of population structure. This also has the benefit of reducing computational work for downstream steps.
The choice of the number of PCs to discard is based on the estimates of technical variance in technical
.
This argument accepts a number of different values, depending on how the technical noise is calculated - this generally involves functions such as modelGeneVarWithSpikes
or modelGeneVarByPoisson
.
The percentage of variance explained by technical noise is estimated by summing the technical components across genes and dividing by the summed total variance.
Genes with negative biological components are ignored during downstream analyses to ensure that the total variance is greater than the overall technical estimate.
Now, consider the retention of the first d PCs.
For a given value of d, we compute the variance explained by all of the later PCs.
We aim to find the smallest value of d such that the sum of variances explained by the later PCs is still less than the variance attributable to technical noise.
This choice of d represents a lower bound on the number of PCs that can be retained before biological variation is definitely lost.
We use this value to obtain a “reasonable” dimensionality for the PCA output.
Note that d will be coerced to lie between
min.rank
and max.rank
.
This mitigates the effect of occasional extreme results when the percentage of noise is very high or low.
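The selection rule above can be sketched in language-agnostic terms. The following Python function is a hypothetical illustration only (not scran's implementation; all names here are assumptions): it finds the smallest number of PCs whose later PCs explain less variance than the technical estimate, then coerces the result into the allowed range.

```python
def choose_rank(var_exp, var_total, var_tech, min_rank=5, max_rank=50):
    """Pick the smallest number of PCs d such that the variance explained
    by all PCs after the first d is below the technical variance."""
    running = var_total
    chosen = max_rank  # fall back to the maximum if the threshold is never met
    for d, v in enumerate(var_exp, start=1):
        running -= v  # variance left to the PCs after the first d
        if running < var_tech:
            chosen = d
            break
    # Coerce into [min_rank, max_rank], mirroring the behaviour described above.
    return max(min_rank, min(max_rank, chosen))
```

For example, with per-PC variances of 5, 3, 1, 0.5 and 0.5 (total 10) and a technical variance of 2.5, the first two PCs already leave less residual variance than the technical estimate, so d = 2 before clamping.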
For getDenoisedPCs
, a list is returned containing:
components
, a numeric matrix containing the selected PCs (columns) for all cells (rows).
This has number of columns between min.rank
and max.rank
inclusive.
rotation
, a numeric matrix containing rotation vectors (columns) for some or all genes (rows).
This has number of columns between min.rank
and max.rank
inclusive.
var.explained
, a numeric vector containing the variance explained by the first max.rank
PCs.
percent.var
, a numeric vector containing the percentage of variance explained by the first max.rank
PCs.
Note that this may not sum to 100% if max.rank
is smaller than the total number of PCs.
used.rows
, an integer vector specifying the rows of x
that were used in the PCA.
denoisePCA
will return a modified x
with:
the PC results stored in the reducedDims
as a "PCA"
entry, if value="pca"
.
a low-rank approximation as a new "lowrank"
assay, if value="lowrank"
.
This is represented as a LowRankMatrix.
denoisePCANumber
will return an integer scalar specifying the number of PCs to retain.
This is equivalent to the output from getDenoisedPCs
after setting value="n"
, but ignoring any setting of min.rank
or max.rank
.
We can use subset.row
to perform the PCA on a subset of genes of interest.
This is typically used to subset to HVGs to reduce computational time and increase the signal-to-noise ratio of downstream analyses.
Note that only rows with positive components are actually used in the PCA, even if we explicitly specified them in subset.row
.
The final set of genes used in the PCA is returned in used.rows
.
If fill.missing=TRUE
, entries of the rotation matrix are imputed for all genes in x
.
This includes “unselected” genes, i.e., with negative biological components or that were not selected with subset.row
.
Rotation vectors are extrapolated to these genes by projecting their expression profiles into the low-dimensional space defined by the SVD on the selected genes.
This is useful for guaranteeing that any low-rank approximation has the same dimensions as the input x
.
For example, calling denoisePCA
with preserve.shape=TRUE
will use fill.missing=TRUE
internally,
which guarantees that any value="lowrank"
setting will be of the same dimensions as the input x
.
Otherwise, if fill.missing=FALSE
and preserve.shape=FALSE
, the output is exactly the same as if the function had been run on x[subset.row,]
.
The function's choice of d is only optimal if the early PCs capture all the biological variation with minimal noise.
This is unlikely to be true as the PCA cannot distinguish between technical noise and weak biological signal in the later PCs.
In practice, the chosen d
can only be treated as a lower bound for the retention of signal, and it is debatable whether this has any particular relation to the “best” choice of the number of PCs.
For example, many aspects of biological variation are not that interesting (e.g., transcriptional bursting, metabolic fluctuations) and it is often the case that we do not need to retain this signal, in which case the chosen d
- despite being a lower bound - may actually be higher than necessary.
Interpretation of the choice of d is even more complex if
technical
was generated with modelGeneVar
rather than modelGeneVarWithSpikes
or modelGeneVarByPoisson
.
The former includes “uninteresting” biological variation in its technical component estimates, increasing the proportion of variance attributed to technical noise and yielding a lower value of d.
Indeed, use of results from
modelGeneVar
often results in d being set to
min.rank
, which can be problematic if secondary factors of biological variation are discarded.
Aaron Lun
Lun ATL (2018). Discussion of PC selection methods for scRNA-seq data. https://github.com/LTLA/PCSelection2018
modelGeneVarWithSpikes
and modelGeneVarByPoisson
, for methods of computing technical components.
runSVD
, for the underlying SVD algorithm(s).
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Modelling the variance:
var.stats <- modelGeneVar(sce)
hvgs <- getTopHVGs(var.stats, n=5000)

# Denoising:
pcs <- getDenoisedPCs(sce, technical=var.stats)
head(pcs$components)
head(pcs$rotation)
head(pcs$percent.var)

# Automatically storing the results.
sce <- denoisePCA(sce, technical=var.stats, subset.row=hvgs)
reducedDimNames(sce)
Compute the distance-to-median statistic for the CV2 residuals of all genes
DM(mean, cv2, win.size=51)
mean |
A numeric vector of average counts for each gene. |
cv2 |
A numeric vector of squared coefficients of variation for each gene. |
win.size |
An integer scalar specifying the window size for median-based smoothing. This should be odd or will be incremented by 1. |
This function will compute the distance-to-median (DM) statistic described by Kolodziejczyk et al. (2015).
Briefly, a median-based trend is fitted to the log-transformed cv2
against the log-transformed mean
using runmed
.
The DM is defined as the residual from the trend for each gene.
This statistic is a measure of the relative variability of each gene, after accounting for the empirical mean-variance relationship.
Highly variable genes can then be identified as those with high DM values.
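As a rough illustration of the DM computation, here is a from-scratch Python sketch using a naive windowed median in place of R's runmed(); the function name and details are assumptions, not scran's code.

```python
import math
from statistics import median

def dm_statistic(means, cv2, win_size=51):
    """Residual of each gene's log2 CV2 from a windowed-median trend
    fitted against the mean (a naive stand-in for runmed())."""
    n = len(means)
    order = sorted(range(n), key=lambda i: means[i])  # rank genes by mean
    log_cv2 = [math.log2(c) for c in cv2]
    half = win_size // 2
    dm = [0.0] * n
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - half), min(n, rank + half + 1)
        trend = median(log_cv2[order[k]] for k in range(lo, hi))
        dm[i] = log_cv2[i] - trend  # distance to the median trend
    return dm
```

A gene whose CV2 sits well above the trend of its abundance neighbours gets a large positive DM, which is the signal used to flag highly variable genes.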
A numeric vector of DM statistics for all genes.
Jong Kyoung Kim, with modifications by Aaron Lun
Kolodziejczyk AA, Kim JK, Tsang JCH et al. (2015). Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17(4), 471–85.
# Mocking up some data
ngenes <- 1000
ncells <- 100
gene.means <- 2^runif(ngenes, 0, 10)
dispersions <- 1/gene.means + 0.2
counts <- matrix(rnbinom(ngenes*ncells, mu=gene.means, size=1/dispersions), nrow=ngenes)

# Computing the DM.
means <- rowMeans(counts)
cv2 <- apply(counts, 1, var)/means^2
dm.stat <- DM(means, cv2)
head(dm.stat)
Find candidate marker genes for groups of cells (e.g., clusters) by testing for differential expression between pairs of groups.
findMarkers(x, ...)

## S4 method for signature 'ANY'
findMarkers(
  x,
  groups,
  test.type = c("t", "wilcox", "binom"),
  ...,
  pval.type = c("any", "some", "all"),
  min.prop = NULL,
  log.p = FALSE,
  full.stats = FALSE,
  sorted = TRUE,
  row.data = NULL,
  add.summary = FALSE,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
findMarkers(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
findMarkers(x, groups = colLabels(x, onAbsence = "error"), ...)
x |
A numeric matrix-like object of expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. This is expected to be normalized log-expression values for most tests - see Details. Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the ANY method:
Common arguments for all testing functions include For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
groups |
A vector of length equal to |
test.type |
String specifying the type of pairwise test to perform -
a t-test with |
pval.type |
A string specifying how p-values are to be combined across pairwise comparisons for a given group/cluster. |
min.prop |
Numeric scalar specifying the minimum proportion of significant comparisons per gene.
Defaults to 0.5 when |
log.p |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
full.stats |
A logical scalar indicating whether all statistics in |
sorted |
Logical scalar indicating whether each output DataFrame should be sorted by a statistic relevant to |
row.data |
A DataFrame containing additional row metadata for each gene in Alternatively, a list containing one such DataFrame per level of |
add.summary |
Logical scalar indicating whether statistics from |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
assay.type |
A string specifying which assay values to use, usually |
This function provides a convenience wrapper for marker gene identification between groups of cells,
based on running pairwiseTTests
or related functions and passing the result to combineMarkers
.
All of the arguments above are supplied directly to one of these two functions -
refer to the relevant function's documentation for more details.
If x
contains log-normalized expression values generated with a pseudo-count of 1,
it can be used in any of the pairwise testing procedures.
If x
is scale-normalized but not log-transformed, it can be used with test.type="wilcox"
and test.type="binom"
.
If x
contains raw counts, it can only be used with test.type="binom"
.
Note that log.p
only affects the combined p-values and FDRs.
If full.stats=TRUE
, the p-values for each individual pairwise comparison will always be log-transformed,
regardless of the value of log.p
.
Log-transformed p-values and FDRs are reported using the natural base.
The choice of pval.type
determines whether the highly ranked genes are those that are DE between the current group and:
any other group ("any"
)
all other groups ("all"
)
some other groups ("some"
)
See ?combineMarkers
for more details.
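For pval.type="any", combineMarkers combines the pairwise p-values for each gene with Simes' method (to the best of my understanding of scran's behaviour); a minimal sketch of that combination step:

```python
def simes(pvals):
    # Simes' combined p-value: min over k of n * p_(k) / k,
    # where p_(k) is the k-th smallest of the n pairwise p-values.
    n = len(pvals)
    return min(n * p / (k + 1) for k, p in enumerate(sorted(pvals)))
```

This only needs one pairwise comparison to be strongly significant for the combined p-value to be small, which is what makes "any" suitable for finding genes that distinguish a group from at least one other group.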
A named list of DataFrames, each of which contains a sorted marker gene list for the corresponding group.
In each DataFrame, the top genes are chosen to enable separation of that group from all other groups.
See ?combineMarkers
for more details on the output format.
If row.data
is provided, the additional fields are added to the front of the DataFrame for each cluster.
If add.summary=TRUE
, extra statistics for each cluster are also computed and added.
Any log-fold changes are reported as differences in average x
between groups
(usually in base 2, depending on the transformation applied to x
).
Aaron Lun
pairwiseTTests
,
pairwiseWilcox
,
pairwiseBinom
,
for the underlying functions that compute the pairwise DE statistics.
combineMarkers
, to combine pairwise statistics into a single marker list per cluster.
summaryMarkerStats
, to incorporate additional summary statistics per cluster.
getMarkerEffects
, to easily extract a matrix of effect sizes from each DataFrame.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay, only using k-means for convenience.
kout <- kmeans(t(logcounts(sce)), centers=4)

out <- findMarkers(sce, groups=kout$cluster)
names(out)
out[[1]]

# More customization of the tests:
out <- findMarkers(sce, groups=kout$cluster, test.type="wilcox")
out[[1]]

out <- findMarkers(sce, groups=kout$cluster, lfc=1, direction="up")
out[[1]]

out <- findMarkers(sce, groups=kout$cluster, pval.type="all")
out[[1]]
Fit a mean-dependent trend to the squared coefficient of variation, computed from count data after size factor normalization.
fitTrendCV2(
  means,
  cv2,
  ncells,
  min.mean = 0.1,
  nls.args = list(),
  simplified = TRUE,
  nmads = 6,
  max.iter = 50
)
means |
A numeric vector containing mean normalized expression values for all genes. |
cv2 |
A numeric vector containing the squared coefficient of variation computed from normalized expression values for all genes. |
ncells |
Integer scalar specifying the number of cells used to compute |
min.mean |
Numeric scalar specifying the minimum mean to use for trend fitting. |
nls.args |
A list of parameters to pass to |
simplified |
Logical scalar indicating whether the function can automatically use a simpler trend if errors are encountered for the usual parameterization. |
nmads |
Numeric scalar specifying the number of MADs to use to compute the tricube bandwidth during robustification. |
max.iter |
Integer scalar specifying the maximum number of robustness iterations to perform. |
This function fits a mean-dependent trend to the CV2 of normalized expression values for the selected features. Specifically, it fits a trend of the form y = A + B/x
.
This trend is based on a similar formulation from DESeq2 and generally captures the mean-CV2 trend well.
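Because a trend of the form y = A + B/x is linear in the transformed predictor 1/x, the core of such a fit can be sketched with closed-form weighted least squares. This is a hypothetical illustration only; scran itself uses nls() with the density weighting and robustness iterations described below.

```python
def fit_cv2_trend(means, cv2, weights=None):
    """Weighted least-squares fit of cv2 = A + B/mean, exploiting linearity
    in the transformed predictor z = 1/mean."""
    n = len(means)
    w = weights or [1.0] * n
    z = [1.0 / m for m in means]  # transformed predictor
    sw = sum(w)
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, cv2)) / sw
    B = sum(wi * (zi - zbar) * (yi - ybar) for wi, zi, yi in zip(w, z, cv2)) \
        / sum(wi * (zi - zbar) ** 2 for wi, zi in zip(w, z))
    A = ybar - B * zbar
    return A, B  # fitted trend(x) = A + B / x
```

Feeding in points generated from A = 0.5 and B = 2 recovers those coefficients exactly, which is the behaviour the iterative nls() fit converges towards on well-behaved data.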
Trend fitting is performed after weighting each observation according to the inverse of the density of observations at the same mean.
This avoids problems with differences in the distribution of means that would otherwise favor good fits in highly dense intervals at the expense of sparser intervals.
Low-abundance genes with means below min.mean
are also removed prior to fitting, to avoid problems with discreteness and the upper bound on the CV2 at low counts.
Robustness iterations are also performed to protect against outliers.
An initial fit is performed and each observation is weighted using tricube-transformed standardized residuals (in addition to the existing inverse-density weights).
The bandwidth of the tricube scheme is defined as nmads
multiplied by the median standardized residual.
Iterations are performed until convergence or max.iter
is reached.
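The tricube weighting step can be sketched as follows; this is an assumed reading of the scheme described above (bandwidth and median details may differ from scran's actual code).

```python
from statistics import median

def tricube_weights(std_residuals, nmads=6):
    # Bandwidth is nmads times the median absolute standardized residual;
    # observations at or beyond the bandwidth receive zero weight.
    abs_r = [abs(r) for r in std_residuals]
    bw = nmads * median(abs_r)
    return [(1 - (a / bw) ** 3) ** 3 if a < bw else 0.0 for a in abs_r]
```

Outliers with large residuals are thus smoothly downweighted to zero rather than being hard-thresholded, which keeps the refitted trend stable across iterations.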
Occasionally, there are not enough high-abundance points to uniquely determine the A parameter.
In such cases, the function collapses back to fitting a simpler trend of the form y = B/x
to avoid errors about singular gradients in nls
.
If simplified=FALSE
, this simplification is not allowed and the error is directly reported.
A named list is returned containing:
trend
:A function that returns the fitted value of the trend at any value of the mean.
std.dev
:A numeric scalar containing the robust standard deviation of the ratio of var
to the fitted value of the trend across all features used for trend fitting.
Aaron Lun
Brennecke P, Anders S, Kim JK et al. (2013). Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10:1093-95
modelGeneCV2
and modelGeneCV2WithSpikes
, where this function is used.
library(scuttle)
sce <- mockSCE()
normcounts <- normalizeCounts(sce, log=FALSE)

# Fitting a trend:
library(DelayedMatrixStats)
means <- rowMeans(normcounts)
cv2 <- rowVars(normcounts)/means^2
fit <- fitTrendCV2(means, cv2, ncol(sce))

# Examining the trend fit:
plot(means, cv2, pch=16, cex=0.5, xlab="Mean", ylab="CV2", log="xy")
curve(fit$trend(x), add=TRUE, col="dodgerblue", lwd=3)
Create a mean-variance trend for log-normalized expression values derived from Poisson-distributed counts.
fitTrendPoisson(
  means,
  size.factors,
  npts = 1000,
  dispersion = 0,
  pseudo.count = 1,
  BPPARAM = SerialParam(),
  ...
)
means |
A numeric vector of length 2 or more, containing the range of mean counts observed in the dataset. |
size.factors |
A numeric vector of size factors for all cells in the dataset. |
npts |
An integer scalar specifying the number of interpolation points to use. |
dispersion |
A numeric scalar specifying the dispersion for the NB distribution. If zero, a Poisson distribution is used. |
pseudo.count |
A numeric scalar specifying the pseudo-count to be added to the scaled counts before log-transformation. |
BPPARAM |
A BiocParallelParam object indicating how parallelization should be performed across interpolation points. |
... |
Further arguments to pass to |
This function is useful for modelling technical noise in highly diverse datasets without spike-ins, where fitting a trend to the endogenous genes would not be appropriate given the strong biological heterogeneity. It is mostly intended for UMI datasets where the technical noise is close to Poisson-distributed.
This function operates by simulating Poisson or negative binomial-distributed counts,
computing log-transformed normalized expression values from those counts,
calculating the mean and variance and then passing those metrics to fitTrendVar
.
The log-transformation ensures that variance is modelled in the same space that is used for downstream analyses like PCA.
Simulations are performed across a range of values in means
to achieve reliable interpolation,
with the stability of the trend determined by the number of simulation points in npts
.
The number of cells is determined from the length of size.factors
,
which are used to scale the distribution means prior to sampling counts.
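The simulation for a single interpolation point can be sketched as follows. This is a pure-Python illustration with a naive Knuth Poisson sampler (suitable only for small means); the function names are assumptions and do not correspond to scran internals.

```python
import math
import random

def _poisson(lam, rng):
    # Knuth's algorithm; adequate for the small means used in this sketch.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_point(mean, size_factors, pseudo_count=1, rng=None):
    """Sample Poisson counts at one mean, scaled by each cell's size factor,
    and return the mean and variance of the log2 normalized values."""
    rng = rng or random.Random(0)
    logs = [math.log2(_poisson(mean * sf, rng) / sf + pseudo_count)
            for sf in size_factors]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / (len(logs) - 1)
    return mu, var
```

Repeating this across a grid of means yields the (mean, variance) pairs that are then passed to the trend-fitting machinery.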
A named list is returned containing:
trend
:A function that returns the fitted value of the trend at any value of the mean.
std.dev
:A numeric scalar containing the robust standard deviation of the ratio of var
to the fitted value of the trend across all features used for trend fitting.
Aaron Lun
fitTrendVar
, which is used to fit the trend.
# Mocking up means and size factors:
sf <- 2^rnorm(1000, sd=0.1)
sf <- sf/mean(sf)
means <- rexp(100, 0.1)

# Using these to construct a Poisson trend:
out <- fitTrendPoisson(means, sf)
curve(out$trend(x), xlim=c(0, 10))
Fit a mean-dependent trend to the variances of the log-normalized expression values derived from count data.
fitTrendVar(
  means,
  vars,
  min.mean = 0.1,
  parametric = TRUE,
  lowess = TRUE,
  density.weights = TRUE,
  nls.args = list(),
  ...
)
means |
A numeric vector containing the mean log-expression value for each gene. |
vars |
A numeric vector containing the variance of log-expression values for each gene. |
min.mean |
A numeric scalar specifying the minimum mean to use for trend fitting. |
parametric |
A logical scalar indicating whether a parametric fit should be attempted. |
lowess |
A logical scalar indicating whether a LOWESS fit should be attempted. |
density.weights |
A logical scalar indicating whether to use inverse density weights. |
nls.args |
A list of parameters to pass to |
... |
Further arguments to pass to |
This function fits a mean-dependent trend to the variance of the log-normalized expression for the selected features.
The fitted trend can then be used to decompose the variance of each gene into biological and technical components,
as done in modelGeneVar
and modelGeneVarWithSpikes
.
The default fitting process follows a two-step procedure when parametric=TRUE
and lowess=TRUE
:
A non-linear curve of the form y = ax/(x^n + b)
is fitted to the variances against the means using nls
.
Starting values and the number of iterations are automatically set if not explicitly specified in nls.args
.
weightedLowess
is applied to the log-ratios of the variance of each gene to the corresponding fitted value from the non-linear curve.
The final trend is defined as the product of the fitted values from the non-linear curve and the exponential of the LOWESS fitted value.
If any tuning is necessary, the most important parameter here is span
, which can be passed in the ...
arguments.
The aim is to use the parametric curve to reduce the sharpness of the expected mean-variance relationship for easier smoothing. Conversely, the parametric form is not exact, so the smoothers will model any remaining trends in the residuals.
If parametric=FALSE
, LOWESS smoothing is performed directly on the log-variances using weightedLowess
.
This may be helpful in situations where the data does not follow the expected parametric curve,
e.g., for transformed data that spans negative values where the expression is not defined.
(Indeed, the expression above is purely empirical, chosen simply as it matched the shape of the observed trend in many datasets.)
If lowess=FALSE
, the LOWESS smoothing step is omitted and the parametric fit is used directly.
This may be necessary in situations where the LOWESS overfits, e.g., because of very few points at high abundances.
A named list is returned containing:
trend
:A function that returns the fitted value of the trend at any value of the mean.
std.dev
:A numeric scalar containing the robust standard deviation of the ratio of var
to the fitted value of the trend across all features used for trend fitting.
Genes with mean log-expression below min.mean
are not used in trend fitting.
This aims to remove the majority of low-abundance genes and preserve the sensitivity of span-based smoothers at moderate-to-high abundances.
It also protects against discreteness, which can interfere with estimation of the variability of the variance estimates and accurate scaling of the trend.
Filtering is applied on the mean log-expression to avoid introducing spurious trends at the filter boundary. The default threshold is chosen based on the point at which discreteness is observed in variance estimates from Poisson-distributed counts. For heterogeneous droplet data, a lower threshold of 0.001-0.01 may be more appropriate, though this usually does not matter all too much.
When extrapolating to values below the smallest observed mean (or min.mean
), the output function will approach zero as the mean approaches zero.
This reflects the fact that the variance should be zero at a log-expression of zero (assuming a pseudo-count of 1 was used).
When extrapolating to values above the largest observed mean, the output function will be set to the fitted value of the trend at the largest mean.
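The boundary behaviour can be sketched by wrapping a fitted trend function; the linear decay to zero below the fitting boundary is an assumed form chosen for illustration, and the actual interpolation may differ.

```python
def extrapolated_trend(fitted_trend, min_obs, max_obs):
    """Wrap a fitted trend with the extrapolation behaviour described above:
    decay towards zero below the smallest observed mean, and a plateau at
    the fitted value of the largest observed mean."""
    top = fitted_trend(max_obs)
    def trend(x):
        if x >= max_obs:
            return top  # plateau beyond the largest observed mean
        if x < min_obs:
            # decay to zero at a mean of zero (an assumed linear form)
            return fitted_trend(min_obs) * x / min_obs
        return fitted_trend(x)
    return trend
```

This guarantees sensible fitted values when the trend is later evaluated at gene abundances outside the range used for fitting.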
All fitting (with nls
and weightedLowess
) is performed by weighting each observation according to the inverse of the density of observations at the same mean.
This ensures that parts of the curve with few points are fitted as well as parts of the trend with many points.
Otherwise, differences in the distribution of means would favor good fits in highly dense intervals at the expense of sparser intervals.
(Note that these densities are computed after filtering on min.mean
, so the high density of points at zero has no effect.)
That being said, the density weighting can give inappropriate weight to very sparse intervals, especially those at high abundance.
This results in overfitting where the trend is compelled to pass through each point at these intervals.
For most part, this is harmless as most high-abundance genes are not highly variable so an overfitted trend is actually appropriate.
However, if high-abundance genes are variable, it may be better to set density.weights=FALSE
to avoid this overfitting effect.
Aaron Lun
modelGeneVar
and modelGeneVarWithSpikes
, where this function is used.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Fitting a trend:
library(DelayedMatrixStats)
means <- rowMeans(logcounts(sce))
vars <- rowVars(logcounts(sce))
fit <- fitTrendVar(means, vars)

# Visualizing the trend fit:
plot(means, vars, pch=16, cex=0.5, xlab="Mean", ylab="Variance")
curve(fit$trend(x), add=TRUE, col="dodgerblue", lwd=3)
Perform a PCA where the desired number of components is known ahead of time.
fixedPCA(
  x,
  rank = 50,
  value = c("pca", "lowrank"),
  subset.row,
  preserve.shape = TRUE,
  assay.type = "logcounts",
  name = NULL,
  BSPARAM = bsparam(),
  BPPARAM = SerialParam()
)
x |
A SingleCellExperiment object containing a log-expression matrix. |
rank |
Integer scalar specifying the number of components. |
value |
String specifying the type of value to return.
|
subset.row |
A logical, character or integer vector specifying the rows of |
preserve.shape |
Logical scalar indicating whether or not the output SingleCellExperiment should be subsetted to |
assay.type |
A string specifying which assay values to use. |
name |
String containing the name with which to store the results.
Defaults to |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA. |
BPPARAM |
A BiocParallelParam object to use for parallel processing. |
In theory, there is an optimal number of components for any given application,
but in practice, the criterion for the optimum is difficult to define.
As a result, it is often satisfactory to take an a priori-defined “reasonable” number of PCs for downstream analyses.
A good rule of thumb is to set this to the upper bound on the expected number of subpopulations in the dataset
(see the reasoning in getClusteredPCs
).
We can use subset.row
to perform the PCA on a subset of genes.
This is typically used to subset to HVGs to reduce computational time and increase the signal-to-noise ratio of downstream analyses.
If preserve.shape=TRUE
, the rotation matrix is extrapolated to include loadings for “unselected” genes, i.e., not in subset.row
.
This is done by projecting their expression profiles into the low-dimensional space defined by the SVD on the selected genes.
By doing so, we ensure that the output always has the same number of rows as x
such that any value="lowrank"
can fit into the assays.
Otherwise, if preserve.shape=FALSE
, the output is subsetted by any non-NULL
value of subset.row
.
This is equivalent to the return value after calling the function on x[subset.row,]
.
A modified x
with:
the PC results stored in the reducedDims
as a "PCA"
entry, if value="pca"
.
The attributes contain the rotation matrix, the variance explained and the percentage of variance explained.
(Note that the last may not sum to 100% if rank
is smaller than the total number of PCs.)
a low-rank approximation stored as a new "lowrank"
assay, if value="lowrank"
.
This is represented as a LowRankMatrix.
Aaron Lun
denoisePCA
, where the number of PCs is automatically chosen.
getClusteredPCs
, another method to choose the number of PCs.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Modelling the variance:
var.stats <- modelGeneVar(sce)
hvgs <- getTopHVGs(var.stats, n=1000)

# Defaults to pulling out the top 50 PCs.
set.seed(1000)
sce <- fixedPCA(sce, subset.row=hvgs)
reducedDimNames(sce)

# Get the percentage of variance explained.
attr(reducedDim(sce), "percentVar")
Details on how gene selection is performed in almost all scran functions.
For functions accepting some gene-by-cell matrix x
, we can choose to perform calculations only on a subset of rows (i.e., genes) with the subset.row
argument.
This can be a logical, integer or character vector indicating the rows of x
to use.
If a character vector, it must contain the names of the rows in x
.
Future support will be added for more esoteric subsetting vectors like the Bioconductor Rle classes.
The output of running a function with subset.row
will always be the same as the output of subsetting x
beforehand and passing it into the function.
However, it is often more efficient to use subset.row
as we can avoid constructing an intermediate subsetted matrix.
The same reasoning applies for any x
that is a SingleCellExperiment object.
Some functions will have a min.mean
argument to filter out low-abundance genes prior to processing.
Depending on the function, the filter may be applied to the average library size-adjusted count computed by calculateAverage
, the average log-count,
or some other measure of abundance - see the documentation for each function for details.
Any filtering on min.mean
is automatically intersected with a specified subset.row
.
For example, only subsetted genes that pass the filter are retained if subset.row
is specified alongside min.mean
.
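The intersection behaviour amounts to filtering the requested subset by the abundance threshold; a minimal sketch with hypothetical names:

```python
def effective_rows(means, subset_row=None, min_mean=0.1):
    """Rows actually used for computation: the requested subset intersected
    with the abundance filter, as described above."""
    candidates = range(len(means)) if subset_row is None else subset_row
    return [i for i in candidates if means[i] >= min_mean]
```

A gene must therefore both be requested (or no subset given) and pass the abundance filter to contribute to the calculation.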
Aaron Lun
Cluster cells after using varying number of PCs, and pick the number of PCs using a heuristic based on the number of clusters.
getClusteredPCs(
  pcs,
  FUN = NULL,
  ...,
  BLUSPARAM = NNGraphParam(),
  min.rank = 5,
  max.rank = ncol(pcs),
  by = 1
)
pcs |
A numeric matrix of PCs, where rows are cells and columns are dimensions representing successive PCs. |
FUN |
A clustering function that takes a numeric matrix with rows as cells and
returns a vector containing a cluster label for each cell.
Defaults to |
... |
Further arguments to pass to |
BLUSPARAM |
A BlusterParam object specifying the clustering to use when |
min.rank |
Integer scalar specifying the minimum number of PCs to use. |
max.rank |
Integer scalar specifying the maximum number of PCs to use. |
by |
Integer scalar specifying what intervals should be tested between |
Assume that the data contains multiple subpopulations, each of which is separated from the others on a different axis.
For example, each subpopulation could be defined by a unique set of marker genes that drives separation on its own PC.
If we had x subpopulations, we would need at least x - 1 PCs to successfully distinguish all of them.
This motivates the choice of x - 1 PCs, provided we know the number of subpopulations in the data.
In practice, we do not know the number of subpopulations so we use the number of clusters as a proxy instead.
We apply a clustering function FUN
on the first d PCs,
and only consider the values of d
that yield no more than d + 1
clusters.
If we see more clusters with fewer dimensions,
we consider this to represent overclustering rather than distinct subpopulations,
as multiple subpopulations should not be distinguishable on the same axes (based on the assumption above).
We choose the d that satisfies the constraint above and maximizes the number of clusters.
The idea is that more PCs should include more biological signal, allowing
FUN
to detect more distinct subpopulations;
until the point that the added noise at high dimensions outweighs the extra signal,
such that resolution decreases and it becomes more difficult for FUN
to distinguish between subpopulations.
Any FUN
can be used that automatically chooses the number of clusters based on the data.
The default is a graph-based clustering method using makeSNNGraph
and cluster_walktrap
,
where arguments in ...
are passed to the former.
Users should not supply a FUN
where the number of clusters is fixed in advance
(e.g., k-means, or hierarchical clustering with a known k
in cutree
).
The identities of the output clusters are returned at each step for comparison, e.g., using methods like clustree.
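The selection rule can be sketched in a few lines of base R; the cluster counts below are invented purely for illustration:

```r
# Suppose clustering was run on the first d = 5..15 PCs and the number of
# clusters recorded at each d (invented values).
d <- 5:15
n.clusters <- c(3, 9, 5, 6, 8, 8, 9, 12, 10, 9, 9)

# Discard choices of d yielding more than d + 1 clusters (overclustering),
# then pick the d that maximizes the number of clusters.
ok <- n.clusters <= d + 1
chosen <- d[ok][which.max(n.clusters[ok])]
chosen  # 12 PCs here, giving 12 clusters
```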
A DataFrame with one row per tested number of PCs. This contains the fields:
n.pcs
:Integer scalar specifying the number of PCs used.
n.clusters
:Integer scalar specifying the number of clusters identified.
clusters
:A List containing the cluster identities for this number of PCs.
The metadata of the DataFrame contains chosen
,
an integer scalar specifying the “ideal” number of PCs to use.
Aaron Lun
runPCA
, to compute the PCs in the first place.
clusterRows
and BlusterParam, for possible choices of BLUSPARAM
.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)
sce <- scater::runPCA(sce)

output <- getClusteredPCs(reducedDim(sce))
output
metadata(output)$chosen
Utility function to extract the marker effect sizes as a matrix from the output of findMarkers
.
getMarkerEffects(x, prefix = "logFC", strip = TRUE, remove.na.col = FALSE)
x |
A DataFrame containing marker statistics for a given group/cluster,
usually one element of the List returned by |
prefix |
String containing the prefix for the columns containing the effect size. |
strip |
Logical scalar indicating whether the prefix should be removed from the output column names. |
remove.na.col |
Logical scalar indicating whether to remove columns containing any |
Setting remove.na.col=TRUE
may be desirable in applications involving blocked comparisons,
where some pairwise comparisons are not possible if the relevant levels occur in different blocks.
In such cases, the resulting column is filled with NA
s that may interfere with downstream steps like clustering.
A numeric matrix containing the effect sizes for the comparison to every other group/cluster.
Aaron Lun
findMarkers
and combineMarkers
, to generate the DataFrames.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

kout <- kmeans(t(logcounts(sce)), centers=4)
out <- findMarkers(sce, groups=kout$cluster)

eff1 <- getMarkerEffects(out[[1]])
str(eff1)
Define a set of highly variable genes, based on variance modelling statistics
from modelGeneVar
or related functions.
getTopHVGs(
    stats,
    var.field = "bio",
    n = NULL,
    prop = NULL,
    var.threshold = 0,
    fdr.field = "FDR",
    fdr.threshold = NULL,
    row.names = !is.null(rownames(stats))
)
stats |
A DataFrame of variance modelling statistics with one row per gene.
Alternatively, a SummarizedExperiment object, in which case it is supplied to |
var.field |
String specifying the column of |
n |
Integer scalar specifying the number of top HVGs to report. |
prop |
Numeric scalar specifying the proportion of genes to report as HVGs. |
var.threshold |
Numeric scalar specifying the minimum threshold on the metric of variation. |
fdr.field |
String specifying the column of |
fdr.threshold |
Numeric scalar specifying the FDR threshold. |
row.names |
Logical scalar indicating whether row names should be reported. |
This function will identify all genes where the relevant metric of variation is greater than var.threshold
.
By default, this means that we retain all genes with positive values in the var.field
column of stats
.
If var.threshold=NULL
, the minimum threshold on the value of the metric is not applied.
If fdr.threshold
is specified, we further subset to genes that have FDR less than or equal to fdr.threshold
.
By default, FDR thresholding is turned off as modelGeneVar
and related functions
determine significance of large variances relative to other genes.
This can be overly conservative if many genes are highly variable.
If n=NULL
and prop=NULL
, the resulting subset of genes is directly returned.
Otherwise, the top set of genes with the largest values of the variance metric are returned,
where the size of the set is defined as the larger of n
and prop*nrow(stats)
.
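As a sketch of this selection logic on invented statistics (base R only, not the actual implementation):

```r
# Invented biological components for six genes.
bio <- c(1.2, -0.1, 0.5, 2.0, 0.0, 0.3)
var.threshold <- 0
n <- 2
prop <- 0.5

keep <- which(bio > var.threshold)           # metric must exceed the threshold
size <- max(n, ceiling(prop * length(bio)))  # larger of n and prop * ngenes
o <- order(bio[keep], decreasing = TRUE)
top <- keep[head(o, size)]
top  # indices of the top HVGs
```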
A character vector containing the names of the most variable genes, if row.names=TRUE
.
Otherwise, an integer vector specifying the indices of stats
containing the most variable genes.
Aaron Lun
modelGeneVar
and friends, to generate stats
.
modelGeneCV2
and friends, to also generate stats
.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

stats <- modelGeneVar(sce)
str(getTopHVGs(stats))
str(getTopHVGs(stats, fdr.threshold=0.05)) # more stringent

# Or directly pass in the SingleCellExperiment:
str(getTopHVGs(sce))

# Alternatively, use with the coefficient of variation:
stats2 <- modelGeneCV2(sce)
str(getTopHVGs(stats2, var.field="ratio"))
Obtain the top markers for each pairwise comparison between clusters, or for each cluster.
getTopMarkers(
    de.lists,
    pairs,
    n = 10,
    pval.field = "p.value",
    fdr.field = "FDR",
    pairwise = TRUE,
    pval.type = c("any", "some", "all"),
    fdr.threshold = 0.05,
    ...
)
de.lists |
A list-like object where each element is a data.frame or DataFrame. Each element should represent the results of a pairwise comparison between two groups/clusters, in which each row should contain the statistics for a single gene/feature. Rows should be named by the feature name in the same order for all elements. |
pairs |
A matrix, data.frame or DataFrame with two columns and number of rows equal to the length of |
n |
Integer scalar specifying the number of markers to obtain from each pairwise comparison, if pairwise=TRUE. Otherwise, the number of top genes to take from each cluster's combined marker set; see Details. |
pval.field |
String specifying the column of each DataFrame in |
fdr.field |
String specifying the column containing the adjusted p-values. |
pairwise |
Logical scalar indicating whether top markers should be returned for every pairwise comparison.
If |
pval.type |
String specifying how markers from pairwise comparisons are to be combined if |
fdr.threshold |
Numeric scalar specifying the FDR threshold for filtering.
If |
... |
Further arguments to pass to |
This is a convenience utility that converts the results of pairwise comparisons into a marker list
that can be used in downstream functions, e.g., as the marker sets in SingleR.
By default, it returns a list of lists containing the top genes for every pairwise comparison,
which is useful for feature selection to select genes distinguishing between closely related clusters.
The top n
genes with adjusted p-values below fdr.threshold
are chosen.
If pairwise=FALSE
, combineMarkers
is called on de.lists
and pairs
to obtain a per-cluster ranking of genes from all pairwise comparisons involving that cluster.
If pval.type="any"
, the top genes with Top
values no greater than n
are retained;
this is equivalent to taking the union of the top n
genes from each pairwise comparison for each cluster.
Otherwise, the top n
genes with the smallest p-values are retained.
In both cases, genes are further filtered by fdr.threshold
.
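For intuition, the pval.type="any" case can be sketched using the Top field reported by findMarkers (all values below are invented):

```r
# 'Top' is the minimum rank of each gene across all pairwise comparisons,
# as reported by findMarkers(pval.type="any"); values here are invented.
Top <- c(geneA = 1, geneB = 3, geneC = 1, geneD = 7, geneE = 2)
FDR <- c(0.001, 0.01, 0.2, 0.001, 0.03)
n <- 3
fdr.threshold <- 0.05

# Genes with Top <= n form the union of the top n from each comparison;
# the FDR filter is then applied on top of that.
markers <- names(Top)[Top <= n & FDR <= fdr.threshold]
markers
```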
If pairwise=TRUE
, a List of Lists of character vectors is returned.
Each element of the outer list corresponds to cluster X, each element of the inner list corresponds to another cluster Y,
and each character vector specifies the marker genes that distinguish X from Y.
If pairwise=FALSE
, a List of character vectors is returned.
Each character vector contains the marker genes that distinguish X from any, some or all other clusters,
depending on pval.type
.
Aaron Lun
pairwiseTTests
and friends, to obtain de.lists
and pairs
.
combineMarkers
, for another function that consolidates pairwise DE comparisons.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=3)

out <- pairwiseTTests(logcounts(sce),
    groups=paste0("Cluster", kout$cluster))

# Getting top pairwise markers:
top <- getTopMarkers(out$statistics, out$pairs)
top[[1]]
top[[1]][[2]]

# Getting top per-cluster markers:
top <- getTopMarkers(out$statistics, out$pairs, pairwise=FALSE)
top[[1]]
Model the squared coefficient of variation (CV2) of the normalized expression profiles for each gene, fitting a trend to account for the mean-variance relationship across genes.
modelGeneCV2(x, ...)

## S4 method for signature 'ANY'
modelGeneCV2(
    x,
    size.factors = NULL,
    block = NULL,
    subset.row = NULL,
    subset.fit = NULL,
    ...,
    equiweight = TRUE,
    method = "fisher",
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
modelGeneCV2(x, ..., assay.type = "counts")

## S4 method for signature 'SingleCellExperiment'
modelGeneCV2(x, size.factors = NULL, ...)
x |
A numeric matrix of counts where rows are genes and columns are cells. Alternatively, a SummarizedExperiment or SingleCellExperiment containing such a matrix. |
... |
For the generic, further arguments to pass to each method. For the ANY method, further arguments to pass to For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
size.factors |
A numeric vector of size factors for each cell in |
block |
A factor specifying the blocking levels for each cell in |
subset.row |
See |
subset.fit |
An argument similar to |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight.
Otherwise, each block is weighted according to its number of cells.
Only used if |
method |
String specifying how p-values should be combined when |
BPPARAM |
A BiocParallelParam object indicating whether parallelization should be performed across genes. |
assay.type |
String or integer scalar specifying the assay containing the counts. |
For each gene, we compute the CV2 and mean of the counts after scaling them by size.factors
.
A trend is fitted to the CV2 against the mean for all genes using fitTrendCV2
.
The fitted value for each gene is used as a proxy for the technical noise, assuming that most genes exhibit a low baseline level of variation that is not biologically interesting.
The ratio of the total CV2 to the trend is used as a metric to rank interesting genes, with larger ratios being indicative of strong biological heterogeneity.
By default, the trend is fitted using all of the genes in x
.
If subset.fit
is specified, the trend is fitted using only the specified subset,
and the values of trend
for all other genes are determined by extrapolation or interpolation.
This could be used to perform the fit based on genes that are known to have low variance, thus weakening the assumption above.
Note that this does not refer to spike-in transcripts, which should be handled via modelGeneCV2WithSpikes
.
If no size factors are supplied, they are automatically computed depending on the input type:
If size.factors=NULL
for the ANY method, the sum of counts for each cell in x
is used to compute a size factor via the librarySizeFactors
function.
If size.factors=NULL
for the SingleCellExperiment method, sizeFactors(x)
is used if available.
Otherwise, it defaults to library size-derived size factors.
If size.factors
are supplied, they will override any size factors present in x
.
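A minimal sketch of the library size-derived fallback, mirroring (to my understanding) what scuttle's librarySizeFactors computes: per-cell totals centered to a mean of 1.

```r
# Made-up genes-by-cells count matrix.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                   30, 1, 9), nrow = 3, byrow = TRUE)

lib.sizes <- colSums(counts)                 # total count per cell
size.factors <- lib.sizes / mean(lib.sizes)  # centered to unit mean
size.factors
```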
A DataFrame is returned where each row corresponds to a gene in x
(or in subset.row
, if specified).
This contains the numeric fields:
mean
:Mean normalized expression per gene.
total
:Squared coefficient of variation of the normalized expression per gene.
trend
:Fitted value of the trend.
ratio
:Ratio of total
to trend
.
p.value, FDR
:Raw and adjusted p-values for the test against the null hypothesis that ratio<=1
.
If block
is not specified,
the metadata
of the DataFrame contains the output of running fitTrendCV2
on the specified features,
along with the mean
and cv2
used to fit the trend.
If block
is specified,
the output contains another per.block
field.
This field is itself a DataFrame of DataFrames, where each internal DataFrame contains statistics for the variance modelling within each block and has the same format as described above.
Each internal DataFrame's metadata
contains the output of fitTrendCV2
for the cells of that block.
The p-value for each gene is computed by assuming that the CV2 estimates are normally distributed around the trend, and that the standard deviation of the CV2 distribution is proportional to the value of the trend.
This is used to construct a one-sided test for each gene based on its ratio
, under the null hypothesis that the ratio is equal to or less than 1.
The proportionality constant for the standard deviation is set to the std.dev
returned by fitTrendCV2
.
This is estimated from the spread of per-gene CV2 values around the trend, so the null hypothesis effectively becomes “is this gene more variable than other genes of the same abundance?”
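A hedged sketch of this test for a single gene, with all numbers invented; std.dev stands in for the proportionality constant returned by fitTrendCV2:

```r
total <- 4.0     # observed CV2 (invented)
trend <- 2.5     # fitted trend value at this gene's abundance (invented)
std.dev <- 0.4   # spread of CV2 values around the trend (invented)

# Assume the CV2 estimate is normal around the trend with
# standard deviation std.dev * trend; test H0: total <= trend.
z <- (total - trend) / (std.dev * trend)
p <- pnorm(z, lower.tail = FALSE)
p  # about 0.067
```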
Setting block
will estimate the mean and variance of each gene for cells in each level of block
separately.
The trend is fitted separately for each level, and the variance decomposition is also performed separately.
Per-level statistics are then combined to obtain a single value per gene:
For means and CV2 values, this is done by taking the geometric mean across blocking levels.
If equiweight=FALSE
, a weighted average is used where the value for each level is weighted by the number of cells.
By default, all levels are equally weighted when combining statistics.
Per-level p-values are combined using combineParallelPValues
according to method
.
By default, Fisher's method is used to identify genes that are highly variable in any batch.
Whether or not this is responsive to equiweight
depends on the chosen method.
Blocks with fewer than 2 cells are completely ignored and do not contribute to the combined mean, variance component or p-value.
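The equal-weight combining step can be sketched as follows, for one gene across two invented blocks; Fisher's method is written out in its standard chi-squared form, which may differ in detail from the actual combineParallelPValues code path:

```r
block.cv2 <- c(A = 2.0, B = 8.0)  # invented per-block CV2 values for one gene
block.p <- c(A = 0.01, B = 0.20)  # invented per-block p-values

# Geometric mean across blocks for the statistic.
combined.cv2 <- exp(mean(log(block.cv2)))

# Fisher's method for the p-values.
fisher.stat <- -2 * sum(log(block.p))
combined.p <- pchisq(fisher.stat, df = 2 * length(block.p), lower.tail = FALSE)

c(cv2 = combined.cv2, p = combined.p)
```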
Aaron Lun
library(scuttle)
sce <- mockSCE()

# Simple case:
spk <- modelGeneCV2(sce)
spk

plot(spk$mean, spk$total, pch=16, log="xy")
curve(metadata(spk)$trend(x), add=TRUE, col="dodgerblue")

# With blocking:
block <- sample(LETTERS[1:2], ncol(sce), replace=TRUE)
blk <- modelGeneCV2(sce, block=block)
blk

par(mfrow=c(1,2))
for (i in colnames(blk$per.block)) {
    current <- blk$per.block[[i]]
    plot(current$mean, current$total, pch=16, log="xy")
    curve(metadata(current)$trend(x), add=TRUE, col="dodgerblue")
}
Model the squared coefficient of variation (CV2) of the normalized expression profiles for each gene, using spike-ins to estimate the baseline level of technical noise at each abundance.
modelGeneCV2WithSpikes(x, ...)

## S4 method for signature 'ANY'
modelGeneCV2WithSpikes(
    x,
    spikes,
    size.factors = NULL,
    spike.size.factors = NULL,
    block = NULL,
    subset.row = NULL,
    ...,
    equiweight = TRUE,
    method = "fisher",
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
modelGeneCV2WithSpikes(x, ..., assay.type = "counts")

## S4 method for signature 'SingleCellExperiment'
modelGeneCV2WithSpikes(
    x,
    spikes,
    size.factors = NULL,
    spike.size.factors = NULL,
    ...,
    assay.type = "counts"
)
x |
A numeric matrix of counts where rows are (usually endogenous) genes and columns are cells. Alternatively, a SummarizedExperiment or SingleCellExperiment containing such a matrix. |
... |
For the generic, further arguments to pass to each method. For the ANY method, further arguments to pass to For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
spikes |
A numeric matrix of counts where each row corresponds to a spike-in transcript.
This should have the same number of columns as Alternatively, for the SingleCellExperiment method,
this can be a string or an integer scalar specifying the |
size.factors |
A numeric vector of size factors for each cell in |
spike.size.factors |
A numeric vector of size factors for each cell in |
block |
A factor specifying the blocking levels for each cell in |
subset.row |
See |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight.
Otherwise, each block is weighted according to its number of cells.
Only used if |
method |
String specifying how p-values should be combined when |
BPPARAM |
A BiocParallelParam object indicating whether parallelization should be performed across genes. |
assay.type |
String or integer scalar specifying the assay containing the counts. For the SingleCellExperiment method, this is used to retrieve both the endogenous and spike-in counts. |
For each gene and spike-in transcript, we compute the variance and CV2 of the normalized expression values.
A trend is fitted to the CV2 against the mean for spike-in transcripts using fitTrendCV2
.
The value of the trend at the abundance of each gene is used to define the variation attributable to technical noise.
The ratio to the trend is then used to define overdispersion corresponding to interesting biological heterogeneity.
This function is almost the same as modelGeneCV2
, with the only theoretical difference being that the trend is fitted on spike-in CV2 rather than using the means and CV2 of endogenous genes.
This is because there are certain requirements for how normalization is performed when comparing spike-in transcripts with endogenous genes - see comments in “Explanation for size factor rescaling”.
We enforce this by centering the size factors for both sets of features and recomputing normalized expression values.
A DataFrame is returned where each row corresponds to a gene in x
(or in subset.row
, if specified).
This contains the numeric fields:
mean
:Mean normalized expression per gene.
total
:Squared coefficient of variation of the normalized expression per gene.
trend
:Fitted value of the trend.
ratio
:Ratio of total
to trend
.
p.value, FDR
:Raw and adjusted p-values for the test against the null hypothesis that ratio<=1
.
If block
is not specified,
the metadata
of the DataFrame contains the output of running fitTrendCV2
on the spike-in transcripts,
along with the mean
and cv2
used to fit the trend.
If block
is specified, the output contains another per.block
field.
This field is itself a DataFrame of DataFrames, where each internal DataFrame contains statistics for the variance modelling within each block and has the same format as described above.
Each internal DataFrame's metadata
contains the output of fitTrendCV2
for the cells of that block.
The p-value for each gene is computed by assuming that the CV2 estimates are normally distributed around the trend, and that the standard deviation of the CV2 distribution is proportional to the value of the trend.
This is used to construct a one-sided test for each gene based on its ratio
, under the null hypothesis that the ratio is equal to 1.
The proportionality constant for the standard deviation is set to the std.dev
returned by fitTrendCV2
.
This is estimated from the spread of CV2 values for spike-in transcripts, so the null hypothesis effectively becomes “is this gene more variable than spike-in transcripts of the same abundance?”
If no size factors are supplied, they are automatically computed depending on the input type:
If size.factors=NULL
for the ANY method, the sum of counts for each cell in x
is used to compute a size factor via the librarySizeFactors
function.
If spike.size.factors=NULL
for the ANY method, the sum of counts for each cell in spikes
is used to compute a size factor via the librarySizeFactors
function.
If size.factors=NULL
for the SingleCellExperiment method, sizeFactors(x)
is used if available.
Otherwise, it defaults to library size-derived size factors.
If spike.size.factors=NULL
for the SingleCellExperiment method and spikes
is not a matrix, sizeFactors(altExp(x, spikes))
is used if available.
Otherwise, it defaults to library size-derived size factors.
If size.factors
or spike.size.factors
are supplied, they will override any size factors present in x
.
The use of a spike-in-derived trend makes several assumptions. The first is that a constant amount of spike-in RNA was added to each cell, such that any differences in observed expression of the spike-in transcripts can be wholly attributed to technical noise. The second is that endogenous genes and spike-in transcripts follow the same mean-variance relationship, i.e., a spike-in transcript captures the technical noise of an endogenous gene at the same mean count.
Here, the spike-in size factors across all cells are scaled so that their mean is equal to that of the gene-based size factors for the same set of cells. This ensures that the average normalized abundances of the spike-in transcripts are comparable to those of the endogenous genes, allowing the trend fitted to the former to be used to determine the biological component of the latter. Otherwise, differences in scaling of the size factors would shift the normalized expression values of the former away from the latter, violating the second assumption.
If block
is specified, rescaling is performed separately for all cells in each block.
This aims to avoid problems from (frequent) violations of the first assumption where there are differences in the quantity of spike-in RNA added to each batch.
Without scaling, these differences would lead to systematic shifts in the spike-in abundances from the endogenous genes when fitting a batch-specific trend
(even if there is no global difference in scaling across all batches).
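The centering described above amounts to a one-line rescaling; the factors below are invented:

```r
gene.sf <- c(0.8, 1.0, 1.2)   # invented gene-based size factors
spike.sf <- c(2.0, 4.0, 6.0)  # invented spike-in size factors

# Rescale so the mean spike-in factor matches the mean gene-based factor.
spike.sf.scaled <- spike.sf / mean(spike.sf) * mean(gene.sf)
spike.sf.scaled  # c(0.5, 1.0, 1.5)
```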
Setting block
will estimate the mean and variance of each gene for cells in each level of block
separately.
The trend is fitted separately for each level, and the variance decomposition is also performed separately.
Per-level statistics are then combined to obtain a single value per gene:
For means and CV2 values, this is done by taking the geometric mean across blocking levels.
If equiweight=FALSE
, a weighted average is used where the value for each level is weighted by the number of cells.
By default, all levels are equally weighted when combining statistics.
Per-level p-values are combined using combineParallelPValues
according to method
.
By default, Fisher's method is used to identify genes that are highly variable in any batch.
Whether or not this is responsive to equiweight
depends on the chosen method.
Blocks with fewer than 2 cells are completely ignored and do not contribute to the combined mean, variance component or p-value.
Aaron Lun
fitTrendCV2
, for the trend fitting options.
modelGeneCV2
, for modelling variance without spike-in controls.
library(scuttle)
sce <- mockSCE()

# Using spike-ins.
spk <- modelGeneCV2WithSpikes(sce, "Spikes")
spk

plot(spk$mean, spk$total)
points(metadata(spk)$mean, metadata(spk)$cv2, col="red", pch=16)
curve(metadata(spk)$trend(x), add=TRUE, col="dodgerblue")

# With blocking (and spike-ins).
block <- sample(LETTERS[1:2], ncol(sce), replace=TRUE)
blk <- modelGeneCV2WithSpikes(sce, "Spikes", block=block)
blk

par(mfrow=c(1,2))
for (i in colnames(blk$per.block)) {
    current <- blk$per.block[[i]]
    plot(current$mean, current$total)
    points(metadata(current)$mean, metadata(current)$cv2, col="red", pch=16)
    curve(metadata(current)$trend(x), add=TRUE, col="dodgerblue")
}
Model the variance of the log-expression profiles for each gene, decomposing it into technical and biological components based on a fitted mean-variance trend.
## S4 method for signature 'ANY'
modelGeneVar(
    x,
    block = NULL,
    design = NULL,
    subset.row = NULL,
    subset.fit = NULL,
    ...,
    equiweight = TRUE,
    method = "fisher",
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
modelGeneVar(x, ..., assay.type = "logcounts")
x |
A numeric matrix of log-normalized expression values where rows are genes and columns are cells. Alternatively, a SummarizedExperiment containing such a matrix. |
block |
A factor specifying the blocking levels for each cell in |
design |
A numeric matrix containing blocking terms for uninteresting factors of variation. |
subset.row |
See |
subset.fit |
An argument similar to |
... |
For the generic, further arguments to pass to each method. For the ANY method, further arguments to pass to For the SummarizedExperiment method, further arguments to pass to the ANY method. |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight.
Otherwise, each block is weighted according to its number of cells.
Only used if |
method |
String specifying how p-values should be combined when |
BPPARAM |
A BiocParallelParam object indicating whether parallelization should be performed across genes. |
assay.type |
String or integer scalar specifying the assay containing the log-expression values. |
For each gene, we compute the variance and mean of the log-expression values.
A trend is fitted to the variance against the mean for all genes using fitTrendVar
.
The fitted value for each gene is used as a proxy for the technical component of variation for each gene,
under the assumption that most genes exhibit a low baseline level of variation that is not biologically interesting.
The biological component of variation for each gene is defined as the residual from the trend.
Ranking genes by the biological component enables identification of interesting genes for downstream analyses in a manner that accounts for the mean-variance relationship. We use log-transformed expression values to blunt the impact of large positive outliers and to ensure that large variances are driven by strong log-fold changes between cells rather than differences in counts. Log-expression values are also used in downstream analyses like PCA, so modelling them here avoids inconsistencies with different quantifications of variation across analysis steps.
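The decomposition itself is simple once a trend is available; the sketch below uses a toy trend function in place of the fitTrendVar fit, with invented per-gene statistics:

```r
means <- c(0.5, 1.0, 2.0, 4.0)   # invented mean log-expression per gene
total <- c(0.9, 1.4, 1.1, 0.7)   # invented variance of log-expression per gene
trend <- function(m) 1 / (1 + m) # toy decreasing trend, not the real fit

tech <- trend(means)  # fitted value = technical component
bio <- total - tech   # residual from the trend = biological component
bio
```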
By default, the trend is fitted using all of the genes in x
.
If subset.fit
is specified, the trend is fitted using only the specified subset,
and the technical components for all other genes are determined by extrapolation or interpolation.
This could be used to perform the fit based on genes that are known to have low variance, thus weakening the assumption above.
Note that this does not refer to spike-in transcripts, which should be handled via modelGeneVarWithSpikes
.
A DataFrame is returned where each row corresponds to a gene in x
(or in subset.row
, if specified).
This contains the numeric fields:
mean
:Mean normalized log-expression per gene.
total
:Variance of the normalized log-expression per gene.
bio
:Biological component of the variance.
tech
:Technical component of the variance.
p.value, FDR
:Raw and adjusted p-values for the test against the null hypothesis that bio<=0
.
If block
is not specified,
the metadata
of the DataFrame contains the output of running fitTrendVar
on the specified features,
along with the mean
and var
used to fit the trend.
If block
is specified,
the output contains another per.block
field.
This field is itself a DataFrame of DataFrames, where each internal DataFrame contains statistics for the variance modelling within each block and has the same format as described above.
Each internal DataFrame's metadata
contains the output of fitTrendVar
for the cells of that block.
Setting block will estimate the mean and variance of each gene for cells in each level of block separately. The trend is fitted separately for each level, and the variance decomposition is also performed separately. Per-level statistics are then combined to obtain a single value per gene:

For means and variance components, this is done by averaging values across levels. If equiweight=FALSE, a weighted average is used where the value for each level is weighted by the number of cells. By default, all levels are equally weighted when combining statistics.

Per-level p-values are combined using combineParallelPValues according to method. By default, Fisher's method is used to identify genes that are highly variable in any batch. Whether or not this is responsive to equiweight depends on the chosen method.

Blocks with fewer than 2 cells are completely ignored and do not contribute to the combined mean, variance component or p-value.
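To illustrate the default combining strategy, Fisher's method can be sketched in a few lines of base R. This is a simplified stand-in for combineParallelPValues from the metapod package, ignoring weights and edge cases; the per-block p-values below are hypothetical.

```r
# Simplified sketch of Fisher's method for combining per-block p-values;
# the actual combining is done by metapod's combineParallelPValues.
fisher.combine <- function(p.list) {
    stat <- Reduce(`+`, lapply(p.list, function(p) -2 * log(p)))
    pchisq(stat, df = 2 * length(p.list), lower.tail = FALSE)
}

# Per-gene p-values from two blocks (hypothetical values):
blockA <- c(0.01, 0.5, 0.9)
blockB <- c(0.02, 0.4, 0.8)
fisher.combine(list(blockA, blockB))
```

A gene only obtains a small combined p-value if its evidence is strong in at least one block, which matches the "highly variable in any batch" interpretation above.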
Use of block is the recommended approach for accounting for any uninteresting categorical factor of variation. In addition to accounting for systematic differences in expression between levels of the blocking factor, it also accommodates differences in the mean-variance relationships.

Alternatively, uninteresting factors can be used to construct a design matrix to pass to the function via design. In this case, a linear model is fitted to the expression profile for each gene and the residual variance is calculated. This approach is useful for covariates or additive models that cannot be expressed as a one-way layout for use in block. However, it assumes that the error is normally distributed with equal variance for all observations of a given gene.

Use of block and design together is currently not supported and will lead to an error.
The p-value for each gene is computed by assuming that the variance estimates are normally distributed around the trend, and that the standard deviation of the variance distribution is proportional to the value of the trend. This is used to construct a one-sided test for each gene based on its bio, under the null hypothesis that the biological component is equal to zero. The proportionality constant for the standard deviation is set to the std.dev returned by fitTrendVar. This is estimated from the spread of per-gene variance estimates around the trend, so the null hypothesis effectively becomes “is this gene more variable than other genes of the same abundance?”
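A minimal numeric sketch of this one-sided test, assuming total and tech hold per-gene total and technical variances and sd.prop plays the role of the std.dev reported by fitTrendVar (all values below are hypothetical):

```r
# One-sided z-test of bio > 0: the variance estimate is assumed normal
# around the trend, with standard deviation proportional to the trend.
total <- c(2.0, 1.1, 0.9)   # hypothetical per-gene total variances
tech <- c(1.0, 1.0, 1.0)    # trend values at each gene's mean abundance
sd.prop <- 0.2              # proportionality constant (std.dev)

bio <- total - tech
p <- pnorm(bio / (sd.prop * tech), lower.tail = FALSE)
p
```

Genes with large positive residuals from the trend obtain small p-values, while genes at or below the trend do not.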
Aaron Lun
fitTrendVar, for the trend fitting options.

modelGeneVarWithSpikes, for modelling variance with spike-in controls.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Fitting to all features.
allf <- modelGeneVar(sce)
allf
plot(allf$mean, allf$total)
curve(metadata(allf)$trend(x), add=TRUE, col="dodgerblue")

# Using a subset of features for fitting.
subf <- modelGeneVar(sce, subset.fit=1:100)
subf
plot(subf$mean, subf$total)
curve(metadata(subf)$trend(x), add=TRUE, col="dodgerblue")
points(metadata(subf)$mean, metadata(subf)$var, col="red", pch=16)

# With blocking.
block <- sample(LETTERS[1:2], ncol(sce), replace=TRUE)
blk <- modelGeneVar(sce, block=block)
blk

par(mfrow=c(1,2))
for (i in colnames(blk$per.block)) {
    current <- blk$per.block[[i]]
    plot(current$mean, current$total)
    curve(metadata(current)$trend(x), add=TRUE, col="dodgerblue")
}
Model the variance of the log-expression profiles for each gene, decomposing it into technical and biological components based on a mean-variance trend corresponding to Poisson noise.
modelGeneVarByPoisson(x, ...)

## S4 method for signature 'ANY'
modelGeneVarByPoisson(
    x,
    size.factors = NULL,
    block = NULL,
    design = NULL,
    subset.row = NULL,
    npts = 1000,
    dispersion = 0,
    pseudo.count = 1,
    ...,
    equiweight = TRUE,
    method = "fisher",
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
modelGeneVarByPoisson(x, ..., assay.type = "counts")

## S4 method for signature 'SingleCellExperiment'
modelGeneVarByPoisson(x, size.factors = sizeFactors(x), ...)
x |
A numeric matrix of counts where rows are (usually endogenous) genes and columns are cells. Alternatively, a SummarizedExperiment or SingleCellExperiment containing such a matrix. |
... |
For the generic, further arguments to pass to each method. For the ANY method, further arguments to pass to the trend fitting function. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
size.factors |
A numeric vector of size factors for each cell in x. |
block |
A factor specifying the blocking levels for each cell in x. |
design |
A numeric matrix containing blocking terms for uninteresting factors of variation. |
subset.row |
See |
npts |
An integer scalar specifying the number of interpolation points to use. |
dispersion |
A numeric scalar specifying the dispersion for the NB distribution. If zero, a Poisson distribution is used. |
pseudo.count |
Numeric scalar specifying the pseudo-count to add prior to log-transformation. |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight. Otherwise, each block is weighted according to its number of cells. Only used if block is specified. |
method |
String specifying how p-values should be combined when block is specified. |
BPPARAM |
A BiocParallelParam object indicating whether parallelization should be performed across genes. |
assay.type |
String or integer scalar specifying the assay containing the counts. |
For each gene, we compute the variance and mean of the log-expression values.
A trend is fitted to the variance against the mean for simulated Poisson counts as described in fitTrendPoisson. The technical component for each gene is defined as the value of the trend at that gene's mean abundance. The biological component is then defined as the residual from the trend.

This function is similar to modelGeneVarWithSpikes, with the only difference being that the trend is fitted on simulated Poisson count-derived variances rather than spike-ins.
The assumption is that the technical component is Poisson-distributed, or at least negative binomial-distributed with a known constant dispersion.
This is useful for UMI count data sets that do not have spike-ins and are too heterogeneous to assume that most genes exhibit negligible biological variability.
If no size factors are supplied, they are automatically computed depending on the input type:

If size.factors=NULL for the ANY method, the sum of counts for each cell in x is used to compute a size factor via the librarySizeFactors function.

If size.factors=NULL for the SingleCellExperiment method, sizeFactors(x) is used if available. Otherwise, it defaults to library size-derived size factors.

If size.factors are supplied, they will override any size factors present in x.
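For reference, library size-derived factors amount to the per-cell sums scaled to a mean of one. This is a sketch of what scuttle's librarySizeFactors computes on a plain count matrix, ignoring its subsetting options:

```r
# Library size factors: per-cell sums scaled to a mean of 1.
set.seed(42)
counts <- matrix(rpois(50, lambda=5), nrow=10)  # toy count matrix
lib.sizes <- colSums(counts)
size.factors <- lib.sizes / mean(lib.sizes)
mean(size.factors)  # centred at 1 by construction
```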
A DataFrame is returned where each row corresponds to a gene in x (or in subset.row, if specified). This contains the numeric fields:

mean: Mean normalized log-expression per gene.

total: Variance of the normalized log-expression per gene.

bio: Biological component of the variance.

tech: Technical component of the variance.

p.value, FDR: Raw and adjusted p-values for the test against the null hypothesis that bio<=0.
If block is not specified, the metadata of the DataFrame contains the output of running fitTrendVar on the simulated counts, along with the mean and var used to fit the trend.

If block is specified, the output contains another per.block field. This field is itself a DataFrame of DataFrames, where each internal DataFrame contains statistics for the variance modelling within each block and has the same format as described above. Each internal DataFrame's metadata contains the output of fitTrendVar for the cells of that block.
The p-value for each gene is computed by assuming that the variance estimates are normally distributed around the trend, and that the standard deviation of the variance distribution is proportional to the value of the trend. This is used to construct a one-sided test for each gene based on its bio, under the null hypothesis that the biological component is equal to zero. The proportionality constant for the standard deviation is set to the std.dev returned by fitTrendVar. This is estimated from the spread of variance estimates for the simulated Poisson-distributed counts, so the null hypothesis effectively becomes “is this gene more variable than a hypothetical gene with only Poisson noise?”
Setting block will estimate the mean and variance of each gene for cells in each level of block separately. The trend is fitted separately for each level, and the variance decomposition is also performed separately. Per-level statistics are then combined to obtain a single value per gene:

For means and variance components, this is done by averaging values across levels. If equiweight=FALSE, a weighted average is used where the value for each level is weighted by the number of cells. By default, all levels are equally weighted when combining statistics.

Per-level p-values are combined using combineParallelPValues according to method. By default, Fisher's method is used to identify genes that are highly variable in any batch. Whether or not this is responsive to equiweight depends on the chosen method.

Blocks with fewer than 2 cells are completely ignored and do not contribute to the combined mean, variance component or p-value.
Use of block is the recommended approach for accounting for any uninteresting categorical factor of variation. In addition to accounting for systematic differences in expression between levels of the blocking factor, it also accommodates differences in the mean-variance relationships.

Alternatively, uninteresting factors can be used to construct a design matrix to pass to the function via design. In this case, a linear model is fitted to the expression profile for each gene and the residual variance is calculated. This approach is useful for covariates or additive models that cannot be expressed as a one-way layout for use in block. However, it assumes that the error is normally distributed with equal variance for all observations of a given gene.

Use of block and design together is currently not supported and will lead to an error.
Aaron Lun
fitTrendVar, for the trend fitting options.

modelGeneVar, for modelling variance without spike-in controls.
library(scuttle)
sce <- mockSCE()

# Fitting a trend to simulated Poisson noise.
pois <- modelGeneVarByPoisson(sce)
pois

plot(pois$mean, pois$total, ylim=c(0, 10))
points(metadata(pois)$mean, metadata(pois)$var, col="red", pch=16)
curve(metadata(pois)$trend(x), add=TRUE, col="dodgerblue")

# With blocking.
block <- sample(LETTERS[1:2], ncol(sce), replace=TRUE)
blk <- modelGeneVarByPoisson(sce, block=block)
blk

par(mfrow=c(1,2))
for (i in colnames(blk$per.block)) {
    current <- blk$per.block[[i]]
    plot(current$mean, current$total, ylim=c(0, 10))
    points(metadata(current)$mean, metadata(current)$var, col="red", pch=16)
    curve(metadata(current)$trend(x), add=TRUE, col="dodgerblue")
}
Model the variance of the log-expression profiles for each gene, decomposing it into technical and biological components based on a mean-variance trend fitted to spike-in transcripts.
modelGeneVarWithSpikes(x, ...)

## S4 method for signature 'ANY'
modelGeneVarWithSpikes(
    x,
    spikes,
    size.factors = NULL,
    spike.size.factors = NULL,
    block = NULL,
    design = NULL,
    subset.row = NULL,
    pseudo.count = 1,
    ...,
    equiweight = TRUE,
    method = "fisher",
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
modelGeneVarWithSpikes(x, ..., assay.type = "counts")

## S4 method for signature 'SingleCellExperiment'
modelGeneVarWithSpikes(
    x,
    spikes,
    size.factors = NULL,
    spike.size.factors = NULL,
    ...,
    assay.type = "counts"
)
x |
A numeric matrix of counts where rows are (usually endogenous) genes and columns are cells. Alternatively, a SummarizedExperiment or SingleCellExperiment containing such a matrix. |
... |
For the generic, further arguments to pass to each method. For the ANY method, further arguments to pass to the trend fitting function. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
spikes |
A numeric matrix of counts where each row corresponds to a spike-in transcript.
This should have the same number of columns as x. Alternatively, for the SingleCellExperiment method, this can be a string or an integer scalar specifying the alternative experiment of x that contains the spike-in counts. |
size.factors |
A numeric vector of size factors for each cell in x. |
spike.size.factors |
A numeric vector of size factors for each cell in |
block |
A factor specifying the blocking levels for each cell in x. |
design |
A numeric matrix containing blocking terms for uninteresting factors of variation. |
subset.row |
See |
pseudo.count |
Numeric scalar specifying the pseudo-count to add prior to log-transformation. |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight. Otherwise, each block is weighted according to its number of cells. Only used if block is specified. |
method |
String specifying how p-values should be combined when block is specified. |
BPPARAM |
A BiocParallelParam object indicating whether parallelization should be performed across genes. |
assay.type |
String or integer scalar specifying the assay containing the counts. For the SingleCellExperiment method, this is used to retrieve both the endogenous and spike-in counts. |
For each gene and spike-in transcript, we compute the variance and mean of the log-expression values. A trend is fitted to the variance against the mean for spike-in transcripts using fitTrendVar. The technical component for each gene is defined as the value of the trend at that gene's mean abundance. The biological component is then defined as the residual from the trend.

This function is almost the same as modelGeneVar, with the only difference being that the trend is fitted on spike-in variances rather than using the means and variances of endogenous genes. It assumes that a constant amount of spike-in RNA was added to each cell, such that any differences in observed expression of the spike-in transcripts can be wholly attributed to technical noise; and that endogenous genes and spike-in transcripts follow the same mean-variance relationship.

Unlike modelGeneVar, modelGeneVarWithSpikes starts from a count matrix (for both genes and spike-ins) and requires size factors and a pseudo-count specification to compute the log-expression values. This is because there are certain requirements for how normalization is performed when comparing spike-in transcripts with endogenous genes - see comments in “Explanation for size factor rescaling”. We enforce this by centering the size factors for both sets of features and recomputing the log-expression values prior to computing means and variances.
A DataFrame is returned where each row corresponds to a gene in x (or in subset.row, if specified). This contains the numeric fields:

mean: Mean normalized log-expression per gene.

total: Variance of the normalized log-expression per gene.

bio: Biological component of the variance.

tech: Technical component of the variance.

p.value, FDR: Raw and adjusted p-values for the test against the null hypothesis that bio<=0.
If block is not specified, the metadata of the DataFrame contains the output of running fitTrendVar on the spike-in transcripts, along with the mean and var used to fit the trend.

If block is specified, the output contains another per.block field. This field is itself a DataFrame of DataFrames, where each internal DataFrame contains statistics for the variance modelling within each block and has the same format as described above. Each internal DataFrame's metadata contains the output of fitTrendVar for the cells of that block.
If no size factors are supplied, they are automatically computed depending on the input type:

If size.factors=NULL for the ANY method, the sum of counts for each cell in x is used to compute a size factor via the librarySizeFactors function.

If spike.size.factors=NULL for the ANY method, the sum of counts for each cell in spikes is used to compute a size factor via the librarySizeFactors function.

If size.factors=NULL for the SingleCellExperiment method, sizeFactors(x) is used if available. Otherwise, it defaults to library size-derived size factors.

If spike.size.factors=NULL for the SingleCellExperiment method and spikes is not a matrix, sizeFactors(altExp(x, spikes)) is used if available. Otherwise, it defaults to library size-derived size factors.

If size.factors or spike.size.factors are supplied, they will override any size factors present in x.
The use of a spike-in-derived trend makes several assumptions. The first is that a constant amount of spike-in RNA was added to each cell, such that any differences in observed expression of the spike-in transcripts can be wholly attributed to technical noise. The second is that endogenous genes and spike-in transcripts follow the same mean-variance relationship, i.e., a spike-in transcript captures the technical noise of an endogenous gene at the same mean count.
Here, the spike-in size factors across all cells are scaled so that their mean is equal to that of the gene-based size factors for the same set of cells. This ensures that the average normalized abundances of the spike-in transcripts are comparable to those of the endogenous genes, allowing the trend fitted to the former to be used to determine the biological component of the latter. Otherwise, differences in scaling of the size factors would shift the normalized expression values of the former away from the latter, violating the second assumption.
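The rescaling described above can be sketched as follows; this is illustrative code with hypothetical values, not the internal implementation:

```r
# Centre the gene-based size factors, then scale the spike-in size factors
# so that both sets have the same mean; normalized spike-in abundances are
# then directly comparable to those of the endogenous genes.
size.factors <- c(0.5, 1.0, 1.5, 2.0)        # hypothetical gene-based factors
spike.size.factors <- c(2.0, 4.0, 6.0, 8.0)  # hypothetical spike-in factors

size.factors <- size.factors / mean(size.factors)
spike.size.factors <- spike.size.factors / mean(spike.size.factors)
c(mean(size.factors), mean(spike.size.factors))  # both centred at 1
```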
If block is specified, rescaling is performed separately for all cells in each block. This aims to avoid problems from (frequent) violations of the first assumption where there are differences in the quantity of spike-in RNA added to each batch. Without scaling, these differences would lead to systematic shifts in the spike-in abundances from the endogenous genes when fitting a batch-specific trend (even if there is no global difference in scaling across all batches).
The p-value for each gene is computed by assuming that the variance estimates are normally distributed around the trend, and that the standard deviation of the variance distribution is proportional to the value of the trend. This is used to construct a one-sided test for each gene based on its bio, under the null hypothesis that the biological component is equal to zero. The proportionality constant for the standard deviation is set to the std.dev returned by fitTrendVar. This is estimated from the spread of variance estimates for spike-in transcripts, so the null hypothesis effectively becomes “is this gene more variable than spike-in transcripts of the same abundance?”
Setting block will estimate the mean and variance of each gene for cells in each level of block separately. The trend is fitted separately for each level, and the variance decomposition is also performed separately. Per-level statistics are then combined to obtain a single value per gene:

For means and variance components, this is done by averaging values across levels. If equiweight=FALSE, a weighted average is used where the value for each level is weighted by the number of cells. By default, all levels are equally weighted when combining statistics.

Per-level p-values are combined using combineParallelPValues according to method. By default, Fisher's method is used to identify genes that are highly variable in any batch. Whether or not this is responsive to equiweight depends on the chosen method.

Blocks with fewer than 2 cells are completely ignored and do not contribute to the combined mean, variance component or p-value.
Use of block is the recommended approach for accounting for any uninteresting categorical factor of variation. In addition to accounting for systematic differences in expression between levels of the blocking factor, it also accommodates differences in the mean-variance relationships.

Alternatively, uninteresting factors can be used to construct a design matrix to pass to the function via design. In this case, a linear model is fitted to the expression profile for each gene and the residual variance is calculated. This approach is useful for covariates or additive models that cannot be expressed as a one-way layout for use in block. However, it assumes that the error is normally distributed with equal variance for all observations of a given gene.

Use of block and design together is currently not supported and will lead to an error.
Aaron Lun
fitTrendVar, for the trend fitting options.

modelGeneVar, for modelling variance without spike-in controls.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Using spike-ins.
spk <- modelGeneVarWithSpikes(sce, "Spikes")
spk

plot(spk$mean, spk$total, log="xy")
points(metadata(spk)$mean, metadata(spk)$var, col="red", pch=16)
curve(metadata(spk)$trend(x), add=TRUE, col="dodgerblue")

# With blocking (and spike-ins).
block <- sample(LETTERS[1:2], ncol(sce), replace=TRUE)
blk <- modelGeneVarWithSpikes(sce, "Spikes", block=block)
blk

par(mfrow=c(1,2))
for (i in colnames(blk$per.block)) {
    current <- blk$per.block[[i]]
    plot(current$mean, current$total)
    points(metadata(current)$mean, metadata(current)$var, col="red", pch=16)
    curve(metadata(current)$trend(x), add=TRUE, col="dodgerblue")
}
Combine multiple sets of marker statistics, typically from different tests, into a single DataFrame for convenient inspection.
multiMarkerStats(..., repeated = NULL, sorted = TRUE)
... |
Two or more lists or Lists produced by findMarkers or combineMarkers. The names of each List should be the same; the universe of genes in each DataFrame should be the same;
and the same number of columns in each DataFrame should be named.
All elements in |
repeated |
Character vector of columns that are present in one or more DataFrames but should only be reported once. Typically used to avoid reporting redundant copies of annotation-related columns. |
sorted |
Logical scalar indicating whether each output DataFrame should be sorted by some relevant statistic. |
The combined statistics are designed to favor a gene that is highly ranked in each of the individual test results. This is highly conservative and aims to identify robust DE that is significant under all testing schemes.
A combined Top value of T indicates that the gene is among the top T genes of one or more pairwise comparisons in each of the testing schemes. (We can be even more aggressive if the individual results were generated with a larger min.prop value.) In effect, a gene can only achieve a low Top value if it is consistently highly ranked in each test. If sorted=TRUE, this is used to order the genes in the output DataFrame.

The combined p.value is effectively the result of applying an intersection-union test to the per-test results. This will only be low if the gene has a low p-value in each of the test results. If sorted=TRUE and Top is not present, this will be used to order the genes in the output DataFrame.
A named List of DataFrames with one DataFrame per group/cluster. Each DataFrame contains statistics from the corresponding entry of each List in ..., prefixed with the name of the List. In addition, several combined statistics are reported:

Top, the largest rank of each gene across all DataFrames for that group. This is only reported if each list in ... was generated with pval.type="any" in combineMarkers.

p.value, the largest p-value of each gene across all DataFrames for that group. This is replaced by log.p.value if p-values in ... are log-transformed.

FDR, the BH-adjusted value of p.value. This is replaced by log.FDR if p-values in ... are log-transformed.
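The combined Top, p.value and FDR statistics above can be sketched in a few lines of base R. The inputs below (top1, top2, p1, p2) are hypothetical per-gene statistics from two tests, not part of the package API:

```r
# Illustrative sketch of the combined statistics for one group, given
# Top ranks and p-values from two marker tests on the same genes.
top1 <- c(1, 5, 10); top2 <- c(2, 3, 20)
p1 <- c(0.001, 0.01, 0.5); p2 <- c(0.002, 0.2, 0.3)

combined.top <- pmax(top1, top2)   # largest rank across tests
combined.p <- pmax(p1, p2)         # intersection-union test: largest p-value
combined.fdr <- p.adjust(combined.p, method="BH")
combined.top
combined.p
```

A gene must be highly ranked (and significant) in every test to obtain a small combined Top or p.value, which is the conservative behaviour described above.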
Aaron Lun
findMarkers and combineMarkers, to generate elements in ....
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay, only using k-means for convenience.
kout <- kmeans(t(logcounts(sce)), centers=4)

tout <- findMarkers(sce, groups=kout$cluster, direction="up")
wout <- findMarkers(sce, groups=kout$cluster, direction="up", test="wilcox")

combined <- multiMarkerStats(t=tout, wilcox=wout)
colnames(combined[[1]])
Perform pairwise binomial tests between groups of cells, possibly after blocking on uninteresting factors of variation.
pairwiseBinom(x, ...)

## S4 method for signature 'ANY'
pairwiseBinom(
    x,
    groups,
    block = NULL,
    restrict = NULL,
    exclude = NULL,
    direction = c("any", "up", "down"),
    threshold = 1e-08,
    lfc = 0,
    log.p = FALSE,
    gene.names = NULL,
    subset.row = NULL,
    BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
pairwiseBinom(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
pairwiseBinom(x, groups = colLabels(x, onAbsence = "error"), ...)
x |
A numeric matrix-like object of counts, where each column corresponds to a cell and each row corresponds to a gene. |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
groups |
A vector of length equal to ncol(x), specifying the group to which each cell is assigned. |
block |
A factor specifying the blocking level for each cell. |
restrict |
A vector specifying the levels of groups for which to perform the comparisons. |
exclude |
A vector specifying the levels of groups to be excluded from the comparisons. |
direction |
A string specifying the direction of effects to be considered for the alternative hypothesis. |
threshold |
Numeric scalar specifying the value below which a gene is presumed to be not expressed. |
lfc |
Numeric scalar specifying the minimum absolute log-ratio in the proportion of expressing genes between groups. |
log.p |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
gene.names |
Deprecated, use |
subset.row |
See |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
assay.type |
A string specifying which assay values to use, usually "logcounts". |
This function performs exact binomial tests to identify marker genes between pairs of groups of cells.
Here, the null hypothesis is that the proportion of cells expressing a gene is the same between groups.
A list of tables is returned where each table contains the statistics for all genes for a comparison between each pair of groups.
This can be examined directly or used as input to combineMarkers for marker gene detection.
Effect sizes for each comparison are reported as log2-fold changes in the proportion of expressing cells in one group over the proportion in another group. We add a pseudo-count that squeezes the log-FCs towards zero to avoid undefined values when one proportion is zero. This is closely related to but somewhat more interpretable than the log-odds ratio, which would otherwise be the more natural statistic for a proportion-based test.
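As a loose numerical illustration of this effect size (hypothetical values, and the exact pseudo-count used internally by pairwiseBinom may differ from the one shown here):

```r
# Shrunken log2 fold change in the proportion of expressing cells;
# the pseudo-count keeps the ratio finite when one proportion is zero.
n1 <- 100; x1 <- 30   # cells and expressing cells in the first group
n2 <- 120; x2 <- 0    # no expressing cells in the second group
p1 <- (x1 + 0.5) / (n1 + 1)
p2 <- (x2 + 0.5) / (n2 + 1)
log2(p1 / p2)  # finite despite the raw proportion in group 2 being zero
```

Without the pseudo-count, the log-ratio would be undefined (infinite) whenever one group has no expressing cells.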
If restrict is specified, comparisons are only performed between pairs of groups in restrict. This can be used to focus on DEGs distinguishing between a subset of the groups (e.g., closely related cell subtypes). Similarly, if any entries of groups are NA, the corresponding cells are ignored.
x can be a count matrix or any transformed counterpart where zeroes remain zero and non-zeroes remain non-zero. This is true of any scaling normalization and monotonic transformation like the log-transform. If the transformation breaks this rule, some adjustment of threshold is necessary.
A consequence of the transformation-agnostic behaviour of this function is that it will not respond to normalization. Differences in library size will not be considered by this function. However, this is not necessarily problematic for marker gene detection - users can treat this as retaining information about the total RNA content, analogous to spike-in normalization.
A list is returned containing statistics and pairs.

The statistics element is itself a list of DataFrames. Each DataFrame contains the statistics for a comparison between a pair of groups, including the overlap proportions, p-values and false discovery rates.

The pairs element is a DataFrame with one row corresponding to each entry of statistics. This contains the fields first and second, specifying the two groups under comparison in the corresponding DataFrame in statistics.

In each DataFrame in statistics, the log-fold change represents the log-ratio of the proportion of expressing cells in the first group compared to the expressing proportion in the second group.
If direction="any"
, two-sided binomial tests will be performed for each pairwise comparison between groups of cells.
For other direction
, one-sided tests in the specified direction will be used instead.
This can be used to focus on genes that are upregulated in each group of interest, which is often easier to interpret.
In practice, the two-sided test is approximated by combining two one-sided tests using a Bonferroni correction.
This is done for various logistical purposes;
it is also the only way to combine p-values across blocks in a direction-aware manner.
As a result, the two-sided p-value reported here will not be the same as that from binom.test
.
In practice, they are usually similar enough that this is not a major concern.
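The Bonferroni-style combination of two one-sided tests can be sketched with base R's binom.test, using hypothetical counts:

```r
# Sketch of combining two one-sided binomial tests into an approximate
# two-sided p-value via a Bonferroni correction, as described above.
x <- 20; n <- 50; p0 <- 0.5
p.up <- binom.test(x, n, p=p0, alternative="greater")$p.value
p.down <- binom.test(x, n, p=p0, alternative="less")$p.value
p.approx <- min(1, 2 * min(p.up, p.down))

# Compare to the two-sided p-value from binom.test itself; for a
# symmetric null like p0=0.5, the two coincide.
p.exact <- binom.test(x, n, p=p0, alternative="two.sided")$p.value
c(approx=p.approx, exact=p.exact)
```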
To interpret the setting of direction
, consider the DataFrame for group X, in which we are comparing to another group Y.
If direction="up"
, genes will only be significant in this DataFrame if they are upregulated in group X compared to Y.
If direction="down"
, genes will only be significant if they are downregulated in group X compared to Y.
See ?binom.test
for more details on the interpretation of one-sided binomial tests.
The magnitude of the log-fold change in the proportion of expressing cells can also be tested by setting lfc
.
By default, lfc=0
, meaning that we will reject the null upon detecting any difference in proportions.
If this is set to some other positive value, the null hypothesis will change depending on direction
:
If direction="any"
, the null hypothesis is that the true log-fold change in proportions lies within [-lfc, lfc]
.
To be conservative, we perform one-sided tests against the boundaries of this interval, and combine them to obtain a two-sided p-value.
If direction="up"
, the null hypothesis is that the true log-fold change is less than lfc
.
A one-sided p-value is computed against the boundary of this interval.
If direction="down"
, the null hypothesis is that the true log-fold change is greater than -lfc
.
A one-sided p-value is computed against the boundary of this interval.
If block
is specified, binomial tests are performed between groups of cells within each level of block
.
For each pair of groups, the p-values for each gene across
all levels of block
are combined using Stouffer's weighted Z-score method.
The weight for the p-value in a particular level of block
is defined as Nx + Ny, where Nx and Ny are the number of cells in groups X and Y, respectively, for that level.
This means that p-values from blocks with more cells will have a greater contribution to the combined p-value for each gene.
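Stouffer's weighted Z-score method itself is straightforward to sketch in base R; the weights below are illustrative cell-count-based weights:

```r
# A minimal sketch of Stouffer's weighted Z-score method for combining
# one-sided p-values across blocking levels. The weights here are
# illustrative cell-count-based weights.
stouffer <- function(p, w) {
    z <- qnorm(p, lower.tail=FALSE)        # one-sided p-values to Z-scores
    z.comb <- sum(w * z) / sqrt(sum(w^2))  # weighted combination
    pnorm(z.comb, lower.tail=FALSE)        # back to a one-sided p-value
}

# Two blocks; the block with more cells dominates the combination:
p.per.block <- c(0.01, 0.2)
n.cells <- c(200, 20)
stouffer(p.per.block, n.cells)

# Equal weights recover the unweighted Stouffer method:
stouffer(c(0.05, 0.05), c(1, 1))
```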
When combining across batches, one-sided p-values in the same direction are combined first.
Then, if direction="any"
, the two combined p-values from both directions are combined.
This ensures that a gene only receives a low overall p-value if it changes in the same direction across batches.
When comparing two groups, blocking levels are ignored if no p-value was reported, e.g., if there were insufficient cells for a group in a particular level.
If all levels are ignored in this manner, the entire comparison will only contain NA
p-values and a warning will be emitted.
Aaron Lun
Whitlock MC (2005). Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18, 5:1368-73.
binom.test
and binomTest
, on which this function is based.
combineMarkers
, to combine pairwise comparisons into a single DataFrame per group.
getTopMarkers
, to obtain the top markers from each pairwise comparison.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=2)

# Vanilla application:
out <- pairwiseBinom(logcounts(sce), groups=kout$cluster)
out

# Directional and with a minimum log-fold change:
out <- pairwiseBinom(logcounts(sce), groups=kout$cluster, direction="up", lfc=1)
out
Perform pairwise Welch t-tests between groups of cells, possibly after blocking on uninteresting factors of variation.
pairwiseTTests(x, ...)

## S4 method for signature 'ANY'
pairwiseTTests(
  x,
  groups,
  block = NULL,
  design = NULL,
  restrict = NULL,
  exclude = NULL,
  direction = c("any", "up", "down"),
  lfc = 0,
  std.lfc = FALSE,
  log.p = FALSE,
  gene.names = NULL,
  subset.row = NULL,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
pairwiseTTests(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
pairwiseTTests(x, groups = colLabels(x, onAbsence = "error"), ...)
x |
A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
groups |
A vector of length equal to ncol(x), specifying the group assignment for each cell. |
block |
A factor specifying the blocking level for each cell. |
design |
A numeric matrix containing blocking terms for uninteresting factors.
Note that these factors should not be confounded with groups. |
restrict |
A vector specifying the levels of groups for which to perform pairwise comparisons. |
exclude |
A vector specifying the levels of groups to exclude from pairwise comparisons. |
direction |
A string specifying the direction of log-fold changes to be considered in the alternative hypothesis. |
lfc |
A positive numeric scalar specifying the log-fold change threshold to be tested against. |
std.lfc |
A logical scalar indicating whether log-fold changes should be standardized. |
log.p |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
gene.names |
Deprecated, use |
subset.row |
See |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
assay.type |
A string specifying which assay values to use, usually "logcounts". |
This function performs t-tests to identify differentially expressed genes (DEGs) between pairs of groups of cells.
A typical aim is to use the DEGs to determine cluster identity based on expression of marker genes with known biological activity.
A list of tables is returned where each table contains per-gene statistics for a comparison between one pair of groups.
This can be examined directly or used as input to combineMarkers
for marker gene detection.
We use t-tests as they are simple, fast and perform reasonably well for single-cell data (Soneson and Robinson, 2018). However, if one of the groups contains fewer than two cells, no p-value will be reported for comparisons involving that group. A warning will also be raised about insufficient degrees of freedom (d.f.) in such cases.
When log.p=TRUE
, the log-transformed p-values and FDRs are reported using the natural base.
This is useful in cases with many cells such that reporting the p-values directly would lead to double-precision underflow.
If restrict
is specified, comparisons are only performed between pairs of groups in restrict
.
This can be used to focus on DEGs distinguishing between a subset of the groups (e.g., closely related cell subtypes).
If exclude
is specified, comparisons are not performed between groups in exclude
.
Similarly, if any entries of groups
are NA
, the corresponding cells are ignored.
A list is returned containing statistics
and pairs
.
The statistics
element is itself a list of DataFrames.
Each DataFrame contains the statistics for a comparison between a pair of groups,
including the log-fold changes, p-values and false discovery rates.
The pairs
element is a DataFrame where each row corresponds to an entry of statistics
.
This contains the fields first
and second
,
specifying the two groups under comparison in the corresponding DataFrame in statistics
.
In each DataFrame in statistics
, the log-fold change represents the change in the first
group compared to the second
group.
Log-fold changes are reported as differences in the values of x
.
Thus, all log-fold changes have the same base as whatever was used to perform the log-transformation in x
.
If logNormCounts
was used, this would be base 2.
If direction="any"
, two-sided tests will be performed for each pairwise comparison between groups.
Otherwise, one-sided tests in the specified direction will be used instead.
This can be used to focus on genes that are upregulated in each group of interest, which is often easier to interpret when assigning cell type to a cluster.
To interpret the setting of direction
, consider the DataFrame for group X, in which we are comparing to another group Y.
If direction="up"
, genes will only be significant in this DataFrame if they are upregulated in group X compared to Y.
If direction="down"
, genes will only be significant if they are downregulated in group X compared to Y.
The magnitude of the log-fold changes can also be tested by setting lfc
.
By default, lfc=0
, meaning that we will reject the null upon detecting any differential expression.
If this is set to some other positive value, the null hypothesis will change depending on direction
:
If direction="any"
, the null hypothesis is that the true log-fold change lies inside [-lfc, lfc]
.
To be conservative, we perform one-sided tests against the boundaries of this interval, and combine them to obtain a two-sided p-value.
If direction="up"
, the null hypothesis is that the true log-fold change is less than or equal to lfc
.
A one-sided p-value is computed against the boundary of this interval.
If direction="down"
, the null hypothesis is that the true log-fold change is greater than or equal to -lfc
.
A one-sided p-value is computed against the boundary of this interval.
This is similar to the approach used in treat
and allows users to focus on genes with strong log-fold changes.
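The interval-null procedure can be sketched with base R's t.test on hypothetical data; this illustrates the logic rather than scran's exact implementation:

```r
# Sketch of testing against a log-fold change threshold via the interval
# null described above: one-sided Welch t-tests against each boundary of
# [-lfc, lfc], combined into a conservative two-sided p-value. The data
# here are hypothetical.
set.seed(1)
x <- rnorm(30, mean=3)  # "expression" in group X
y <- rnorm(30, mean=0)  # "expression" in group Y
lfc <- 1

p.up <- t.test(x, y, mu=lfc, alternative="greater")$p.value
p.down <- t.test(x, y, mu=-lfc, alternative="less")$p.value
p.any <- min(1, 2 * min(p.up, p.down))
p.any  # significant: the difference clearly exceeds the threshold

# A very stringent threshold is not met by the same data:
p.up2 <- t.test(x, y, mu=6, alternative="greater")$p.value
p.down2 <- t.test(x, y, mu=-6, alternative="less")$p.value
p.any2 <- min(1, 2 * min(p.up2, p.down2))
p.any2
```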
If std.lfc=TRUE
, the log-fold change for each gene is standardized by the variance.
When the Welch t-test is being used, this is equivalent to Cohen's d.
Standardized log-fold changes may be more appealing for visualization as it avoids large fold changes due to large variance.
The choice of std.lfc
does not affect the calculation of the p-values.
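A minimal base-R sketch of a standardized log-fold change, assuming the pooled-standard-deviation form of Cohen's d (scran's internal variance estimate may differ):

```r
# Sketch of a standardized log-fold change, assuming the pooled-standard-
# deviation form of Cohen's d; scran's internal variance estimate may differ.
set.seed(42)
x <- rnorm(40, mean=1, sd=2)
y <- rnorm(40, mean=0, sd=2)
d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
d

# Rescaling the data changes the raw difference but not the
# standardized effect size:
x10 <- x * 10; y10 <- y * 10
mean(x10) - mean(y10)                                      # 10x larger
(mean(x10) - mean(y10)) / sqrt((var(x10) + var(y10)) / 2)  # same as d
```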
If block
is specified, Welch t-tests are performed between groups within each level of block
.
For each pair of groups, the p-values for each gene across all levels of block
are combined using Stouffer's weighted Z-score method.
The reported log-fold change for each gene is also a weighted average of log-fold changes across levels.
The weight for a particular level is defined as Nx*Ny/(Nx + Ny), where Nx and Ny are the number of cells in groups X and Y, respectively, for that level.
This is inversely proportional to the expected variance of the log-fold change, provided that all groups and blocking levels have the same variance.
When comparing two groups, blocking levels are ignored if no p-value was reported, e.g., if there were insufficient cells for a group in a particular level.
This includes levels that contain fewer than two cells for either group, as this cannot yield a p-value from the Welch t-test.
If all levels are ignored in this manner, the entire comparison will only contain NA
p-values and a warning will be emitted.
If design
is specified, a linear model is instead fitted to the expression profile for each gene.
This linear model will include the groups
as well as any blocking factors in design
.
A t-test is then performed to identify DEGs between pairs of groups, using the values of the relevant coefficients and the gene-wise residual variance.
Note that design
must be full rank when combined with the groups
terms,
i.e., there should not be any confounding variables.
We make an exception for the common situation where design
contains an "(Intercept)"
column,
which is automatically detected and removed (emitting a warning along the way).
We recommend using block
instead of design
for uninteresting categorical factors of variation.
The former accommodates differences in the variance of expression in each group via Welch's t-test.
As a result, it is more robust to misspecification of the groups, as misspecified groups (and inflated variances) do not affect the inferences for other groups.
Use of block
also avoids assuming additivity of effects between the blocking factors and the groups.
Nonetheless, use of design
is unavoidable when blocking on real-valued covariates.
It is also useful for ensuring that log-fold changes/p-values are computed for comparisons between all pairs of groups
(assuming that design
is not confounded with the groups).
This may not be the case with block
if a pair of groups never co-occur in a single blocking level.
Aaron Lun
Whitlock MC (2005). Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18, 5:1368-73.
Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods
Lun ATL (2018). Comments on marker detection in scran. https://ltla.github.io/SingleCellThoughts/software/marker_detection/comments.html
t.test
, on which this function is based.
combineMarkers
, to combine pairwise comparisons into a single DataFrame per group.
getTopMarkers
, to obtain the top markers from each pairwise comparison.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=3)

# Vanilla application:
out <- pairwiseTTests(logcounts(sce), groups=kout$cluster)
out

# Directional with log-fold change threshold:
out <- pairwiseTTests(logcounts(sce), groups=kout$cluster, direction="up", lfc=0.2)
out
Perform pairwise Wilcoxon rank sum tests between groups of cells, possibly after blocking on uninteresting factors of variation.
pairwiseWilcox(x, ...)

## S4 method for signature 'ANY'
pairwiseWilcox(
  x,
  groups,
  block = NULL,
  restrict = NULL,
  exclude = NULL,
  direction = c("any", "up", "down"),
  lfc = 0,
  log.p = FALSE,
  gene.names = NULL,
  subset.row = NULL,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
pairwiseWilcox(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
pairwiseWilcox(x, groups = colLabels(x, onAbsence = "error"), ...)
x |
A numeric matrix-like object of normalized (and possibly log-transformed) expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
groups |
A vector of length equal to ncol(x), specifying the group assignment for each cell. |
block |
A factor specifying the blocking level for each cell. |
restrict |
A vector specifying the levels of groups for which to perform pairwise comparisons. |
exclude |
A vector specifying the levels of groups to exclude from pairwise comparisons. |
direction |
A string specifying the direction of differences to be considered in the alternative hypothesis. |
lfc |
Numeric scalar specifying the minimum log-fold change for one observation to be considered to be “greater” than another. |
log.p |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
gene.names |
Deprecated, use |
subset.row |
See |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
assay.type |
A string specifying which assay values to use, usually "logcounts". |
This function performs Wilcoxon rank sum tests to identify differentially expressed genes (DEGs) between pairs of groups of cells.
A list of tables is returned where each table contains the statistics for all genes for a comparison between each pair of groups.
This can be examined directly or used as input to combineMarkers
for marker gene detection.
The effect size for each gene in each comparison is reported as the area under the curve (AUC). Consider the distribution of expression values for gene X within each of two groups A and B. The AUC is the probability that a randomly selected cell in A has a greater expression of X than a randomly selected cell in B. (Ties are assumed to be randomly broken.) Concordance probabilities near 0 indicate that most observations in A are lower than most observations in B; conversely, probabilities near 1 indicate that most observations in A are higher than those in B. The Wilcoxon rank sum test effectively tests for significant deviations from an AUC of 0.5.
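The equivalence between the AUC and the scaled Wilcoxon U-statistic can be checked directly in base R, using hypothetical data:

```r
# Checking that the AUC equals the Wilcoxon U-statistic scaled by the
# number of pairs, using hypothetical data.
set.seed(100)
A <- rnorm(25, mean=1)  # "expression" in group A
B <- rnorm(30, mean=0)  # "expression" in group B

U <- wilcox.test(A, B)$statistic
auc <- unname(U / (length(A) * length(B)))

# Direct computation as a pairwise probability (ties count as 1/2):
direct <- mean(outer(A, B, ">") + 0.5 * outer(A, B, "=="))
c(auc=auc, direct=direct)
```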
Wilcoxon rank sum tests are more robust to outliers and insensitive to non-normality, in contrast to t-tests in pairwiseTTests
.
However, they take longer to run, the effect sizes are less interpretable, and there are more subtle violations of its assumptions in real data.
For example, the i.i.d. assumptions are unlikely to hold after scaling normalization due to differences in variance.
Also note that we approximate the distribution of the Wilcoxon rank sum statistic to deal with large numbers of cells and ties.
If restrict
is specified, comparisons are only performed between pairs of groups in restrict
.
This can be used to focus on DEGs distinguishing between a subset of the groups (e.g., closely related cell subtypes).
If exclude
is specified, comparisons are not performed between groups in exclude
.
Similarly, if any entries of groups
are NA
, the corresponding cells are ignored.
A list is returned containing statistics
and pairs
.
The statistics
element is itself a list of DataFrames.
Each DataFrame contains the statistics for a comparison between a pair of groups,
including the AUCs, p-values and false discovery rates.
The pairs
element is a DataFrame with one row corresponding to each entry of statistics
.
This contains the fields first
and second
,
specifying the two groups under comparison in the corresponding DataFrame in statistics
.
In each DataFrame in statistics
, the AUC represents the probability of sampling a value in the first
group greater than a random value from the second
group.
If direction="any"
, two-sided Wilcoxon rank sum tests will be performed for each pairwise comparison between groups of cells.
Otherwise, one-sided tests in the specified direction will be used instead.
This can be used to focus on genes that are upregulated in each group of interest, which is often easier to interpret.
To interpret the setting of direction
, consider the DataFrame for group X, in which we are comparing to another group Y.
If direction="up"
, genes will only be significant in this DataFrame if they are upregulated in group X compared to Y.
If direction="down"
, genes will only be significant if they are downregulated in group X compared to Y.
See ?wilcox.test
for more details on the interpretation of one-sided Wilcoxon rank sum tests.
Users can also specify a log-fold change threshold in lfc
to focus on genes that exhibit large shifts in location.
This is equivalent to specifying the mu
parameter in wilcox.test
with some additional subtleties depending on direction
:
If direction="any"
, the null hypothesis is that the true shift lies in [-lfc, lfc]
.
Two one-sided p-values are computed against the boundaries of this interval by shifting X's expression values to either side by lfc
,
and these are combined to obtain a (conservative) two-sided p-value.
If direction="up"
, the null hypothesis is that the true shift is less than or equal to lfc
.
A one-sided p-value is computed against the boundary of this interval.
If direction="down"
, the null hypothesis is that the true shift is greater than or equal to -lfc
.
A one-sided p-value is computed against the boundary of this interval.
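A base-R sketch of the shifted one-sided tests using the mu argument of wilcox.test, on hypothetical data:

```r
# Sketch of the location-shift threshold using the mu argument of
# wilcox.test(), with hypothetical data and direction="any" logic.
set.seed(200)
x <- rnorm(40, mean=2)
y <- rnorm(40, mean=0)
lfc <- 0.5

# One-sided tests against each boundary of [-lfc, lfc]:
p.up <- wilcox.test(x, y, mu=lfc, alternative="greater")$p.value
p.down <- wilcox.test(x, y, mu=-lfc, alternative="less")$p.value

# Conservative two-sided combination:
p.any <- min(1, 2 * min(p.up, p.down))
p.any
```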
The AUC is conveniently derived from the U-statistic used in the test, which ensures that it is always consistent with the reported p-value.
An interesting side-effect is that the reported AUC is dependent on both the specified lfc
and direction
.
If direction="up"
, the AUC is computed after shifting X's expression values to the left by the specified lfc
.
An AUC above 0.5 means that X's values are “greater” than Y's, even after shifting down the former by lfc
.
This is helpful as a large AUC tells us that X and Y are well-separated by at least lfc
.
However, an AUC below 0.5 cannot be interpreted as “X is lower than Y”, only that “X - lfc is lower than Y”.
If direction="down"
, the AUC is computed after shifting X's expression values to the right by the specified lfc
.
An AUC below 0.5 means that X's values are “lower” than Y's, even after shifting up the former by lfc
.
This is helpful as a low AUC tells us that X and Y are well-separated by at least lfc
.
However, an AUC above 0.5 cannot be interpreted as “X is greater than Y”, only that “X + lfc is greater than Y”.
If direction="any"
, the AUC is computed by averaging the AUCs obtained in each of the two one-sided tests, i.e., after shifting in each direction.
This considers an observation of Y to be tied with an observation of X if their absolute difference is less than lfc
.
(Technically, the test procedure should also consider these to be ties to be fully consistent, but we have not done so for simplicity.)
The AUC can be interpreted as it would be for lfc=0
, i.e., above 0.5 means that X is greater than Y and below 0.5 means that X is less than Y.
If block
is specified, Wilcoxon tests are performed between groups of cells within each level of block
.
For each pair of groups, the p-values for each gene across all levels of block
are combined using Stouffer's Z-score method.
The reported AUC is also a weighted average of the AUCs across all levels.
The weight for a particular level of block
is defined as Nx*Ny, where Nx and Ny are the number of cells in groups X and Y, respectively, for that level.
This means that p-values from blocks with more cells will have a greater contribution to the combined p-value for each gene.
When combining across batches, one-sided p-values in the same direction are combined first.
Then, if direction="any"
, the two combined p-values from both directions are combined.
This ensures that a gene only receives a low overall p-value if it changes in the same direction across batches.
When comparing two groups, blocking levels are ignored if no p-value was reported, e.g., if there were insufficient cells for a group in a particular level.
If all levels are ignored in this manner, the entire comparison will only contain NA
p-values and a warning will be emitted.
Aaron Lun
Whitlock MC (2005). Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18, 5:1368-73.
Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods
wilcox.test
, on which this function is based.
combineMarkers
, to combine pairwise comparisons into a single DataFrame per group.
getTopMarkers
, to obtain the top markers from each pairwise comparison.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=2)

# Vanilla application:
out <- pairwiseWilcox(logcounts(sce), groups=kout$cluster)
out

# Directional and with a minimum log-fold change:
out <- pairwiseWilcox(logcounts(sce), groups=kout$cluster, direction="up", lfc=0.2)
out
A wrapper function around edgeR's quasi-likelihood methods to conveniently perform differential expression analyses on pseudo-bulk profiles, allowing detection of cell type-specific changes between conditions in replicated studies.
pseudoBulkDGE(x, ...)

## S4 method for signature 'ANY'
pseudoBulkDGE(
  x,
  col.data,
  label,
  design,
  coef,
  contrast = NULL,
  condition = NULL,
  lfc = 0,
  include.intermediates = TRUE,
  row.data = NULL,
  sorted = FALSE,
  method = c("edgeR", "voom"),
  qualities = TRUE,
  robust = TRUE,
  sample = NULL
)

## S4 method for signature 'SummarizedExperiment'
pseudoBulkDGE(x, col.data = colData(x), ..., assay.type = 1)
x |
A numeric matrix of counts where rows are genes and columns are pseudo-bulk profiles. Alternatively, a SummarizedExperiment object containing such a matrix in its assays. |
... |
For the generic, additional arguments to pass to individual methods. For the SummarizedExperiment method, additional arguments to pass to the ANY method. |
col.data |
A data.frame or DataFrame containing metadata for each column of x. |
label |
A vector or factor of length equal to ncol(x), specifying the cluster or group assignment for each column of x. |
design |
A formula to be used to construct a design matrix from variables in col.data. |
coef |
String or character vector containing the coefficients to drop from the design matrix to form the null hypothesis. Can also be an integer scalar or vector specifying the indices of the relevant columns. |
contrast |
Numeric vector or matrix containing the contrast of interest.
Alternatively, a character vector to be passed to |
condition |
A vector or factor of length equal to ncol(x), specifying the experimental condition for each column of x. |
lfc |
Numeric scalar specifying the log-fold change threshold to use in |
include.intermediates |
Logical scalar indicating whether the intermediate edgeR objects should be returned. |
row.data |
A DataFrame containing additional row metadata for each gene in x. |
sorted |
Logical scalar indicating whether the output tables should be sorted by p-value. |
method |
String specifying the DE analysis framework to use. |
qualities |
Logical scalar indicating whether quality weighting should be used when method="voom". |
robust |
Logical scalar indicating whether robust empirical Bayes shrinkage should be performed. |
sample |
Deprecated. |
assay.type |
String or integer scalar specifying the assay to use from x. |
In replicated multi-condition scRNA-seq experiments, we often have clusters comprised of cells from different samples of different experimental conditions. It is often desirable to check for differential expression between conditions within each cluster, allowing us to identify cell-type-specific responses to the experimental perturbation.
Given a set of pseudo-bulk profiles (usually generated by sumCountsAcrossCells
),
this function loops over the labels and uses edgeR or voom
to detect DE genes between conditions.
The DE analysis for each label is largely the same as a standard analysis for bulk RNA-seq data,
using design
and coef
or contrast
as described in the edgeR or limma user guides.
Generally speaking, edgeR handles low counts better via its count-based model
but method="voom"
supports variable sample precision when quality=TRUE
.
Performing pseudo-bulk DGE enables us to reuse well-tested methods developed for bulk RNA-seq data analysis. Each pseudo-bulk profile can be treated as an in silico mimicry of a real bulk RNA-seq sample (though in practice, it tends to be much more variable due to the lower numbers of cells). This also models the relevant variability between experimental replicates (i.e., across samples) rather than that between cells in the same sample, without resorting to expensive mixed-effects models.
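The aggregation itself can be sketched in base R with rowsum; in practice sumCountsAcrossCells handles this, and the simulated data below are purely illustrative:

```r
# A base-R sketch of pseudo-bulk aggregation: summing counts over all
# cells sharing a sample-by-cluster combination. In practice,
# sumCountsAcrossCells() does this; the simulated data are illustrative.
set.seed(0)
ngenes <- 100; ncells <- 300
counts <- matrix(rpois(ngenes * ncells, lambda=2), nrow=ngenes)
sample.id <- sample(paste0("sample", 1:4), ncells, replace=TRUE)
cluster.id <- sample(paste0("cluster", 1:3), ncells, replace=TRUE)

# One pseudo-bulk profile per sample/cluster combination:
combo <- paste(sample.id, cluster.id, sep=".")
pseudo <- t(rowsum(t(counts), group=combo))  # genes x profiles
dim(pseudo)

# Sanity check: aggregation preserves the total count.
sum(pseudo) == sum(counts)
```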
The DE analysis for each label is independent of that for any other label. This aims to minimize problems due to differences in abundance and variance between labels, at the cost of losing the ability to share information across labels.
In some cases, it will be impossible to perform a DE analysis for a label. The most obvious reason is if there are no residual degrees of freedom; other explanations include impossible contrasts or a failure to construct an appropriate design matrix (e.g., if a cell type only exists in one condition).
Note that we assume that x
has already been filtered to remove unstable pseudo-bulk profiles generated from few cells.
A List with one DataFrame of DE results per unique (non-failed) level of label
.
This contains columns from topTags
if method="edgeR"
or topTable
if method="voom"
.
All DataFrames have row names equal to rownames(x)
.
The metadata
of the List contains failed
,
a character vector with the names of the labels for which the comparison could not be performed - see Details.
The metadata
of the individual DataFrames contains design
, the final design matrix for that label.
If include.intermediates
, the metadata
will also contain
y
, the DGEList used for the analysis; and fit
, the DGEGLM object after GLM fitting.
For each label, abundance filtering is performed using filterByExpr
prior to further analysis.
Genes that are filtered out will still show up in the DataFrame for that label, but with all statistics set to NA
.
As this is done separately for each label, a different set of genes may be filtered out for each label,
which is largely to be expected if there is any label-specific expression.
By default, the minimum group size for filterByExpr
is determined using the design matrix.
However, this may not be optimal if the design matrix contains additional terms (e.g., blocking factors)
in which case it is not easy to determine the minimum size of the groups relevant to the comparison of interest.
To overcome this, users can set condition
to the group to which each sample belongs,
which is used by filterByExpr
to obtain a more appropriate minimum group size.
Aaron Lun
Tung P-Y et al. (2017). Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921
Lun ATL and Marioni JC (2017). Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data. Biostatistics 18, 451-464
Crowell HL et al. (2019). On the discovery of population-specific state transitions from multi-sample multi-condition single-cell RNA sequencing data. bioRxiv
sumCountsAcrossCells
, to easily generate the pseudo-bulk count matrix.
decideTestsPerLabel
, to generate a summary of the DE results across all labels.
pseudoBulkSpecific
, to look for label-specific DE genes.
pbDS
from the muscat package, which uses a similar approach.
set.seed(10000)
library(scuttle)
sce <- mockSCE(ncells=1000)
sce$samples <- gl(8, 125) # Pretending we have 8 samples.

# Making up some clusters.
sce <- logNormCounts(sce)
clusters <- kmeans(t(logcounts(sce)), centers=3)$cluster

# Creating a set of pseudo-bulk profiles:
info <- DataFrame(sample=sce$samples, cluster=clusters)
pseudo <- sumCountsAcrossCells(sce, info)

# Making up an experimental design for our 8 samples.
pseudo$DRUG <- gl(2,4)[pseudo$sample]

# DGE analysis:
out <- pseudoBulkDGE(pseudo,
    label=pseudo$cluster,
    condition=pseudo$DRUG,
    design=~DRUG,
    coef="DRUG2"
)
out[[1]]
metadata(out[[1]])$design
Detect label-specific DE genes in a pseudo-bulk analysis, by testing whether the log-fold change is more extreme than the average log-fold change of other labels.
pseudoBulkSpecific(x, ...)

## S4 method for signature 'ANY'
pseudoBulkSpecific(
  x,
  label,
  condition = NULL,
  ...,
  method = c("edgeR", "voom"),
  sorted = FALSE,
  average = c("median", "mean"),
  missing.as.zero = FALSE,
  reference = NULL
)

## S4 method for signature 'SummarizedExperiment'
pseudoBulkSpecific(x, ..., assay.type = 1)
x |
A numeric matrix of counts where rows are genes and columns are pseudo-bulk profiles. Alternatively, a SummarizedExperiment object containing such a matrix in its assays. |
... |
For the generic, further arguments to pass to individual methods. For the ANY method, further arguments to pass to pseudoBulkDGE. For the SummarizedExperiment method, further arguments to pass to the ANY method. |
label |
A vector or factor of length equal to |
condition |
A vector or factor of length equal to |
method |
String specifying the DE analysis framework to use. |
sorted |
Logical scalar indicating whether the output tables should be sorted by p-value. |
average |
String specifying the method to compute the average log-fold change of all other labels. |
missing.as.zero |
Logical scalar indicating whether missing log-fold changes should be set to zero. |
reference |
A List containing the (unsorted) output of |
assay.type |
String or integer scalar specifying the assay to use from |
This function implements a quick and dirty method for detecting label-specific DE genes. For a given label and gene, the null hypothesis is that the log-fold change lies between zero and the average log-fold change for that gene across all other labels. Genes that reject this null either have log-fold changes in the opposite sign or are significantly more extreme than the average.
To implement this, we test each gene against the two extremes and take the larger of the two p-values.
The p-value is set to 1 if the log-fold change lies between the extremes.
This is somewhat similar to how treat
might behave if the null interval was not centered at zero;
however, our approach is more conservative than treat
as the p-value calculations are not quite correct.
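The interval test can be illustrated with a small base-R sketch under a normal approximation. Here, interval_p and its se argument are hypothetical stand-ins for illustration only; the actual p-values are derived from the edgeR or voom model fits.

```r
# Hypothetical sketch of the interval test, assuming a normal
# approximation with a known standard error 'se' (not scran's code).
interval_p <- function(lfc, other.ave, se) {
    lo <- pmin(0, other.ave)
    hi <- pmax(0, other.ave)
    # Two-sided tests against each extreme of the null interval:
    p.lo <- 2 * pnorm(abs(lfc - lo) / se, lower.tail = FALSE)
    p.hi <- 2 * pnorm(abs(lfc - hi) / se, lower.tail = FALSE)
    # Take the larger p-value; inside the interval, report 1.
    ifelse(lfc >= lo & lfc <= hi, 1, pmax(p.lo, p.hi))
}

interval_p(lfc = c(-2, 0.5, 3), other.ave = 1, se = 0.5)
```

Note how the middle value lies inside the interval and is assigned a p-value of 1, while both extreme values are tested against the nearest boundary.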
It is worth stressing that there are no guarantees that the DE genes detected in this manner are truly label-specific.
For any label and DEG, there may be one or more labels with stronger log-fold changes,
but the average may be pulled towards zero by other labels with weaker or opposing effects.
The use of the average is analogous to recommendations in the edgeR user's guide for testing against multiple groups.
However, a more stringent selection can be achieved by applying gates on decideTestsPerLabel
.
Note that, if lfc
is specified in the arguments to pseudoBulkDGE
,
the null interval is expanded in both directions by the specified value.
A List of DataFrames where each DataFrame contains DE statistics for one label.
This is equivalent to the output of pseudoBulkDGE
;
if reference
is supplied, most of the statistics will be identical to those reported there.
The main differences are that the p-values and FDR values are changed.
Each DataFrame also has an OtherAverage
field containing the average log-fold change across all other labels.
The average log-fold change for each gene is computed by taking the median or mean (depending on average
)
of the corresponding log-fold changes in each of the DE analyses for the other labels.
We use the median by default as this means that at least half of all other labels should have weaker or opposing effects.
By default, low-abundance genes that were filtered out in a comparison do not contribute to the average.
Any log-fold changes that could be computed from them are considered to be too unstable.
If the gene is filtered out in all other labels, the average is set to zero for testing but is reported as NA
.
If missing.as.zero=TRUE
, the log-fold changes for all filtered genes are set to zero.
This is useful when a gene is only expressed in a subset of labels and is consistently DE in each comparison within that subset.
Testing against the average computed from only those labels in the subset would fail to detect this DEG as subset-specific.
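As a toy illustration of the averaging, consider a hypothetical vector of per-label log-fold changes for one gene, where NA marks a label in which the gene was filtered out:

```r
# Made-up log-fold changes for one gene across four labels.
lfc <- c(A = 2.1, B = 1.8, C = NA, D = 0.2)

# Sketch of the average across all labels other than 'label';
# missing.as.zero treats filtered-out log-fold changes as zero.
other_average <- function(lfc, label, missing.as.zero = FALSE) {
    others <- lfc[names(lfc) != label]
    if (missing.as.zero) {
        others[is.na(others)] <- 0
    }
    median(others, na.rm = TRUE)
}

other_average(lfc, "A")                         # median of 1.8 and 0.2
other_average(lfc, "A", missing.as.zero = TRUE) # median of 1.8, 0 and 0.2
</pre>
```

Setting missing.as.zero pulls the average towards zero, so a strong effect in label A is more likely to stand out as label-specific.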
Aaron Lun
pseudoBulkDGE
, for the underlying function that does all the heavy lifting.
set.seed(10000)
library(scuttle)
sce <- mockSCE(ncells=1000)
sce$samples <- gl(8, 125) # Pretending we have 8 samples.

# Making up some clusters.
sce <- logNormCounts(sce)
clusters <- kmeans(t(logcounts(sce)), centers=3)$cluster

# Creating a set of pseudo-bulk profiles:
info <- DataFrame(sample=sce$samples, cluster=clusters)
pseudo <- sumCountsAcrossCells(sce, info)

# Making up an experimental design for our 8 samples
# and adding a common DEG across all labels.
pseudo$DRUG <- gl(2,4)[pseudo$sample]
assay(pseudo)[1,pseudo$DRUG==1] <- assay(pseudo)[1,pseudo$DRUG==1] * 10

# Label-specific analysis (note behavior of the first gene).
out <- pseudoBulkSpecific(pseudo,
    label=pseudo$cluster,
    condition=pseudo$DRUG,
    design=~DRUG,
    coef="DRUG2"
)
out[[1]]
Cluster similar cells based on their expression profiles, using either log-expression values or ranks.
quickCluster(x, ...)

## S4 method for signature 'ANY'
quickCluster(
  x,
  min.size = 100,
  method = c("igraph", "hclust"),
  use.ranks = FALSE,
  d = NULL,
  subset.row = NULL,
  min.mean = NULL,
  graph.fun = "walktrap",
  BSPARAM = bsparam(),
  BPPARAM = SerialParam(),
  block = NULL,
  block.BPPARAM = SerialParam(),
  ...
)

## S4 method for signature 'SummarizedExperiment'
quickCluster(x, ..., assay.type = "counts")
x |
A numeric count matrix where rows are genes and columns are cells. Alternatively, a SummarizedExperiment object containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the ANY method, additional arguments to be passed to For the SummarizedExperiment method, additional arguments to pass to the ANY method. |
min.size |
An integer scalar specifying the minimum size of each cluster. |
method |
String specifying the clustering method to use.
|
use.ranks |
A logical scalar indicating whether clustering should be performed on the rank matrix, i.e., based on Spearman's rank correlation. |
d |
An integer scalar specifying the number of principal components to retain.
If |
subset.row |
See |
min.mean |
A numeric scalar specifying the filter to be applied on the average count for each gene prior to computing ranks.
Only used when |
graph.fun |
A function specifying the community detection algorithm to use on the nearest neighbor graph when |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA, if |
BPPARAM |
A BiocParallelParam object to use for parallel processing within each block. |
block |
A factor of length equal to |
block.BPPARAM |
A BiocParallelParam object specifying whether and how parallelization should be performed across blocks,
if |
assay.type |
A string specifying which assay values to use. |
This function provides a convenient wrapper to quickly define clusters of a minimum size min.size
.
Its intended use is to generate “quick and dirty” clusters for use in computeSumFactors
.
Two clustering strategies are available:
If method="hclust"
, a distance matrix is constructed;
hierarchical clustering is performed using Ward's criterion;
and cutreeDynamic
is used to define clusters of cells.
If method="igraph"
, a shared nearest neighbor graph is constructed using the buildSNNGraph
function.
This is used to define clusters based on highly connected communities in the graph, using the graph.fun
function.
By default, quickCluster
will apply these clustering algorithms on the principal component (PC) scores generated from the log-expression values.
These are obtained by running denoisePCA
on HVGs detected using the trend fitted to endogenous genes with modelGeneVar
.
If d
is specified, the PCA is directly performed on the entire x
and the specified number of PCs is retained.
It is also possible to use the clusters from this function for actual biological interpretation.
In such cases, users should set min.size=0
to avoid aggregation of small clusters.
However, it is often better to call the relevant functions (modelGeneVar
, denoisePCA
and buildSNNGraph
) manually as this provides more opportunities for diagnostics when the meaning of the clusters is important.
A character vector of cluster identities for each cell in x
.
We can break up the dataset by specifying block
to cluster cells, usually within each batch or run.
This generates clusters within each level of block
, which is entirely adequate for applications like computeSumFactors
where the aim of clustering is to separate dissimilar cells rather than group together all similar cells.
Blocking reduces computational work considerably by allowing each level to be processed independently, without compromising performance provided that there are enough cells within each batch.
Indeed, for applications like computeSumFactors
, we can use block
even in the absence of any known batch structure.
Specifically, we can set it to an arbitrary factor such as block=cut(seq_len(ncol(x)), 10)
to split the cells into ten batches of roughly equal size.
This aims to improve speed, especially when combined with block.BPPARAM
to parallelize processing of the independent levels.
If use.ranks=TRUE
, clustering is instead performed on PC scores obtained from scaled and centred ranks generated by scaledColRanks
.
This effectively means that clustering uses distances based on the Spearman's rank correlation between two cells.
In addition, if x
is a dgCMatrix and BSPARAM
has deferred=TRUE
,
ranks will be computed without loss of sparsity to improve speed and memory efficiency during PCA.
When use.ranks=TRUE
, the function will filter out genes with average counts (as defined by calculateAverage
) below min.mean
prior to computing ranks.
This removes low-abundance genes with many tied ranks, especially due to zeros, which may reduce the precision of the clustering.
We suggest setting min.mean
to 1 for read count data and 0.1 for UMI data - the function will automatically try to determine this from the data if min.mean=NULL
.
Setting use.ranks=TRUE
is invariant to scaling normalization and avoids circularity between normalization and clustering, e.g., in computeSumFactors
.
However, the default is to use the log-expression values with use.ranks=FALSE
, as this yields finer and more precise clusters.
With method="hclust"
, cutreeDynamic
is used to ensure that all clusters contain a minimum number of cells.
However, some cells may not be assigned to any cluster and are assigned identities of "0"
in the output vector.
In most cases, this is because those cells belong in a separate cluster with fewer than min.size
cells.
The function will not be able to call this as a cluster as the minimum threshold on the number of cells has not been passed.
Users are advised to check that the unassigned cells do indeed form their own cluster.
Otherwise, it may be necessary to use a different clustering algorithm.
When using method="igraph"
, clusters are first identified using the specified graph.fun
.
If the smallest cluster contains fewer cells than min.size
, it is merged with the closest neighbouring cluster.
In particular, the function will attempt to merge the smallest cluster with each other cluster.
The merge that maximizes the modularity score is selected, and a new merged cluster is formed.
This process is repeated until all (merged) clusters are larger than min.size
.
Aaron Lun and Karsten Bach
van Dongen S and Enright AJ (2012). Metric distances derived from cosine similarity and Pearson and Spearman correlations. arXiv 1208.3145
Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75
computeSumFactors
, where the clustering results can be used as clusters=
.
buildSNNGraph
, for additional arguments to customize the clustering when method="igraph"
.
cutreeDynamic
, for additional arguments to customize the clustering when method="hclust"
.
scaledColRanks
, to get the rank matrix that was used with use.ranks=TRUE
.
quickSubCluster
, for a related function that uses a similar approach for subclustering.
library(scuttle)
sce <- mockSCE()

# Basic application (lowering min.size for this small demo):
clusters <- quickCluster(sce, min.size=50)
table(clusters)

# Operating on ranked expression values:
clusters2 <- quickCluster(sce, min.size=50, use.ranks=TRUE)
table(clusters2)

# Using hierarchical clustering:
clusters <- quickCluster(sce, min.size=50, method="hclust")
table(clusters)
Performs a quick subclustering for all cells within each group.
quickSubCluster(x, ...)

## S4 method for signature 'ANY'
quickSubCluster(x, normalize = TRUE, ...)

## S4 method for signature 'SummarizedExperiment'
quickSubCluster(x, ...)

## S4 method for signature 'SingleCellExperiment'
quickSubCluster(
  x,
  groups,
  normalize = TRUE,
  restricted = NULL,
  prepFUN = NULL,
  min.ncells = 50,
  clusterFUN = NULL,
  BLUSPARAM = NNGraphParam(),
  format = "%s.%s",
  assay.type = "counts",
  simplify = FALSE
)
x |
A matrix of counts or log-normalized expression values (if Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the ANY and SummarizedExperiment methods, further arguments to pass to the SingleCellExperiment method. |
normalize |
Logical scalar indicating whether each subset of |
groups |
A vector of group assignments for all cells, usually corresponding to cluster identities. |
restricted |
Character vector containing the subset of groups in |
prepFUN |
A function that accepts a single SingleCellExperiment object and returns another SingleCellExperiment containing any additional elements required for clustering (e.g., PCA results). |
min.ncells |
An integer scalar specifying the minimum number of cells in a group to be considered for subclustering. |
clusterFUN |
A function that accepts a single SingleCellExperiment object and returns a vector of cluster assignments for each cell in that object. |
BLUSPARAM |
A BlusterParam object that is used to specify the clustering via |
format |
A string to be passed to |
assay.type |
String or integer scalar specifying the relevant assay. |
simplify |
Logical scalar indicating whether just the subcluster assignments should be returned. |
quickSubCluster
is a simple convenience function that loops over all levels of groups
to perform subclustering.
It subsets x
to retain all cells in one level and then runs prepFUN
and clusterFUN
to cluster them.
Levels with fewer than min.ncells
are not subclustered and have "subcluster"
set to the name of the level.
The distinction between prepFUN
and clusterFUN
is that the former's calculations are preserved in the output.
For example, we would put the PCA in prepFUN
so that the PCs are returned in the reducedDims
for later use.
In contrast, clusterFUN
is only used to obtain the subcluster assignments so any intermediate objects are lost.
By default, prepFUN
will run modelGeneVar
, take the top 10% of highly variable genes, and perform PCA on this subset.
clusterFUN
will then perform clustering on the PC matrix with clusterRows
and BLUSPARAM
.
Either or both of these functions can be replaced with custom functions.
The default behavior of this function is the same as running quickCluster
on each subset with default parameters except for min.size=0
.
By default, a named List of SingleCellExperiment objects.
Each object corresponds to a level of groups
and contains a "subcluster"
column metadata field with the subcluster identities for each cell.
The metadata
of the List also contains index
, a list of integer vectors specifying the cells in x
in each returned SingleCellExperiment object;
and subcluster
, a character vector of subcluster identities (see next).
If restricted
is not NULL
, only the specified groups in restricted
will be present in the output List.
If simplify=TRUE
, the character vector of subcluster identities is returned.
This is of length equal to ncol(x)
and each entry follows the format defined in format
.
The only exceptions are if the number of cells in the parent cluster is less than min.ncells
,
or the parent cluster is not listed in a non-NULL
value of restricted
.
In such cases, the parent cluster's name is used instead.
Aaron Lun
quickCluster
, for a related function to quickly obtain clusters.
library(scuttle)
sce <- mockSCE(ncells=200)

# Lowering min.size for this small demo:
clusters <- quickCluster(sce, min.size=50)

# Getting subclusters:
out <- quickSubCluster(sce, clusters)

# Defining custom prep functions:
out2 <- quickSubCluster(sce, clusters,
    prepFUN=function(x) {
        dec <- modelGeneVarWithSpikes(x, "Spikes")
        top <- getTopHVGs(dec, prop=0.2)
        scater::runPCA(x, subset_row=top, ncomponents=25)
    }
)

# Defining custom cluster functions:
out3 <- quickSubCluster(sce, clusters,
    clusterFUN=function(x) {
        kmeans(reducedDim(x, "PCA"), sqrt(ncol(x)))$cluster
    }
)
Compute an approximate p-value against the null hypothesis that Spearman's rho is zero.
This vectorizes the approximate p-value calculation in cor.test
with method="spearman"
.
rhoToPValue(rho, n, positive = NULL)
rho |
Numeric vector of rho values. |
n |
Integer scalar specifying the number of observations used to compute |
positive |
Logical scalar indicating whether to perform a one-sided test for the alternative of a positive ( |
If positive=NULL
, a list of two numeric vectors is returned,
containing p-values for the test against the alternative hypothesis in each direction.
Otherwise, a numeric vector is returned containing the p-values for the test in the specified direction.
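For intuition, the t-based approximation used by cor.test(..., method="spearman", exact=FALSE) can be sketched in base R. The rho_to_p helper below is a hypothetical stand-in for illustration, not scran's exact implementation:

```r
# Sketch of the asymptotic t approximation for Spearman's rho:
# t = rho * sqrt((n - 2) / (1 - rho^2)), on n - 2 degrees of freedom.
rho_to_p <- function(rho, n, positive = TRUE) {
    t.stat <- rho * sqrt((n - 2) / (1 - rho^2))
    # upper tail for the positive alternative, lower tail otherwise:
    pt(t.stat, df = n - 2, lower.tail = !positive)
}

rho_to_p(0.5, n = 50)                   # test for positive correlation
rho_to_p(0.5, n = 50, positive = FALSE) # test for negative correlation
```

The two one-sided p-values for a given rho sum to one, matching the two-vector output described above for positive=NULL.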
Aaron Lun
rhoToPValue(seq(-1, 1, length.out=21), 50)
Use gene expression data to train a classifier for cell cycle phase.
sandbag(x, ...)

## S4 method for signature 'ANY'
sandbag(x, phases, gene.names = rownames(x), fraction = 0.5, subset.row = NULL)

## S4 method for signature 'SummarizedExperiment'
sandbag(x, ..., assay.type = "counts")
x |
A numeric matrix of gene expression values where rows are genes and columns are cells. Alternatively, a SummarizedExperiment object containing such a matrix. |
... |
For the generic, additional arguments to pass to specific methods. For the SummarizedExperiment method, additional arguments to pass to the ANY method. |
phases |
A list of subsetting vectors specifying which cells are in each phase of the cell cycle.
This should typically be of length 3, with elements named as |
gene.names |
A character vector of gene names. |
fraction |
A numeric scalar specifying the minimum fraction to define a marker gene pair. |
subset.row |
See |
assay.type |
A string specifying which assay values to use, e.g., |
This function implements the training step of the pair-based prediction method described by Scialdone et al. (2015).
Pairs of genes (A, B) are identified from a training data set where in each pair,
the fraction of cells in phase G1 with expression of A > B (based on expression values in training.data
)
and the fraction with B > A in each other phase exceeds fraction
.
These pairs are defined as the marker pairs for G1.
This is repeated for each phase to obtain a separate marker pair set.
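The selection criterion can be sketched in base R for a single candidate pair; the expression matrix and phase assignments below are made up for illustration:

```r
# Toy data: gene A is high in G1, gene B is high elsewhere.
set.seed(0)
expr <- rbind(
    A = c(rpois(20, 10), rpois(10, 2), rpois(20, 2)),
    B = c(rpois(20, 2), rpois(10, 10), rpois(20, 10))
)
phases <- list(G1 = 1:20, S = 21:30, G2M = 31:50)
fraction <- 0.5

# (A, B) is a G1 marker pair if A > B in more than 'fraction' of G1
# cells, and B > A in more than 'fraction' of cells in each other phase.
is.G1.pair <- mean(expr["A", phases$G1] > expr["B", phases$G1]) > fraction &&
    all(vapply(phases[c("S", "G2M")], function(idx) {
        mean(expr["B", idx] > expr["A", idx]) > fraction
    }, logical(1)))
is.G1.pair
```

sandbag applies this criterion to all gene pairs and phases at once; this sketch only checks one pair for G1.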
Pre-defined sets of marker pairs are provided for mouse and human (see Examples).
The mouse set was generated as described by Scialdone et al. (2015), while the human training set was generated with data from Leng et al. (2015).
Classification from test data can be performed using the cyclone
function.
For each cell, this involves comparing expression values between genes in each marker pair.
The cell is then assigned to the phase that is consistent with the direction of the difference in expression in the majority of pairs.
A named list of data.frames, where each data frame corresponds to a cell cycle phase and contains the names of the genes in each marker pair.
Antonio Scialdone, with modifications by Aaron Lun
Scialdone A, Natarajana KN, Saraiva LR et al. (2015). Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85:54–61
Leng N, Chu LF, Barry C et al. (2015). Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat. Methods 12:947–50
cyclone
, to perform the classification on a test dataset.
library(scuttle)
sce <- mockSCE(ncells=50, ngenes=200)

is.G1 <- 1:20
is.S <- 21:30
is.G2M <- 31:50
out <- sandbag(sce, list(G1=is.G1, S=is.S, G2M=is.G2M))
str(out)

# Getting pre-trained marker sets
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))
hs.pairs <- readRDS(system.file("exdata", "human_cycle_markers.rds", package="scran"))
Compute scaled column ranks from each cell's expression profile for distance calculations based on rank correlations.
scaledColRanks(
  x,
  subset.row = NULL,
  min.mean = NULL,
  transposed = FALSE,
  as.sparse = FALSE,
  withDimnames = TRUE,
  BPPARAM = SerialParam()
)
x |
A numeric matrix-like object containing cells in columns and features in the rows. |
subset.row |
A logical, integer or character scalar indicating the rows of |
min.mean |
A numeric scalar specifying the filter to be applied on the average normalized count for each feature prior to computing ranks.
Disabled by setting to |
transposed |
A logical scalar specifying whether the output should be transposed. |
as.sparse |
A logical scalar indicating whether the output should be sparse. |
withDimnames |
A logical scalar specifying whether the output should contain the dimnames of |
BPPARAM |
A BiocParallelParam object specifying whether and how parallelization should be performed.
Currently only used for filtering if |
Euclidean distances computed based on the output rank matrix are equivalent to distances computed from Spearman's rank correlation.
This can be used in clustering, nearest-neighbour searches, etc. as a robust alternative to Euclidean distances computed directly from x
.
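This equivalence can be checked in base R with a simplified stand-in for scaledColRanks (no filtering or sparsity handling, and the exact scaling may differ from scran's):

```r
# Simplified stand-in: centre each cell's ranks and scale to unit norm,
# so that the squared Euclidean distance between two cells equals
# 2 * (1 - Spearman's rho).
scaled_ranks <- function(x) {
    r <- apply(x, 2, rank)                # rank genes within each cell
    r <- sweep(r, 2, colMeans(r))         # centre each column
    sweep(r, 2, sqrt(colSums(r^2)), "/")  # scale each column to unit norm
}

set.seed(42)
x <- matrix(rpois(200, 5), nrow = 50)     # 50 genes, 4 cells
sr <- scaled_ranks(x)

d2 <- sum((sr[, 1] - sr[, 2])^2)
rho <- cor(x[, 1], x[, 2], method = "spearman")
c(d2 = d2, expected = 2 * (1 - rho))
```

The two printed values agree, confirming that distances on the scaled rank matrix are a monotonic function of the rank correlation.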
If as.sparse=TRUE
, the most common average rank is set to zero in the output.
This can be useful for highly sparse input data where zeroes have the same rank and are themselves returned as zeroes.
Obviously, this means that the ranks are not centred, so this will have to be done manually prior to any downstream distance calculations.
A matrix of the same dimensions as x
, where each column contains the centred and scaled ranks of the expression values for each cell.
If transposed=TRUE
, this matrix is transposed so that rows correspond to cells.
If as.sparse
, the columns are not centered to preserve sparsity.
Aaron Lun
quickCluster
, where this function is used.
library(scuttle)
sce <- mockSCE()
rout <- scaledColRanks(counts(sce), transposed=TRUE)

# For use in clustering:
d <- dist(rout)
table(cutree(hclust(d), 4))

g <- buildSNNGraph(rout, transposed=TRUE)
table(igraph::cluster_walktrap(g)$membership)
Compute various summary scores for potential marker genes to distinguish between groups of cells.
scoreMarkers(x, ...)

## S4 method for signature 'ANY'
scoreMarkers(
  x,
  groups,
  block = NULL,
  pairings = NULL,
  lfc = 0,
  row.data = NULL,
  full.stats = FALSE,
  subset.row = NULL,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
scoreMarkers(x, groups, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
scoreMarkers(x, groups = colLabels(x, onAbsence = "error"), ...)
x |
A matrix-like object containing log-normalized expression values, with genes in rows and cells in columns. Alternatively, a SummarizedExperiment object containing such a matrix in its assays. |
... |
For the generic, further arguments to pass to individual methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
groups |
A factor or vector containing the identity of the group for each cell in |
block |
A factor or vector specifying the blocking level for each cell in |
pairings |
A vector, list or matrix specifying how the comparisons are to be performed, see details. |
lfc |
Numeric scalar specifying the log-fold change threshold to compute effect sizes against. |
row.data |
A DataFrame with the same number and names of rows in |
full.stats |
Logical scalar indicating whether the statistics from the pairwise comparisons should be directly returned. |
subset.row |
See |
BPPARAM |
A BiocParallelParam object specifying how the calculations should be parallelized. |
assay.type |
String or integer scalar specifying the assay containing the log-expression matrix to use. |
Compared to findMarkers
, this function represents a simpler and more intuitive summary of the differences between the groups.
We do this by realizing that the p-values for these types of comparisons are largely meaningless;
individual cells are not meaningful units of experimental replication, while the groups themselves are defined from the data.
Thus, by discarding the p-values, we can simplify our marker selection by focusing only on the effect sizes between groups.
Here, the strategy is to perform pairwise comparisons between each pair of groups to obtain various effect sizes.
For each group, we summarize the effect sizes across all pairwise comparisons involving that group (e.g., mean, min, max).
This yields a DataFrame for each group where each column contains a different summarized effect and each row corresponds to a gene in
x
.
Reordering the rows by the summary of choice can yield a ranking of potential marker genes for downstream analyses.
A List of DataFrames containing marker scores for each gene in each group.
Each DataFrame corresponds to a group and each row corresponds to a gene in x
.
See Details for information about the individual columns.
The logFC.cohen
columns contain the standardized log-fold change, i.e., Cohen's d.
For each pairwise comparison, this is defined as the difference in the mean log-expression for each group scaled by the average standard deviation across the two groups.
(Technically, we should use the pooled variance; however, this introduces some unpleasant asymmetry depending on the variance of the larger group, so we take a simple average instead.)
Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups.
We can also interpret Cohen's d as the number of standard deviations between the two group means.
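As an illustrative sketch of the definition above (not the internal implementation used by scoreMarkers), Cohen's d for a single gene between two groups of cells could be computed as:

```r
# Sketch only: difference in means scaled by the average of the two
# variances (rather than the pooled variance, as described above).
cohens.d <- function(x1, x2) {
    (mean(x1) - mean(x2)) / sqrt((var(x1) + var(x2)) / 2)
}

g1 <- rnorm(100, mean=2) # log-expression in the first group
g2 <- rnorm(100, mean=0) # log-expression in the second group
cohens.d(g1, g2) # close to 2, i.e., means are ~2 SDs apart
```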
The AUC
columns contain the area under the curve.
This is the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group.
The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test.
Values greater than 0.5 indicate that a gene is upregulated in the first group.
The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group. The clearest example is that of two distributions that exhibit no overlap, where the AUC is the same regardless of the variance of each distribution. This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a highly resolved ranking. The most appropriate choice of effect size is left at the user's discretion.
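To make the definition concrete, a naive computation of the AUC for one gene (again, a sketch rather than the internal implementation) would be:

```r
# Proportion of (first group, second group) pairs where the first
# observation is larger, counting ties as 0.5.
auc <- function(x1, x2) {
    mean(outer(x1, x2, ">")) + 0.5 * mean(outer(x1, x2, "==")) 
}

g1 <- rnorm(100, mean=1)
g2 <- rnorm(100, mean=0)
auc(g1, g2) # above 0.5, consistent with upregulation in the first group

# Equivalent to the scaled U-statistic from the Wilcoxon rank sum test:
wilcox.test(g1, g2)$statistic / (length(g1) * length(g2))
```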
Finally, the logFC.detected
columns contain the log-fold change in the proportion of cells with detected (i.e., non-zero) expression between groups.
This is specifically useful for detecting binary expression patterns, e.g., activation of an otherwise silent gene.
Note that the non-zero status of the data is not altered by normalization, so differences in library size will not be removed when computing this metric.
This effect is not necessarily problematic - users can interpret it as retaining information about the total RNA content, analogous to spike-in normalization.
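For intuition, the quantity being computed is along these lines (the exact handling of zero proportions inside scoreMarkers may differ, e.g., via pseudo-counts):

```r
# Log-fold change in the proportion of cells with non-zero counts.
counts1 <- rpois(200, lambda=2)   # gene is active in this group
counts2 <- rpois(200, lambda=0.1) # gene is mostly silent here
log2(mean(counts1 > 0) / mean(counts2 > 0))
```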
The default settings may yield highly ranked genes with large effect sizes but low log-fold changes if the variance is low (Cohen's d) or separation is clear (AUC).
Such genes may not be particularly interesting as the actual change in expression is modest.
Setting lfc
allows us to focus on genes with large log-fold changes between groups,
by simply shifting the “other” group's expression values by lfc
before computing effect sizes.
When lfc
is not zero, Cohen's d is generalized to the standardized difference between the observed log-fold change and lfc
.
For example, if we had lfc=2
and we obtained a Cohen's d of 3, this means that the observed log-fold change was 3 standard deviations above a value of 2.
A side effect is that we can only unambiguously interpret the direction of Cohen's d when it has the same sign as lfc
.
Our above example represents upregulation, but if our Cohen's d was negative, this could either mean downregulation or simply that our observed log-fold change was less than lfc
.
When lfc
is not zero, the AUC is generalized to the probability of obtaining a random observation in one group that is greater than a random observation plus lfc
in the other group.
For example, if we had lfc=2
and we obtained an AUC of 0.8, this means that a random observation from the first group would exceed a random observation from the second group by lfc
or more with 80% probability.
Again, we can only unambiguously interpret the direction of the change when it is the same as the sign of the lfc
.
In this case, an AUC above 0.5 with a positive lfc
represents upregulation, but an AUC below 0.5 could mean either downregulation or a log-fold change less than lfc
.
A non-zero setting of lfc
has no effect on the log-fold change in the proportion of cells with detected expression.
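For example, to focus on genes with log-fold changes above 1, using the same mock dataset as in the Examples:

```r
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)
kout <- kmeans(t(logcounts(sce)), centers=4)

# All effect sizes are now computed against a log-fold change threshold of 1.
out.lfc <- scoreMarkers(sce, groups=kout$cluster, lfc=1)
out.lfc[[1]][,c("mean.logFC.cohen", "mean.AUC")]
```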
To simplify interpretation, we summarize the effect sizes across all pairwise comparisons into a few key metrics.
For each group X, we consider the effect sizes from all pairwise comparisons between X
and the other groups.
We then compute the following values:
mean.*
, the mean effect size across all pairwise comparisons involving X.
A large value (>0 for log-fold changes, >0.5 for the AUCs) indicates that the gene is upregulated in X
compared to the average of the other groups.
A small value (<0 for the log-fold changes, <0.5 for the AUCs) indicates that the gene is downregulated in X
instead.
median.*
, the median effect size across all pairwise comparisons involving X.
A large value indicates that the gene is upregulated in X
compared to most (>50%) other groups.
A small value indicates that the gene is downregulated in X
instead.
min.*
, the minimum effect size across all pairwise comparisons involving X.
A large value indicates that the gene is upregulated in X
compared to all other groups.
A small value indicates that the gene is downregulated in X
compared to at least one other group.
max.*
, the maximum effect size across all pairwise comparisons involving X.
A large value indicates that the gene is upregulated in X
compared to at least one other group.
A small value indicates that the gene is downregulated in X
compared to all other groups.
rank.*
, the minimum rank (i.e., “min-rank”) across all pairwise comparisons involving X - see
?computeMinRank
for details.
A small min-rank indicates that the gene is one of the top upregulated genes in at least one comparison to another group.
One set of these columns is added to the DataFrame for each effect size described above.
For example, the mean column for the AUC would be mean.AUC
.
We can then reorder each group's DataFrame by our column of choice, depending on which summary and effect size we are interested in.
For example, if we ranked by decreasing min.logFC.detected
, we would be aiming for marker genes that exhibit strong binary increases in expression in X compared to all other groups.
If full.stats=TRUE
, an extra full.*
column is returned in the DataFrame.
This contains a nested DataFrame with number of columns equal to the number of other groups.
Each column contains the statistic from the comparison between X and the other group.
Keep in mind that the interpretations above also depend on the sign of lfc
.
The concept of a “large” summary statistic (>0 for Cohen's d, >0.5 for the AUCs) can only be interpreted as upregulation when lfc >= 0
.
Similarly, the concept of a “small” value (<0 for Cohen's d, <0.5 for the AUCs) can only be interpreted as downregulation when lfc <= 0
.
For example, if lfc=1
, a positive min.logFC.cohen
can still be interpreted as upregulation in X compared to all other groups,
but a negative
max.logFC.cohen
could not be interpreted as downregulation in X compared to all other groups.
We report the mean log-expression of all cells in X, as well as the grand mean of mean log-expression values for all other groups.
This is purely descriptive; while it can be used to compute an overall log-fold change, ranking is best performed on one of the effect sizes described above.
We also report the proportion of cells with detectable expression in X
and the mean proportion for all other groups.
When block
is specified, the reported mean for each group is computed via correctGroupSummary
.
Briefly, this involves fitting a linear model to remove the effect of the blocking factor from the per-group mean log-expression.
The same is done for the detected proportion, except that the values are subjected to a logit transformation prior to the model fitting.
In both cases, each group/block combination is weighted by its number of cells in the model.
The pairings
argument specifies the pairs of groups that should be compared.
This can be:
NULL
, in which case comparisons are performed between all groups in groups
.
A vector of the same type as groups
, specifying a subset of groups of interest.
We then perform all pairwise comparisons between groups in the subset.
A list of two vectors, each of the same type as groups
and specifying a subset of groups.
Comparisons are performed between one group from the first vector and another group from the second vector.
A matrix of two columns of the same type as groups
.
Each row is assumed to specify a pair of groups to be compared.
Effect sizes (and their summaries) are computed for only the pairwise comparisons specified by pairings
.
Similarly, the other.*
values in X's DataFrame are computed using only the groups involved in pairwise comparisons with X.
The default of
pairings=NULL
ensures that all groups are used and effect sizes for all pairwise comparisons are reported;
however, this may not be the case for other values of pairings
.
For list and matrix arguments, the first vector/column is treated as the first group in the effect size calculations.
Statistics for each comparison will only show up in the DataFrame for the first group,
i.e., a comparison between X and Y will have a valid
full.AUC$Y
field in X's DataFrame but not vice versa.
If both directions are desired in the output, both of the corresponding permutations should be explicitly specified in
pairings
.
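For example, using the mock dataset from the Examples, we could restrict scoring to comparisons of cluster 1 against clusters 2 and 3 only:

```r
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)
kout <- kmeans(t(logcounts(sce)), centers=4)

# Only compares cluster 1 to clusters 2 and 3, in that direction.
out.sub <- scoreMarkers(sce, groups=kout$cluster,
    pairings=list(1, c(2, 3)))
names(out.sub) # statistics are only reported for the first group(s)
```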
Aaron Lun
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay, only using k-means for convenience.
kout <- kmeans(t(logcounts(sce)), centers=4)

out <- scoreMarkers(sce, groups=kout$cluster)
out

# Ranking by a metric of choice:
of.interest <- out[[1]]
of.interest[order(of.interest$mean.AUC, decreasing=TRUE),1:4]
of.interest[order(of.interest$rank.AUC),1:4]
of.interest[order(of.interest$median.logFC.cohen, decreasing=TRUE),1:4]
of.interest[order(of.interest$min.logFC.detected, decreasing=TRUE),1:4]
Compute additional gene-level statistics for each group to assist in identifying marker genes,
to complement the formal test statistics generated by findMarkers
.
summaryMarkerStats(x, ...)

## S4 method for signature 'ANY'
summaryMarkerStats(
  x,
  groups,
  row.data = NULL,
  average = "mean",
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
summaryMarkerStats(x, ..., assay.type = "logcounts")
x |
A numeric matrix-like object of expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. This is generally expected to be normalized log-expression values unless one knows better. Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. |
groups |
A vector of length equal to ncol(x), specifying the group to which each cell is assigned. |
row.data |
A DataFrame containing additional row metadata for each gene in x. Alternatively, a list containing one such DataFrame per level of groups. |
average |
String specifying the type of average, to be passed to |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
assay.type |
A string specifying which assay values to use, usually "logcounts". |
This function only generates descriptive statistics for each gene to assist marker selection. It does not consider blocking factors or covariates that would otherwise be available from comparisons between groups. For the sake of brevity, statistics for the “other” groups are summarized into a single value.
A named List of DataFrames, with one entry per level of groups
.
Each DataFrame has number of rows corresponding to the rows in x
and contains the fields:
self.average
, the average (log-)expression across all cells in the current group.
other.average
, the grand average of the average (log-)expression across cells in the other groups.
self.detected
, the proportion of cells with detected expression in the current group.
other.detected
, the average proportion of cells with detected expression in the other groups.
Aaron Lun
findMarkers
, where the output of this function can be used in row.data=
.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=3)

sum.out <- summaryMarkerStats(sce, kout$cluster)
sum.out[["1"]]

# Add extra rowData if you like.
rd <- DataFrame(Symbol=sample(LETTERS, nrow(sce), replace=TRUE),
    row.names=rownames(sce))
sum.out <- summaryMarkerStats(sce, kout$cluster, row.data=rd)
sum.out[["1"]]
Perform basic hypothesis tests with linear models in an efficient manner.
testLinearModel(x, ...)

## S4 method for signature 'ANY'
testLinearModel(
  x,
  design,
  coefs = ncol(design),
  contrasts = NULL,
  block = NULL,
  equiweight = FALSE,
  method = "stouffer",
  subset.row = NULL,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
testLinearModel(x, ..., assay.type = "logcounts")
x |
A numeric matrix-like object containing log-expression values for cells (columns) and genes (rows). Alternatively, a SummarizedExperiment containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. |
design |
A numeric design matrix with number of rows equal to ncol(x). |
coefs |
An integer vector specifying the coefficients to drop to form the null model.
Only used if contrasts=NULL. |
contrasts |
A numeric vector or matrix specifying the contrast of interest.
This should have length (if vector) or number of rows (if matrix) equal to ncol(design). |
block |
A factor specifying the blocking levels for each cell in x. |
equiweight |
A logical scalar indicating whether statistics from each block should be given equal weight.
Otherwise, each block is weighted according to its number of cells.
Only used if block is specified. |
method |
String specifying how p-values should be combined when block is specified, see combinePValues. |
subset.row |
See |
BPPARAM |
A BiocParallelParam object indicating whether parallelization should be performed across genes. |
assay.type |
String or integer scalar specifying the assay containing the log-expression values. |
This function can be considered a more efficient version of lmFit
that works on a variety of matrix representations (see fitLinearModel
).
It also omits the empirical Bayes shrinkage step,
which is acceptable given the large number of residual d.f. in typical single-cell studies.
If contrasts
is specified, the null hypothesis is defined by the contrast matrix or vector in the same manner
that is used in the limma and edgeR packages.
Briefly, the contrast vector specifies a linear combination of coefficients that sums to zero under the null.
For contrast matrices, the joint null consists of the intersection of the nulls defined by each column vector.
Otherwise, if only coefs
is specified,
the null model is formed by simply dropping all of the specified coefficients from design
.
If block
is specified, a linear model is fitted separately to the cells in each level.
The results are combined across levels by averaging coefficients and combining p-values with combinePValues
.
By default, the contribution from each level is weighted by its number of cells;
if equiweight=TRUE
, each level is given equal weight instead.
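For instance, extending the Examples below with a blocking factor (here, a hypothetical batch assignment):

```r
y <- matrix(rnorm(10000), ncol=100)
A <- gl(2, 50)
design <- model.matrix(~A)

# Fit the model separately in each batch, then combine across batches
# with equal weight per batch.
batch <- rep(c("b1", "b2"), 50)
testLinearModel(y, design, contrasts=c(0, 1), block=batch, equiweight=TRUE)
```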
A DataFrame containing test results with one row per row of x
.
It contains the estimated values of the contrasted coefficients
as well as the p-value and FDR for each gene.
Aaron Lun
fitLinearModel
, which performs the hard work of fitting the linear models.
y <- matrix(rnorm(10000), ncol=100)

# Example with categorical factors:
A <- gl(2, 50)
design <- model.matrix(~A)
testLinearModel(y, design, contrasts=c(0, 1))

# Example with continuous variables:
u <- runif(100)
design <- model.matrix(~u)
testLinearModel(y, design, contrasts=c(0, 1))

# Example with multiple variables:
B <- gl(4, 25)
design <- model.matrix(~B)
testLinearModel(y, design, contrasts=cbind(c(0,1,0,0), c(0,0,1,-1)))