Package 'scrapper'

Title: Bindings to C++ Libraries for Single-Cell Analysis
Description: Implements R bindings to C++ code for analyzing single-cell (expression) data, mostly from various libscran libraries. Each function performs an individual step in the single-cell analysis workflow, ranging from quality control to clustering and marker detection. Additional wrappers are provided for easy construction of end-to-end workflows involving Bioconductor objects like SingleCellExperiments.
Authors: Aaron Lun [cre, aut]
Maintainer: Aaron Lun <[email protected]>
License: MIT + file LICENSE
Version: 1.7.3
Built: 2026-05-19 11:11:36 UTC
Source: https://github.com/bioc/scrapper

Help Index


scrapper: Bindings to C++ Libraries for Single-Cell Analysis

Description

Implements R bindings to C++ code for analyzing single-cell (expression) data, mostly from various libscran libraries. Each function performs an individual step in the single-cell analysis workflow, ranging from quality control to clustering and marker detection. Additional wrappers are provided for easy construction of end-to-end workflows involving Bioconductor objects like SingleCellExperiments.

Author(s)

Maintainer: Aaron Lun [email protected]

Authors:

See Also

Useful links:


Quality control for ADT count data

Description

Compute per-cell QC metrics from an initialized matrix of ADT counts, and use the metrics to suggest filter thresholds to retain high-quality cells.

Usage

computeAdtQcMetrics(x, subsets, num.threads = NULL)

suggestAdtQcThresholds(
  metrics,
  block = NULL,
  min.detected.drop = NULL,
  num.mads = NULL,
  detected.num.mads = num.mads,
  subset.sum.num.mads = num.mads
)

filterAdtQcMetrics(thresholds, metrics, block = NULL)

Arguments

x

A matrix-like object where rows are ADTs and columns are cells. Values are expected to be counts.

subsets

Named list of vectors specifying tag subsets of interest, typically control tags like IgGs. Each vector may be logical (whether to keep each row), integer (row indices) or character (row names). For character vectors, strings not present in rownames(x) are ignored.

num.threads

Integer specifying the number of threads to use.

If NULL, the default value in computeAdtQcMetricsDefaults is used.

metrics

DataFrame of per-cell QC metrics. This should have the same structure as the return value of computeAdtQcMetrics.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. Alternatively NULL if all cells are from the same block.

For filterAdtQcMetrics, a blocking factor should be provided if block was used to construct thresholds.

min.detected.drop

Minimum drop in the number of detected features from the median, in order to consider a cell to be of low quality.

If NULL, the default value in suggestAdtQcThresholdsDefaults is used.

num.mads

Number of median from the median, to define the threshold for outliers in each metric.

detected.num.mads

Number of median from the median, to define the threshold for outliers in the number of detected tags.

If NULL, the default value in suggestAdtQcThresholdsDefaults is used.

subset.sum.num.mads

Number of median from the median, to define the threshold for outliers in the subset sums.

If NULL, the default value in suggestAdtQcThresholdsDefaults is used.

thresholds

List with the same structure as produced by suggestAdtQcThresholds.

Value

For computeAdtQcMetrics, a DataFrame is returned with one row per cell in x. This contains the following columns:

  • sum, a numeric vector containing the total ADT count for each cell. In theory, this represents the efficiency of library preparation and sequencing. Compared to RNA, the sum is less useful as a QC metric for ADT data as it is strongly influenced by biological variation in the abundance of the targeted features. Nonetheless, we compute it for diagnostic purposes.

  • detected, an integer vector containing the number of detected tags per cell. Even though ADTs are typically used in situations where few features are highly abundant (e.g., cell type-specific markers), we still expect detectable coverage of most features due to ambient contamination, non-specific binding or some background expression. Low numbers of detected tags indicates that library preparation or sequencing depth was suboptimal.

  • subsets, a nested DataFrame where each column corresponds to a control subset and is a numeric vector containing the total count in that subset. The exact interpretation depends on the nature of the feature subset but the most common use case involves isotype control (IgG) features. IgG antibodies should not bind to anything so a high subset sum suggests that non-specific binding is a problem, e.g., due to antibody conjugates. (Unlike RNA quality control, we do not use proportions here as it is entirely possible for a cell to have low counts for other tags due to the absence of their targeted features; this would result in a high proportion even if the cell has a "normal" level of non-specific binding.)

Each vector is of length equal to the number of cells.

For suggestAdtQcThresholds, a named list is returned:

  • If block=NULL, the list contains:

    • detected, a number containing the lower bound on the number of detected tags. This is defined as the lower of (i) num.mads MADs below the median for the log-transformed values across all cells, and (ii) the product of 1 - min.detected.drop and the median across all cells. The latter avoids overly aggressive filtering when the MAD is zero.

    • subsets, a numeric vector containing the upper bound on the sum of counts in each control subset. This is defined as num.mads MADs above the median of the log-transformed metrics across all cells.

  • Otherwise, if block is supplied, the list contains:

    • detected, a numeric vector containing the lower bound on the number of detected tags for each blocking level. Here, the threshold is computed independently for each block, using the same method as the unblocked case.

    • subsets, a list of numeric vectors containing the upper bound on the sum of counts in each control subset for each blocking level. Here, the threshold is computed independently for each block, using the same method as the unblocked case.

    • block.ids, a vector containing the identities of the unique blocks.

    Each vector is of length equal to the number of levels in block and is named accordingly.

For filterAdtQcMetrics, a logical vector of length ncol(x) is returned indicating which cells are of high quality. High-quality cells are defined as those with numbers of detected tags above the detected threshold and control subset sums below the subsets threshold.

Author(s)

Aaron Lun

See Also

The compute_adt_qc_metrics, compute_adt_qc_filters and compute_adt_qc_filters_blocked functions in https://libscran.github.io/scran_qc/.

quickAdtQc.se, to run all of the ADT-related QC functions on a SummarizedExperiment.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))

# Mocking up a control set.
sub <- list(IgG=rbinom(nrow(x), 1, 0.1) > 0)

qc <- computeAdtQcMetrics(x, sub)
qc 

filt <- suggestAdtQcThresholds(qc)
str(filt)

keep <- filterAdtQcMetrics(filt, qc)
summary(keep)

Default parameters for ADT quality control

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in each function's signature.

Usage

computeAdtQcMetricsDefaults()

suggestAdtQcThresholdsDefaults()

Value

Named list containing default values for the various function arguments.

Author(s)

Aaron Lun

Examples

computeAdtQcMetricsDefaults()
suggestAdtQcThresholdsDefaults()

Aggregate expression across cells

Description

Aggregate expression values across cells based on one or more grouping factors. This is usually applied to a count matrix to create pseudo-bulk profiles for each cluster/sample combination. These profiles can then be used as if they were counts from bulk data, e.g., for differential analyses with edgeR.

Usage

aggregateAcrossCells(
  x,
  factors,
  compute.sum = NULL,
  compute.detected = NULL,
  compute.median = NULL,
  num.threads = NULL
)

Arguments

x

A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are typically expected to be counts.

factors

A list or data frame (or their equivalents from S4Vectors) containing one or more grouping factors, see combineFactors.

compute.sum

Boolean indicating whether to compute the sum in each group.

If NULL, the default value in aggregateAcrossCellsDefaults is used.

compute.detected

Boolean indicating whether to compute the number of cells with detected expression in each group.

If NULL, the default value in aggregateAcrossCellsDefaults is used.

compute.median

Boolean indicating whether to compute the median in each group.

If NULL, the default value in aggregateAcrossCellsDefaults is used.

num.threads

Integer specifying the number of threads to be used for aggregation.

If NULL, the default value in aggregateAcrossCellsDefaults is used.

Value

List containing:

  • sums, a numeric matrix where each row corresponds to a gene and each column corresponds to a unique combination of levels from factors. Each entry contains the summed expression across all cells with that combination. Omitted if compute.sum=FALSE.

  • detected, an integer matrix where each row corresponds to a gene and each column corresponds to a unique combination of levels from factors. Each entry contains the number of cells with detected (i.e., positive) expression in that combination. Omitted if compute.detected=FALSE.

  • medians, a numeric matrix where each row corresponds to a gene and each column corresponds to a unique combination of levels from factors. Each entry contains the median expression across all cells with that combination. Omitted if compute.median=FALSE.

  • combinations, a DataFrame containing the unique combination of levels from factors. Rows correspond to columns of sums and detected, while columns correspond to the factors in factors.

  • counts, the number of cells associated with each combination. Each entry corresponds to a row of combinations.

  • index, an integer vector of length equal to the number of cells in x. This specifies the combination in combinations to which each cell was assigned.

Author(s)

Aaron Lun

See Also

The aggregate_across_cells function in https://libscran.github.io/scran_aggregate/.

aggregateAcrossCells.se, to perform aggregation on a SummarizedExperiment.

aggregateAcrossGenes, to aggregate expression values across gene sets.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))

# Simple aggregation:
clusters <- sample(LETTERS, 100, replace=TRUE)
agg <- aggregateAcrossCells(x, list(cluster=clusters))
str(agg)

# Multi-factor aggregation
samples <- sample(1:5, 100, replace=TRUE)
agg2 <- aggregateAcrossCells(x, list(cluster=clusters, sample=samples))
str(agg2)

Aggregate expression across cells in a SummarizedExperiment

Description

Aggregate expression values across groups of cells for each gene, by calling aggregateAcrossCells on an assay in a SummarizedExperiment.

Usage

aggregateAcrossCells.se(
  x,
  factors,
  num.threads = 1,
  more.aggr.args = list(),
  assay.type = "counts",
  output.prefix = "factor.",
  counts.name = "counts",
  meta.name = "aggregated",
  include.coldata = TRUE,
  more.coldata.args = list(),
  altexps = NULL,
  copy.altexps = FALSE
)

aggregateColData(coldata, index, number, only.atomic = TRUE, placeholder = NA)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

factors

List or data frame (or their equivalents from S4Vectors) containing grouping factors, see aggregateAcrossCells for more details.

Alternatively, an atomic vector or factor representing a single variable.

num.threads

Number of threads, passed to aggregateAcrossCells.

more.aggr.args

Named list of additional arguments to pass to aggregateAcrossCells.

assay.type

Integer or string specifying the assay of x to be aggregated.

output.prefix

String specifying a prefix to add to the names of the link[SummarizedExperiment]{colData} columns storing the factor combinations. If NULL, no prefix is added.

counts.name

String specifying the name of the colData column in which to store the cell count for each factor combination. If NULL, the cell counts are not reported.

meta.name

String specifying the name of the metadata entry in which to store additional outputs like the combination indices. If NULL, additional outputs are not reported.

include.coldata

Logical scalar indicating whether to add the aggregated colData from x to the output.

more.coldata.args

Named list of additional arguments to pass to aggregateColData. Only relevant if include.coldata=TRUE.

altexps

Unnamed integer or character vector specifying the indices/names of alternative experiments of x to aggregate. The aggregated assay from each alternative experiment is determined by assay.type.

Alternatively, this may be a named integer or character vector. Each name specifies an alternative experiment while each value is the index/name of the assay to aggregate from that experiment.

Only relevant if x is a SingleCellExperiment.

copy.altexps

Logical scalar indicating whether to copy the colData and metadata of the output SingleCellExperiment into each of its alternative experiments.

coldata

DataFrame of column data, containing one row for each cell.

index

Integer vector containing the index of the factor combination to which each cell in coldata was assigned.

number

Integer specifying the total number of unique factor combinations. All elements of index should be less than or equal to number.

only.atomic

Logical scalar specifying whether to skip non-atomic, non-factor columns.

placeholder

Placeholder value to store in the output column when a factor combination does not have a single unique value.

Value

For aggregateAcrossCells.se, a SummarizedExperiment is returned where each column corresponds to a factor combination. Each row corresponds to a gene in x and the rowData is taken from x. The assays contain the sum of counts ("sums") and the number of detected cells ("detected") in each combination for each gene. The colData contains:

  • The factor combinations, with column names prefixed by output.prefix.

  • The cell count for each combination, named by counts.name.

  • Additional colData from x if include.coldata=TRUE. This is aggregated with aggregateColData on the combination indices.

The metadata contains a list named as meta.name, containing a index integer vector of length equal to the number of cells in x. Each entry of this vector is an index of the factor combination (i.e., column of the output object) to which the corresponding cell was assigned.

If altexps is specified, a SingleCellExperiment is returned instead. The same aggregation for the main experiment is applied to each alternative experiment. If copy.altexps=TRUE, the colData for each alternative experiment will contain a copy of the factor combinations and cell counts, and the metadata will contain a copy of the index vector.

For aggregateColData, a DataFrame is returned with number of rows equal to number. Each atomic or factor column in coldata is represented by a column in the output DataFrame. In each column, the j-th entry is equal to the unique value of all rows where index == j, or placeholder if there is not exactly one unique value. If only.atomic=FALSE, any non-atomic/non-factor columns of coldata are represented in the output DataFrame by a vector of placeholder values. If only.atomic=TRUE, any non-atomic/non-factor columns of coldata are skipped.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("start")
aggr <- aggregateAcrossCells.se(sce, sce$level1class)
head(assay(aggr))
colData(aggr)

# We can also aggregate within alternative experiments.
aggr2 <- aggregateAcrossCells.se(sce, sce$level1class, altexps="ERCC")
head(assay(altExp(aggr2, "ERCC")))

Default parameters for aggregateAcrossCells

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the aggregateAcrossCells function signature.

Usage

aggregateAcrossCellsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

aggregateAcrossCellsDefaults()

Aggregate expression across genes

Description

Aggregate expression values across genes, potentially with weights. This is typically used to summarize expression values for gene sets into a single per-cell score.

Usage

aggregateAcrossGenes(
  x,
  sets,
  average = NULL,
  convert = TRUE,
  num.threads = NULL
)

Arguments

x

A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are usually normalized expression values, possibly log-transformed depending on the application.

sets

List of vectors where each entry corresponds to a gene set. Each entry may be an integer vector of row indices, a logical vector of length equal to the number of rows, or a character vector of row names. For integer and character vectors, duplicate elements are ignored. For character vectors, any strings not present in rownames(x) are ignored.

Alternatively, each entry may be a list of two vectors. The first vector should be either integer (row indices) or character (row names), specifying the genes in the set. The second vector should be numeric and of the same length as the first vector, specifying the weight associated with each gene. If duplicate genes are present, only the first occurrence is used. If the first vector contains gene names not present in x, those genes are ignored.

average

Boolean indicating whether to compute the average rather than the sum.

If NULL, the default value in aggregateAcrossGenesDefaults is used.

convert

Boolean indicating whether to convert gene identities to non-duplicate row indices in each entry of sets. Can be set to FALSE for greater efficiency if the sets already contains non-duplicated integer vectors.

num.threads

Integer specifying the number of threads to be used for aggregation.

If NULL, the default value in aggregateAcrossGenesDefaults is used.

Value

List of length equal to that of sets. Each entry is a numeric vector of length equal to the number of columns in x, containing the (weighted) sum/mean of expression values for the corresponding set across all cells.

Author(s)

Aaron Lun

See Also

The aggregate_across_genes function in https://libscran.github.io/scran_aggregate/.

aggregateAcrossGenes.se, to perform aggregation on a SummarizedExperiment.

aggregateAcrossCells, to aggregate expression values across groups of cells.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))

# Unweighted aggregation:
sets <- list(
   foo = sample(nrow(x), 20),
   bar = sample(nrow(x), 10)
)
agg <- aggregateAcrossGenes(x, sets)
str(agg)

# Weighted aggregation:
sets <- list(
   foo = list(sample(nrow(x), 20), runif(20)),
   bar = list(sample(nrow(x), 10), runif(10))
)
agg2 <- aggregateAcrossGenes(x, sets, average = TRUE)
str(agg2)

Aggregate expression across gene sets in a SummarizedExperiment

Description

Aggregate expression values across sets of genes for each cell, by calling aggregateAcrossGenes on an assay in a SummarizedExperiment.

Usage

aggregateAcrossGenes.se(
  x,
  sets,
  num.threads = 1,
  more.aggr.args = list(),
  assay.type = "logcounts",
  output.name = NULL
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

sets

List of gene sets, see aggregateAcrossGenes for more details.

Alternatively, sets may be a List subclass, in which case the mcols are used as the rowData of the output object. Weighted gene sets can be represented by a list of DataFrames or a DataFrameList, where each DataFrame contains two columns, i.e., the gene identities and the associated weights.

num.threads

Number of threads, passed to aggregateAcrossGenes.

more.aggr.args

Named list of additional arguments to pass to aggregateAcrossGenes.

assay.type

Integer or string specifying the assay of x to be aggregated.

output.name

String specifying the assay name in the output object. Defaults to assay.type if it is a string, otherwise "aggregated".

Value

A SummarizedExperiment with number of rows equal to the number of gene sets. The lone assay contains the aggregated values for each gene set for all cells. The colData is the same as that of x.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("norm")

library(org.Mm.eg.db)
set.df <- select(
    org.Mm.eg.db,
    keytype="GO",
    keys=c(
        "GO:0048709", # oligodendrocyte differentiation 
        "GO:0048699", # neuron development
        "GO:0048143"  # astrocyte activation 
    ),
    columns="SYMBOL"
)
sets <- splitAsList(set.df$SYMBOL, set.df$GO)

aggregated <- aggregateAcrossGenes.se(sce, sets)
aggregated
assay(aggregated)[,1:10]

Default parameters for aggregateAcrossGenes

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the aggregateAcrossGenes function signature.

Usage

aggregateAcrossGenesDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

aggregateAcrossGenesDefaults()

Defunct functions

Description

See Details for each function's recommended replacements.

Usage

analyze(...)

convertAnalyzeResults(...)

Arguments

...

Arguments, ignored.

Details

analyze is replaced by analyze.se.

convertAnalyzeResults is no longer necessary as conversion is done within analyze.se.


Analyze single-cell data from a SummarizedExperiment

Description

Execute a simple single-cell analysis pipeline, starting from a count matrix and ending with clusters, visualizations and markers. This also supports integration of multiple modalities and correction of batch effects.

Usage

analyze.se(
  x,
  rna.altexp = NA,
  adt.altexp = NULL,
  crispr.altexp = NULL,
  rna.assay.type = "counts",
  adt.assay.type = "counts",
  crispr.assay.type = "counts",
  block = NULL,
  block.name = "block",
  rna.qc.subsets = list(),
  rna.qc.output.prefix = NULL,
  more.rna.qc.args = list(),
  adt.qc.subsets = list(),
  adt.qc.output.prefix = NULL,
  more.adt.qc.args = list(),
  crispr.qc.output.prefix = NULL,
  more.crispr.qc.args = list(),
  filter.cells = TRUE,
  rna.norm.output.name = "logcounts",
  more.rna.norm.args = list(),
  adt.norm.output.name = "logcounts",
  more.adt.norm.args = list(),
  crispr.norm.output.name = "logcounts",
  more.crispr.norm.args = list(),
  rna.hvg.output.prefix = NULL,
  more.rna.hvg.args = list(),
  rna.pca.output.name = "PCA",
  more.rna.pca.args = list(),
  adt.pca.output.name = "PCA",
  more.adt.pca.args = list(),
  use.rna.pcs = TRUE,
  use.adt.pcs = TRUE,
  scale.output.name = "combined",
  more.scale.args = list(),
  mnn.output.name = "MNN",
  more.mnn.args = list(),
  more.umap.args = list(),
  more.tsne.args = list(),
  cluster.graph.output.name = "graph.cluster",
  more.build.graph.args = list(),
  more.cluster.graph.args = list(),
  more.neighbor.args = list(),
  kmeans.clusters = NULL,
  kmeans.clusters.output.name = "kmeans.cluster",
  more.kmeans.args = list(),
  clusters.for.markers = c("graph", "kmeans"),
  more.rna.marker.args = list(),
  more.adt.marker.args = list(),
  more.crispr.marker.args = list(),
  BNPARAM = AnnoyParam(),
  num.threads = 3L
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genomic features (genes, ADTs or CRISPR guides) and columns correspond to cells.

rna.altexp

String or integer specifying the alternative experiment of x containing the RNA data. If NA, the main experiment is assumed to contain the RNA data. If NULL, it is assumed that no RNA data is available.

adt.altexp

String or integer specifying the alternative experiment of x containing the ADT data. If NA, the main experiment is assumed to contain the ADT data. If NULL, it is assumed that no ADT data is available.

crispr.altexp

String or integer specifying the alternative experiment of x containing the CRISPR data. If NA, the main experiment is assumed to contain the CRISPR data. If NULL, it is assumed that no CRISPR data is available.

rna.assay.type

String or integer specifying the assay containing the RNA count data. Only used if rna.altexp is not NULL.

adt.assay.type

String or integer specifying the assay containing the ADT count data. Only used if adt.altexp is not NULL.

crispr.assay.type

String or integer specifying the assay containing the CRISPR count data. Only used if crispr.altexp is not NULL.

block

Vector or factor specifying the block of origin (e.g., batch, sample) for each cell in x. Alternatively NULL, if all cells are from the same block.

block.name

String specifying the name of the colData column in which to store the blocking factor. Only used if block is not NULL. If NULL, the blocking factor is not stored in the colData.

rna.qc.subsets

Passed to quickRnaQc.se as the subsets argument. Only used if rna.altexp is not NULL.

rna.qc.output.prefix

Passed to quickRnaQc.se as the output.prefix argument. Only used if rna.altexp is not NULL.

more.rna.qc.args

Named list of additional arguments to pass to quickRnaQc.se. Only used if rna.altexp is not NULL.

adt.qc.subsets

Passed to quickAdtQc.se as the subsets argument. Only used if adt.altexp is not NULL.

adt.qc.output.prefix

Passed to quickAdtQc.se as the output.prefix argument. Only used if adt.altexp is not NULL.

more.adt.qc.args

Named list of additional arguments to pass to quickAdtQc.se. Only used if adt.altexp is not NULL.

crispr.qc.output.prefix

Passed to quickCrisprQc.se as the output.prefix argument. Only used if crispr.altexp is not NULL.

more.crispr.qc.args

Named list of additional arguments to pass to quickCrisprQc.se. Only used if crispr.altexp is not NULL.

filter.cells

Logical scalar indicating whether to filter x to only retain high-quality cells in all modalities. If FALSE, QC metrics and thresholds are still computed but are not used to filter the count matrices.

rna.norm.output.name

Passed to normalizeRnaCounts.se as the output.name argument. Only used if rna.altexp is not NULL.

more.rna.norm.args

Named list of arguments to pass to normalizeRnaCounts.se. Only used if rna.altexp is not NULL.

adt.norm.output.name

Passed to normalizeAdtCounts.se as the output.name argument. Only used if adt.altexp is not NULL.

more.adt.norm.args

Named list of arguments to pass to normalizeAdtCounts.se. Only used if adt.altexp is not NULL.

crispr.norm.output.name

Passed to normalizeCrisprCounts.se as the output.name argument. Only used if crispr.altexp is not NULL.

more.crispr.norm.args

Named list of arguments to pass to normalizeCrisprCounts.se. Only used if crispr.altexp is not NULL.

rna.hvg.output.prefix

Passed to chooseRnaHvgs.se as the output.prefix argument. Only used if rna.altexp is provided.

more.rna.hvg.args

Named list of arguments to pass to chooseRnaHvgs.se. Only used if rna.altexp is provided.

rna.pca.output.name

Passed to runPca.se as the output.name argument. Only used if rna.altexp is provided.

more.rna.pca.args

Named list of arguments to pass to runPca.se. Only used if rna.altexp is provided.

adt.pca.output.name

Passed to runPca.se as the output.name argument. Only used if adt.altexp is provided.

more.adt.pca.args

Named list of arguments to pass to runPca.se. Only used if adt.altexp is provided.

use.rna.pcs

Logical scalar indicating whether to use the RNA-derived PCs for downstream steps (i.e., clustering, visualization). Only used if rna.altexp is provided.

use.adt.pcs

Logical scalar indicating whether to use the ADT-derived PCs for downstream steps (i.e., clustering, visualization). Only used if adt.altexp is provided.

scale.output.name

Passed to scaleByNeighbors.se as the output.name argument. Only used if multiple modalities are available and their corresponding use.*.pcs arguments are TRUE.

more.scale.args

Named list of arguments to pass to scaleByNeighbors.se. Only used if multiple modalities are available and their corresponding use.*.pcs arguments are TRUE.

mnn.output.name

Passed to correctMnn.se as the output.name argument. Only used if block is supplied.

more.mnn.args

Named list of arguments to pass to correctMnn.se. Only used if block is supplied.

more.umap.args

Passed to runAllNeighborSteps.se.

more.tsne.args

Passed to runAllNeighborSteps.se.

cluster.graph.output.name

Passed to runAllNeighborSteps.se as cluster.output.name.

more.build.graph.args

Passed to runAllNeighborSteps.se.

more.cluster.graph.args

Passed to runAllNeighborSteps.se.

more.neighbor.args

Passed to runAllNeighborSteps.se.

kmeans.clusters

Passed to clusterKmeans.se as the k argument. If NULL, k-means clustering is not performed.

kmeans.clusters.output.name

Passed to clusterKmeans.se as the output.name argument. Ignored if kmeans.clusters = NULL.

more.kmeans.args

Named list of arguments to pass to clusterKmeans.se. Ignored if kmeans.clusters = NULL.

clusters.for.markers

Character vector of clustering algorithms (either "graph" or "kmeans"), specifying the clustering to be used for marker detection. The first available clustering will be chosen. If no clustering is available from the list, markers will not be computed.

more.rna.marker.args

Named list of arguments to pass to scoreMarkers.se for the RNA data. Ignored if no suitable clusterings are available or if rna.altexp=NULL.

more.adt.marker.args

Named list of arguments to pass to scoreMarkers.se for the ADT data. Ignored if no suitable clusterings are available or if adt.altexp=NULL.

more.crispr.marker.args

Named list of arguments to pass to scoreMarkers.se for the CRISPR data. Ignored if no suitable clusterings are available or if crispr.altexp=NULL.

BNPARAM

A BiocNeighborParam instance specifying the nearest-neighbor search algorithm to use.

num.threads

Integer scalar specifying the number of threads to use in each step.

Details

This function is equivalent to:

Value

List containing:

  • x, a SingleCellExperiment that is a copy of the input x. It is also decorated with the results of each analysis step - see Details.

  • markers, a list of list of DataFrames containing the marker statistics for each modality. Each inner list corresponds to a modality (RNA, ADT, etc.) while each DataFrame corresponds to a cluster. If no clusterings were generated, this is set to NULL.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("start")
res <- analyze.se(
    sce, 
    rna.qc.subsets=list(mito=grep("^mt-", rownames(sce))),
    num.threads=2 # keep R CMD check happy
)
assayNames(res$x)
reducedDimNames(res$x)
colData(res$x)
previewMarkers(res$markers$rna[[1]], "cohens.d.mean")

Build a shared nearest neighbor graph

Description

Build a shared nearest neighbor (SNN) graph where each node is a cell. Edges are formed between cells that share one or more nearest neighbors, weighted by the number or ranking of those shared neighbors. If two cells are close together but have distinct sets of neighbors, the corresponding edge is downweighted as the two cells are unlikely to be part of the same neighborhood. In this manner, strongly weighted edges will only form within highly interconnected neighborhoods where many cells share the same neighbors. This provides a more sophisticated definition of similarity between cells compared to a simpler (unweighted) nearest neighbor graph that just focuses on immediate proximity.

Usage

buildSnnGraph(
  x,
  num.neighbors = NULL,
  weight.scheme = NULL,
  num.threads = NULL,
  BNPARAM = AnnoyParam(),
  as.pointer = FALSE
)

Arguments

x

For buildSnnGraph, a numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., runPca.

Alternatively, a named list of nearest-neighbor search results. This should contain index, an integer matrix where rows are neighbors and columns are cells. Each column contains 1-based indices for the nearest neighbors of the corresponding cell, ordered by increasing distance. The number of neighbors for each cell should be equal to num.neighbors, otherwise a warning is raised.

Alternatively, an index constructed by buildIndex.

num.neighbors

Integer specifying the number of neighbors to use to construct the graph. Larger values increase the connectivity of the graph and reduce the granularity of subsequent community detection steps, at the cost of speed.

If NULL, the default value in buildSnnGraphDefaults is used.

This argument is ignored if x contains pre-computed neighbor search results.

weight.scheme

String specifying the weighting scheme to use for constructing the SNN graph. This can be one of:

  • "ranked", where the weight of the edge is defined by the smallest sum of ranks across all shared neighbors. More shared neighbors, or shared neighbors that are close to both observations, will generally yield larger weights.

  • "number", where the weight of the edge is the number of shared nearest neighbors between them. This is a simpler scheme that is also slightly faster but does not account for the ranking of neighbors within each set.

  • "jaccard", where the weight of the edge is the Jaccard index of their neighbor sets, This is a monotonic transformation of the weight used in "number".

If NULL, the default value in buildSnnGraphDefaults is used.

num.threads

Integer specifying the number of threads to use.

If NULL, the default value in buildSnnGraphDefaults is used.

This argument is only used if x is not a list of existing nearest-neighbor search results.

BNPARAM

A BiocNeighborParam object specifying the algorithm to use.

This argument is only used if x is not a list of existing nearest-neighbor search results.

as.pointer

Boolean indicating whether to return an external pointer for direct use in clusterGraph. This avoids the extra memory usage caused by conversion to/from an R list.

Value

If as.pointer=FALSE, a list is returned containing:

  • vertices, an integer specifying the number of vertices in the graph (i.e., cells in x).

  • edges, an integer vector of 1-based indices for graph edges. Pairs of values represent the endpoints of an (undirected) edge, i.e., edges[1:2] form the first edge, edges[3:4] form the second edge and so on.

  • weights, a numeric vector of weights for each edge. This has length equal to half the length of edges.

If as.pointer=TRUE, an external pointer to the graph structure is returned. This can be directly used in clusterGraph.

Author(s)

Aaron Lun

See Also

The build_snn_graph function in https://libscran.github.io/scran_graph_cluster/.

clusterGraph, to define clusters (i.e., communities) from the graph.

clusterGraph.se, which builds an SNN graph from a SingleCellExperiment.

Examples

data <- matrix(rnorm(10000), ncol=1000)
out <- buildSnnGraph(data)
str(out)

# We can use this to make an igraph::graph.
g <- igraph::make_undirected_graph(out$edges, n = out$vertices)
igraph::E(g)$weight <- out$weight

Default parameters for buildSnnGraph

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the buildSnnGraph function signature.

Usage

buildSnnGraphDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

buildSnnGraphDefaults()

Center size factors

Description

Scale the size factors so they are centered at unity. This ensures that the original scale of the counts is preserved in the normalized values from normalizeCounts, which simplifies interpretation and ensures that any pseudo-count added prior to log-transformation has a predictable shrinkage effect.

Usage

centerSizeFactors(size.factors, block = NULL, mode = NULL)

Arguments

size.factors

Numeric vector of size factors across cells. Invalid size factors (e.g., non-positive, non-finite) will be ignored.

block

Vector or factor of length equal to size.factors, specifying the block of origin for each cell. Alternatively NULL, in which case all cells are assumed to be in the same block.

mode

String specifying how to scale size factors across blocks. This can be either "lowest" or "per-block", see Details.

If NULL, the default value in centerSizeFactorsDefaults is used.

This argument is only used if block is provided.

Details

"lowest" will compute the average size factor in each block, identify the lowest average across all blocks, and then scale all size factors by that value. Here, our normalization strategy involves downscaling all blocks to match the coverage of the lowest-coverage block. This is useful for datasets with big differences in coverage between blocks as it avoids egregious upscaling of low-coverage blocks. Specifically, strong upscaling allows the log-transformation to ignore any shrinkage from the pseudo-count. This is problematic as it inflates differences between cells at log-values derived from low counts, increasing noise and overstating log-fold changes. Downscaling is safer as it allows the pseudo-count to shrink the log-differences between cells towards zero at low counts, effectively sacrificing some information in the higher-coverage batches so that they can be compared to the low-coverage batches (which is preferable to exaggerating the informativeness of the latter for comparison to the former).

"per-block" will compute the average size factor in each block, and then scale each size factor by the average of block to which it belongs. The scaled size factors are identical to those obtained by separate invocations of 'center_size_factors()' on the size factors for each block. This can be desirable to ensure consistency with independent analyses of each block - otherwise, the centering would depend on the size factors in other blocks. However, any systematic differences in the size factors between blocks are lost, i.e., systematic changes in coverage between blocks will not be normalized.

Value

Numeric vector of length equal to size.factors, containing the centered size factors.

Author(s)

Aaron Lun

See Also

The center_size_factors and center_size_factors_blocked functions in https://libscran.github.io/scran_norm/.

normalizeRnaCounts.se and related functions, which center the size factors prior to normalization of a SummarizedExperiment.

Examples

centerSizeFactors(runif(100))

centerSizeFactors(runif(100), block=sample(3, 100, replace=TRUE))

Default parameters for centerSizeFactors

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the centerSizeFactors function signature.

Usage

centerSizeFactorsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

centerSizeFactorsDefaults()

Center spike-in size factors

Description

Center size factors for endogenous genes and spike-in transcripts, following the rationale in centerSizeFactors. This aims to preserve the scale of the counts in each feature set so that the normalized expression values are still comparable.

Usage

centerSpikeInFactors(endogenous, spike.ins, block = NULL, mode = NULL)

Arguments

endogenous

Numeric vector of size factors for endogenous genes across cells. Invalid size factors (e.g., non-positive, non-finite) will be ignored.

spike.ins

List of numeric vectors. Each vector should be of length equal to endogenous and contains size factors for a set of spike-in transcripts. Invalid size factors (e.g., non-positive, non-finite) will be ignored.

block

Vector or factor of length equal to endogenous, specifying the block of origin for each cell. Alternatively NULL, in which case all cells are assumed to be in the same block.

mode

String specifying how to scale size factors across blocks, see the argument of the same name in centerSizeFactors.

If NULL, the default value in centerSpikeInFactors is used.

This argument is only used if block is provided.

Details

This function is effectively a convenient wrapper around centerSizeFactors, configured to ensure that the mean of each set of spike-in size factors is the same as that of the endogenous size factors. The aim is to preserve the scale of the average abundances of the spike-ins relative to the endogenous genes for a valid comparison between feature sets. Typically, we would use spike-in transcripts to infer some properties of endogenous genes of similar molarity, e.g., technical variance.

Value

List containing:

  • endogenous, a numeric vector containing the centered size factors for the endogenous genes.

  • spike.ins, a named list of numeric vectors. Each vector is named after an entry of spike.ins and contains centered size factors for the corresponding spike-in transcripts.

Author(s)

Aaron Lun

See Also

The center_spike_in_factors and center_spike_in_factors_blocked functions in https://libscran.github.io/scran_norm/.

Examples

endogenous <- runif(100)
spike.ercc <- runif(100)
centerSpikeInFactors(endogenous, list(ERCC = spike.ercc))

block <- sample(3, 100, replace=TRUE)
centerSpikeInFactors(endogenous, list(ERCC = spike.ercc), block=block)

Default parameters for centerSpikeInFactors

Description

Default parameters for centerSpikeInFactors

Usage

centerSpikeInFactorsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

centerSpikeInFactorsDefaults()

Choose highly variable genes

Description

Choose highly variable genes (HVGs) based on a variance-related statistic. This is typically used to subset the gene-cell matrix prior to calling runPca.

Usage

chooseHighlyVariableGenes(
  stats,
  top = NULL,
  larger = NULL,
  keep.ties = NULL,
  use.bound = NULL,
  bound = NULL
)

Arguments

stats

Numeric vector of variances (or a related statistic) across all genes. Typically, the residuals from modelGeneVariances are used here.

top

Integer specifying the number of top genes to retain. Note that the actual number of retained genes may not be equal to top, depending on the other options.

If NULL, the default value in chooseHighlyVariableGenesDefaults is used.

larger

Boolean indicating whether larger values of stats correspond to more variable genes. If TRUE, HVGs are defined as those with the largest values of stats. This is typically the case for variances or related statistics, e.g., residuals.

If NULL, the default value in chooseHighlyVariableGenesDefaults is used.

keep.ties

Boolean indicating whether to keep tied values of stats, even if top may be exceeded.

If NULL, the default value in chooseHighlyVariableGenesDefaults is used.

use.bound

Boolean indicating whether to ignore genes with stats below bound (if larger = TRUE) or above bound (otherwise). Genes are not considered to be HVGs if they do not satisfy this bound, even if they are within the top genes.

If NULL, the default value in chooseHighlyVariableGenesDefaults is used.

bound

Number specifying the lower bound (if larger=TRUE) or upper bound (otherwise) to be applied to stats when use.bound = TRUE. For example, residuals from the fitted trend should be positive, which can be enforced by setting bound = 0.

If NULL, the default value in chooseHighlyVariableGenesDefaults is used.

This argument is ignored if use.bound = FALSE.

Value

Integer vector containing the indices of genes in stats that are considered to be highly variable.

Author(s)

Aaron Lun

See Also

The choose_highly_variable_genes function in https://libscran.github.io/scran_variances/.

chooseRnaHvgs.se, which choose the HVGs from the residuals computed from a SummarizedExperiment.

Examples

resids <- rexp(10000)
str(chooseHighlyVariableGenes(resids))

Default parameters for chooseHighlyVariableGenes

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the chooseHighlyVariableGenes function signature.

Usage

chooseHighlyVariableGenesDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

chooseHighlyVariableGenesDefaults()

Choose a suitable pseudo-count

Description

Choose a suitable pseudo-count to control the bias introduced by log-transformation of normalized counts from normalizeCounts. Larger pseudo-counts shrink log-expression values towards the zero-expression baseline, reducing the impact of the transformation bias at the cost of some sensitivity.

Usage

choosePseudoCount(
  size.factors,
  quantile = NULL,
  max.bias = NULL,
  min.value = NULL
)

Arguments

size.factors

Numeric vector of size factors for all cells. It is expected that these have already been centered, e.g., with centerSizeFactors. Invalid size factors (e.g., non-positive, non-finite) will be ignored.

quantile

Number specifying the quantile to use for finding the smallest/largest size factors. Setting this to zero will use the observed minimum and maximum, though in practice, this is usually too sensitive to outliers.

If NULL, the default value in choosePseudoCountDefaults is used. The default uses the 5th and 95th percentile to obtain a range that captures most of the distribution.

max.bias

Number specifying the maximum allowed bias. This is the maximum absolute value of any spurious log2-fold change between the cells with the smallest and largest size factors.

If NULL, the default value in choosePseudoCountDefaults is used.

min.value

Number specifying the minimum value for the pseudo-count.

If NULL, the default value in choosePseudoCountDefaults is used. The default of 1 will stabilize near-zero normalized expression values, otherwise these manifest as avoid large negative values.

Value

A choice of pseudo-count for normalizeCounts.

Author(s)

Aaron Lun

References

Lun ATL (2018). Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. biorXiv doi:10.1101/404962

See Also

choose_pseudo_count in https://libscran.github.io/scran_norm/.

Examples

sf <- centerSizeFactors(runif(100))
choosePseudoCount(sf)
choosePseudoCount(sf, quantile=0.01)
choosePseudoCount(sf, max.bias=0.5)

Default parameters for choosePseudoCount

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the choosePseudoCount function signature.

Usage

choosePseudoCountDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

choosePseudoCountDefaults()

Choose highly variable genes from a SummarizedExperiment

Description

Model the mean-variance relationship across genes and choose highly variable genes (HVGs) based on the residuals of the fitted trend. This calls modelGeneVariances on an assay of a SummarizedExperiment, and then calls chooseHighlyVariableGenes on the residuals.

Usage

chooseRnaHvgs.se(
  x,
  block = NULL,
  num.threads = 1,
  more.var.args = list(),
  top = 4000,
  more.choose.args = list(),
  assay.type = "logcounts",
  output.prefix = NULL,
  include.per.block = FALSE
)

formatModelGeneVariancesResult(
  model.res,
  choose.res = NULL,
  include.per.block = FALSE
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

block

Block assignment for each cell, to pass to modelGeneVariances.

num.threads

Number of threads, to pass to modelGeneVariances.

more.var.args

Named list of arguments to pass to modelGeneVariances.

top

Number of HVGs to choose, to pass to chooseHighlyVariableGenes.

more.choose.args

Named list of arguments to pass to chooseHighlyVariableGenes.

assay.type

Integer or string specifying the assay of x containing the log-normalized expression matrix for the RNA data.

output.prefix

String containing a prefix to add to the names of the link[SummarizedExperiment]{rowData} columns containing the output statistics.

include.per.block

Logical scalar indicating whether the per-block statistics should be stored in the output rowData. Only relevant if block is specified.

model.res

List returned by modelGeneVariances.

choose.res

Integer vector returned by chooseHighlyVariableGenes. This may be NULL, in which case the identities of the HVGs will not be stored.

Value

For chooseRnaHvgs.se, x is returned with the per-gene variance modelling statistics added to its rowData. The hvg column in the rowData indicates whether a gene was chosen as a HVG. If include.per.block=TRUE and block is specified, the per-block statistics are stored as a nested DataFrame in the per.block column.

For formatModelGeneVariancesResult, a DataFrame is returned with the per-gene variance modelling statistics. If choose.res is provided, a hvg column is also stored that indicates whether a gene was chosen as a HVG.

Author(s)

Aaron Lun

See Also

chooseRnaHvgsWithSpikeIns.se, to choose HVGs with technical noise estimates from spike-in data.

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("norm")
sce <- chooseRnaHvgs.se(sce)
summary(rowData(sce)$hvg)

plot(rowData(sce)$means, rowData(sce)$variances, col=factor(rowData(sce)$hvg))
curve(approxfun(rowData(sce)$means, rowData(sce)$fitted)(x), col="dodgerblue", add=TRUE)

Choose highly variable genes based on spike-ins

Description

Fit a mean-variance trend to spike-in transcripts, and use this trend to estimate the technical noise for endogenous genes of similar abundance. The biological component of variation are defined as the residuals from the trend and used to select highly variable genes.

Usage

chooseRnaHvgsWithSpikeIns.se(
  x,
  spike.altexp,
  block = NULL,
  num.threads = 1,
  more.endogenous.var.args = list(),
  more.spike.var.args = list(use.min.width = FALSE),
  top = 4000,
  more.choose.args = list(),
  assay.type = "logcounts",
  output.prefix = NULL,
  include.per.block = FALSE
)

Arguments

x

A SingleCellExperiment or one of its subclasses. The main experiment should contain count data for endogenous genes, where rows correspond to genes and columns correspond to cells. There should be one or more alternative experiments containing spike-in data, see spike.altexp.

spike.altexp

Integer or string specifying the name/index of the alternative experiment containing the spike-in data. The assay to use is determined by assay.type.

Alternatively, spike.altexp may be a named integer or character vector of length 1. The name specifies an alternative experiment while the value is the index/name of the assay to use from that experiment.

block

Block assignment for each cell, to pass to modelGeneVariances.

num.threads

Number of threads, to pass to modelGeneVariances.

more.endogenous.var.args

Named list of additional arguments to pass to modelGeneVariances for endogenous genes. Note that all trend-fitting arguments will be ignored as the mean-variance trend is obtained from the spike-in transcripts.

more.spike.var.args

Named list of additional arguments to pass to modelGeneVariances for spike-in transcripts.

top

Number of HVGs to choose, to pass to chooseHighlyVariableGenes.

more.choose.args

Named list of arguments to pass to chooseHighlyVariableGenes.

assay.type

Integer or string specifying the assay of x containing the log-normalized expression matrix for the RNA data.

output.prefix

String containing a prefix to add to the names of the link[SummarizedExperiment]{rowData} columns containing the output statistics.

include.per.block

Logical scalar indicating whether the per-block statistics should be stored in the output rowData. Only relevant if block is specified.

Details

It is generally expected that normalization has been performed with normalizeRnaCountsWithSpikeIns.se. This ensures that the means are comparable between endogenous genes and spike-in transcripts.

We set use.min.width=FALSE by default for the spike-in trend fit, as (i) there are typically too few spike-in transcripts for the default choice of min.window.count, and (ii) spike-in sets typically have regularly-spaced abundances that are amenable to regular LOWESS.

Value

x is returned with the per-gene variance modelling statistics for endogenous genes added to its rowData. This is equivalent to the columns returned by modelGeneVariances, except that:

  • fitted now contains the interpolated values from the mean-variance trend fitted to the spike-in transcripts. This represents the technical component of variation for each gene.

  • residuals now contains the difference of each gene's variance from the trend. This represents the biological component of variation for each gene.

The hvg column in the rowData indicates whether a gene was chosen as a HVG based on the largest residuals. If include.per.block=TRUE and block is specified, the per-block statistics are stored as a nested DataFrame in the per.block column.

Per-transcript modelling statistics are also added to the rowData of the alternative experiment containing the spike-in data. This has all of the same columns returned by modelGeneVariances.

Author(s)

Aaron Lun

See Also

chooseRnaHvgs.se, for a simpler function when spike-ins are not available.

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("qc")
sce <- normalizeRnaCountsWithSpikeIns.se(sce, "ERCC")

sce <- chooseRnaHvgsWithSpikeIns.se(sce, "ERCC")
summary(rowData(sce)$hvg)

plot(rowData(sce)$means, rowData(sce)$variances, col=factor(rowData(sce)$hvg))
spike.rd <- rowData(altExp(sce, "ERCC"))
points(spike.rd$means, spike.rd$variances, col="dodgerblue", pch=4)
curve(approxfun(spike.rd$means, spike.rd$fitted)(x), col="dodgerblue", add=TRUE)

Graph-based clustering of cells

Description

Identify clusters by applying community detection algorithms to a graph. This assumes that that the nodes on the graph represent cells and weighted edges are formed between related cells.

Usage

clusterGraph(
  x,
  method = c("multilevel", "leiden", "walktrap"),
  seed = NULL,
  multilevel.resolution = NULL,
  multilevel.seed = seed,
  leiden.resolution = NULL,
  leiden.seed = seed,
  leiden.objective = "modularity",
  walktrap.steps = NULL
)

Arguments

x

List containing graph information or an external pointer to a graph, as returned by buildSnnGraph. Alternatively, an igraph object with edge weights.

method

String specifying the algorithm to use.

seed

Integer specifying the random seed for some of community detection algorithms.

multilevel.resolution

Number specifying the resolution when method="multilevel". Lower values favor fewer, larger communities; higher values favor more, smaller communities.

If NULL, the default value in clusterGraphDefaults is used.

multilevel.seed

Integer specifying the random seed to use for method="multilevel".

If NULL, the default value in clusterGraphDefaults is used.

leiden.resolution

Number specifying the resolution when method="leiden". Lower values favor fewer, larger communities; higher values favor more, smaller communities.

If NULL, the default value in clusterGraphDefaults is used.

leiden.seed

Integer specifying the random seed to use for method="leiden".

If NULL, the default value in clusterGraphDefaults is used.

leiden.objective

String specifying the objective function when method="leiden". "modularity" uses the generalized modularity, "cpm" uses the Constant Potts Model, and "er" uses the Erd\"os-R\'enyi G(n, p) model. The CPM typically yields more fine-grained clusters than the modularity at the same leiden.resolution.

If NULL, the default value in clusterGraphDefaults is used.

walktrap.steps

Integer specifying the number of steps to use when method="walktrap". This determines the ability of the Walktrap algorithm to distinguish highly interconnected communities from the rest of the graph.

If NULL, the default value in clusterGraphDefaults is used.

Value

A list containing membership, a factor containing the cluster assignment for each cell. Additional fields may be present depending on the method:

  • For method="multilevel", the levels list contains the clustering result at each level of the algorithm. A modularity numeric vector also contains the modularity at each level, the highest of which corresponds to the reported membership.

  • For method="leiden", the quality number contains the quality of the partitioning.

  • For method="walktrap", the merges matrix specifies the pair of cells or clusters that were merged at each step of the algorithm. The modularity number also contains the modularity of the final partitioning.

Author(s)

Aaron Lun

See Also

The various cluster_* functions in https://libscran.github.io/scran_graph_cluster/.

clusterGraph.se, which performs clustering on graph constructed from a SingleCellExperiment.

Examples

data <- matrix(rnorm(10000), ncol=1000)
gout <- buildSnnGraph(data)
str(gout)

str(clusterGraph(gout))
str(clusterGraph(gout, method="leiden"))
str(clusterGraph(gout, method="walktrap"))

Graph-based clustering of cells in a SingleCellExperiment

Description

Construct a shared-nearest neighbor (SNN) graph from an existing low-dimensional embedding by calling buildSnnGraph on a reduced dimension entry in a SingleCellExperiment. Then, apply community detection algorithms to obtain clusters of cells with clusterGraph.

Usage

clusterGraph.se(
  x,
  num.neighbors = 10,
  num.threads = 1,
  more.build.args = list(),
  method = "multilevel",
  resolution = NULL,
  more.cluster.args = list(),
  reddim.type = "PCA",
  output.name = "clusters",
  meta.name = NULL,
  graph.name = NULL
)

Arguments

x

A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

num.neighbors

Number of neighbors for constructing the graph, passed to buildSnnGraph.

num.threads

Number of threads for graph construction, passed to buildSnnGraph.

more.build.args

Named list of further arguments to be passed to buildSnnGraph.

method

Clustering method to use, passed to clusterGraph.

resolution

Resolution for the community detection method in clusterGraph. This is either passed to multilevel.resolution or leiden.resolution depending on method.

more.cluster.args

Named list of further arguments to be passed to clusterGraph.

reddim.type

Integer or string specifying the existing embedding in the reducedDim of x. Alternatively, a named integer or character vector of length 1, where the name specifies an alternative experiment of x and the value is the name/index of a reducedDim entry in that alternative experiment.

output.name

String containing the name of the column of the colData in which to store the cluster assignments.

meta.name

String containing the name of the metadata entry in which to store extra clustering output. If NULL, no extra clustering output is stored.

graph.name

String containing the name of the metadata entry in which to store the SNN graph. If NULL, the SNN graph is not stored.

Value

x is returned with the cluster assignment for each cell stored in the colData. Additional clustering output is stored in the metadata.

Author(s)

Aaron Lun

Examples

sce <- getTestRnaData.se("pca")
sce <- clusterGraph.se(sce)
table(sce$clusters)

Default parameters for clusterGraph

Description

Default parameters from the underlying C++ library. Some of these may be overridden by defaults in the clusterGraph function signature.

Usage

clusterGraphDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

clusterGraphDefaults()

K-means clustering

Description

Perform k-means clustering with a variety of different initialization and refinement algorithms.

Usage

clusterKmeans(
  x,
  k,
  init.method = c("var-part", "kmeans++", "random"),
  refine.method = c("hartigan-wong", "lloyd"),
  warn = TRUE,
  seed = NULL,
  random.seed = seed,
  kmeanspp.seed = seed,
  var.part.optimize.partition = NULL,
  var.part.size.adjustment = NULL,
  lloyd.iterations = NULL,
  hartigan.wong.iterations = NULL,
  hartigan.wong.quick.transfer.iterations = NULL,
  hartigan.wong.quit.quick.transfer.failure = NULL,
  num.threads = NULL
)

Arguments

x

Matrix-like object where rows are dimensions and columns are cells. This is typically a dense double-precision matrix containing a low-dimensional representation from, e.g., runPca. However, any matrix representation supported by initializeCpp can also be used.

k

Integer specifying the number of clusters.

init.method

String specifying the initialization method for the centers:

  • "var-part" uses variance partitioning as described by Su and Dy (2007). The dataset is repeatedly split along the dimension of greatest variance until k partitions are formed, the centroids of which form the initial clusters. This approach is slower than the others but fully deterministic.

  • "kmeans++" uses the weighted sampling method described by Arthur and Vassilvitskii (2007). k points are sampled with probability based on the smallest distance to any previously sampled point. This improves the likelihood of choosing initial centroids that are far apart from each other.

  • "random" initialization involves choosing k random points as the initial centers. This is the simplest and fastest method but may not yield good starting points.

refine.method

String specifying the refinement method.

  • "lloyd" uses Lloyd's algorithm, which performs a batch update in each iteration. This is simple and amenable to parallelization but may not converge.

  • "hartigan-wong" uses the Hartigan-Wong algorithm, which transfers points between clusters to optimize the drop in the within-cluster sum of squares. This is slower but has a greater chance of convergence.

warn

Boolean specifying whether a warning should be emitted if the k-means algorithm failed to converge.

seed

Integer specifying the seed for random number generation in some of the initialization methods.

random.seed

Integer specifying the seed for random number generation when init.method = "random".

If NULL, the default value in clusterKmeansDefaults is used.

kmeanspp.seed

Integer specifying the seed for random number generation when init.method = "kmeans++".

If NULL, the default value in clusterKmeansDefaults is used.

var.part.optimize.partition

Boolean indicating whether each partition boundary should be optimized in init.method = "var.part". This reduces the sum of squares in the child partitions, which improves the quality of the partition at the cost of some extra compute time.

If NULL, the default value in clusterKmeansDefaults is used.

var.part.size.adjustment

Number specifying the cluster size adjustment when selecting the next cluster to partition in init.method = "var.part". This should be non-negative and no greater than 1. Setting it to 0 will select the cluster with the highest variance, ignoring the cluster size. Setting it to 1 will select the cluster with the highest sum of squares, favoring larger clusters.

If NULL, the default value in clusterKmeansDefaults is used.

lloyd.iterations

Integer specifying the maximum number of iterations for refine.method = "lloyd". Larger values increase the chance of convergence at the cost of increasing compute time.

If NULL, the default value in clusterKmeansDefaults is used.

hartigan.wong.iterations

Integer specifying the maximum number of iterations for refine.method = "hartigan-wong". Larger values increase the chance of convergence at the cost of increasing compute time.

If NULL, the default value in clusterKmeansDefaults is used.

hartigan.wong.quick.transfer.iterations

Integer specifying the maximum number of quick transfer iterations for refine.method = "hartigan-wong". Larger values increase the chance of convergence at the cost of increasing compute time.

If NULL, the default value in clusterKmeansDefaults is used.

hartigan.wong.quit.quick.transfer.failure

Boolean indicating whether to quit kon convergence failure during quick transfer iterations in refine.method = "hartigan-wong". Setting this to FALSE gives the algorithm another chance to converge by attempting another optimal transfer iteration, at the cost of more compute time.

If TRUE, the function follows the same behavior as R's kmeans. If NULL, the default value in clusterKmeansDefaults is used.

num.threads

Integer specifying the number of threads to use.

If NULL, the default value in clusterKmeansDefaults is used.

Value

List containing:

  • clusters, a factor containing the cluster assignment for each cell. The number of levels is no greater than k, where each level is an integer that refer to a column of centers.

  • centers, a numeric matrix with the coordinates of the cluster centroids (dimensions in rows, centers in columns). The number of columns is no greater than k. Empty clusters are automatically removed.

  • iterations, an integer specifying the number of refinement iterations that were performed.

  • status, an integer specifying the completion status of the algorithm. A value of zero indicates success while the meaning of any non-zero value depends on the choice of refine.method:

    • For Lloyd, a value of 2 indicates convergence failure.

    • For Hartigan-Wong, a value of 2 indicates convergence failure in the optimal transfer iterations. A value of 4 indicates convergence failure in the quick transfer iterations when hartigan.wong.quit.quick.transfer.failure = TRUE.

Author(s)

Aaron Lun

References

Hartigan JA. and Wong MA (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.

Arthur D and Vassilvitskii S (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms 1027-1035.

Su T and Dy JG (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering. Intelligent Data Analysis 11, 319-338.

See Also

https://libscran.github.io/kmeans/, which describes the various initialization and refinement algorithms in more detail.

clusterKmeans.se, for k-means clustering on a SingleCellExperiment.

Examples

x <- t(as.matrix(iris[,1:4]))
clustering <- clusterKmeans(x, k=3)
table(clustering$clusters, iris[,"Species"])

k-means clustering of cells in a SingleCellExperiment

Description

Perform k-means clustering on an existing low-dimensional embedding by calling clusterKmeans on a reduced dimension entry in a SingleCellExperiment.

Usage

clusterKmeans.se(
  x,
  k,
  num.threads = 1,
  more.kmeans.args = list(),
  reddim.type = "PCA",
  output.name = "clusters",
  meta.name = NULL
)

Arguments

x

A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

k

Number of clusters, passed to clusterKmeans.

num.threads

Number of threads, passed to clusterKmeans.

more.kmeans.args

Named list of further arguments to be passed to clusterKmeans.

reddim.type

Integer or string specifying the existing embedding in the reducedDim of x. Alternatively, a named integer or character vector of length 1, where the name specifies an alternative experiment of x and the value is the name/index of a reducedDim entry in that alternative experiment.

output.name

String containing the name of the colData column in which to store the cluster assignments.

meta.name

String containing the name of the metadata entry in which to store extra clustering output. If NULL, no extra clustering output is stored.

Value

x is returned with the cluster assignment for each cell stored in the colData. Additional clustering output is stored in the metadata.

Author(s)

Aaron Lun

Examples

sce <- getTestRnaData.se("pca")
sce <- clusterKmeans.se(sce, k=10)
table(sce$clusters)

Default parameters for clusterKmeans

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the clusterKmeans function signature.

Usage

clusterKmeansDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

clusterKmeansDefaults()

Combine multiple factors

Description

Combine multiple factors into a single factor where each level of the latter is a unique combination of levels from the former.

Usage

combineFactors(factors, keep.unused = FALSE)

Arguments

factors

An ordinary list or List of vectors or factors of the same length. Corresponding elements across all vectors/factors represent the combination of levels for a single observation. For factors, any existing levels are respected. For other vectors, the sorted and unique values are used as levels.

Alternatively, a data frame or DataFrame where each column is a vector or factor and each row corresponds to an observation.

keep.unused

Boolean indicating whether to report unused combinations of levels.

Value

List containing:

  • levels, a DataFrame containing the sorted and unique combinations of levels from factors. Each column corresponds to a factor in factors while each row corresponds to a unqiue combination.

  • index, an integer vector specifying the index into levels for each observation.

For observation i and factor j, levels[[[j]][index[i]] will recover factors[[j]][i].

Author(s)

Aaron Lun

See Also

The combine_to_factor function in https://libscran.github.io/factorize/.

Examples

combineFactors(list(
    sample(LETTERS[1:5], 100, replace=TRUE),
    sample(3, 100, replace=TRUE)
))

combineFactors(list(
    factor(sample(LETTERS[1:5], 10, replace=TRUE), LETTERS[1:5]),
    factor(sample(5, 10, replace=TRUE), 1:5)
), keep.unused=TRUE)

Compute block weights

Description

Compute a weight for each block based on the number of cells in each block. This is typically used to aggregate statistics across blocks, e.g., with weighted sums/averages.

Usage

computeBlockWeights(
  sizes,
  block.weight.policy = c("variable", "equal", "size", "none"),
  variable.block.weight = NULL
)

Arguments

sizes

Numeric vector containing the size of (i.e., number of cells in) each block.

block.weight.policy

String specifying the policy to use for weighting different blocks. This should be one of:

  • "size": the contribution of each block is proportional to its size. "none" is also a deprecated alias for "size".

  • "equal": blocks are equally weighted regardless of their size. The exception is that of empty blocks with no cells, which receive zero weight.

  • "variable": blocks are equally weighted past a certain threshold size. Below that size, the contribution of each block is proportional to its size. This avoids outsized contributions from very large blocks.

variable.block.weight

Numeric vector of length 2, specifying the parameters for variable block weighting. The first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation.

If NULL, the default value in computeBlockWeightsDefaults is used.

This argument is only used if block.weight.policy = "variable".

Value

Numeric vector containing the relative block weights.

Author(s)

Aaron Lun

See Also

The compute_weights function from https://libscran.github.io/scran_blocks/.

Examples

computeBlockWeights(c(1, 10, 100, 1000, 10000))
computeBlockWeights(c(1, 10, 100, 1000, 10000), block.weight.policy="equal")
computeBlockWeights(c(1, 10, 100, 1000, 10000), variable.block.weight=c(50, 5000))

Default parameters for computeBlockWeights

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the computeBlockWeights function signature.

Usage

computeBlockWeightsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

computeBlockWeightsDefaults()

Compute size factors for ADT counts

Description

Compute size factors from an ADT count matrix using the CLRm1 method. This is a variant of the centered log-ratio (CLR) method, where the size factors are defined from the geometric mean of counts within each cell.

Usage

computeClrm1Factors(x, num.threads = NULL)

Arguments

x

A matrix-like object containing ADT count data. Rows correspond to tags and columns correspond to cells.

num.threads

Number of threads to use.

If NULL, the default value in computeClrm1FactorsDefaults is used.

Value

Numeric vector containing the CLRm1 size factor for each cell. Note that these size factors are not centered and should be passed through, e.g., centerSizeFactors before normalization.

Author(s)

Aaron Lun

See Also

https://github.com/libscran/clrm1, for a description of the CLRm1 method.

normalizeAdtCounts.se, which computes CLRm1 factors prior to normalization.

Examples

library(Matrix)
x <- abs(rsparsematrix(1000, 100, 0.1) * 10)
head(computeClrm1Factors(x))

Default parameters for computeClrm1Factors

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the computeClrm1Factors function signature.

Usage

computeClrm1FactorsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

computeClrm1FactorsDefaults()

Batch correction with mutual nearest neighbors

Description

Apply mutual nearest neighbor (MNN) correction to remove batch effects from a low-dimensional embedding.

Usage

correctMnn(
  x,
  block,
  num.neighbors = NULL,
  num.steps = NULL,
  merge.policy = NULL,
  num.mads = NULL,
  robust.iterations = NULL,
  robust.trim = NULL,
  mass.cap = NULL,
  order = NULL,
  reference.policy = NULL,
  BNPARAM = AnnoyParam(),
  num.threads = NULL
)

Arguments

x

Numeric matrix where rows are dimensions and columns are cells, typically containing coordinates in a low-dimensional embedding (e.g., from runPca).

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in x.

num.neighbors

Integer specifying the number of neighbors in the various search steps. Larger values improve the stability of the correction by increasing the number of MNN pairs and including more observations in each center of mass. However, this comes at the cost of reduced resolution when matching subpopulations across batches.

If NULL, the default value in correctMnnDefaults is used.

num.steps

Integer specifying the number of steps for the recursive neighbor search to compute the center of mass. Larger values mitigate the kissing effect but increase the risk of including inappropriately distant subpopulations into the center of mass.

If NULL, the default value in correctMnnDefaults is used.

merge.policy

String specifying the policy to use to choose the order of batches to merge.

  • "input" will use the input order of the batches. Observations in the last batch are corrected first, and then the second-last batch, and so on. This allows users to control the merge order by simply changing the inputs.

  • "size" will merge batches in order of increasing size (i.e., the number of observations). So, the smallest batch is corrected first while the largest batch is unchanged. The aim is to lower compute time by reducing the number of observations that need to be reprocessed in later merge steps.

  • "variance" will merge batches in order of increasing variance between observations. So, the batch with the lowest variance is corrected first while the batch with the highest variance is unchanged. The aim is to lower compute time by encouraging more observations to be corrected to the most variable batch, thus avoid reprocessing in later merge steps.

  • "rss" will merge batches in order of increasing residual sum of squares (RSS). This is effectively a compromise between "variance" and "size".

If NULL, the default value in correctMnnDefaults is used.

num.mads

Deprecated and ignored.

robust.iterations

Deprecated and ignored.

robust.trim

Deprecated and ignored.

mass.cap

Deprecated and ignored.

order

Deprecated and ignored, the merge order is now always automatically determined.

reference.policy

Deprecated, use merge.policy instead.

BNPARAM

A BiocNeighborParam object specifying the nearest-neighbor algorithm to use.

num.threads

Integer specifying the number of threads to use.

If NULL, the default value in correctMnnDefaults is used.

Value

List containing:

  • corrected, a numeric matrix of the same dimensions as x. This contains the MNN-corrected coordinates for each cell (column) across dimensions (rows).

Author(s)

Aaron Lun

References

Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421-427

See Also

The compute function in https://libscran.github.io/mnncorrect/.

correctMnn.se, to perform MNN correction on a SingleCellExperiment.

Examples

# Mocking up a dataset with multiple batches.
x <- matrix(rnorm(10000), nrow=10)
b <- sample(3, ncol(x), replace=TRUE)
x[,b==2] <- x[,b==2] + 3
x[,b==3] <- x[,b==3] + 5
lapply(split(colMeans(x), b), mean) # indeed the means differ...

corrected <- correctMnn(x, b)
str(corrected)
lapply(split(colMeans(corrected$corrected), b), mean) # now merged.

MNN correction on a SingleCellExperiment

Description

Correct batch effects from an existing embedding with mutual nearest neighbors (MNNs), by calling correctMnn on a reduced dimension entry of a SingleCellExperiment.

Usage

correctMnn.se(
  x,
  block,
  BNPARAM = AnnoyParam(),
  num.threads = 1,
  more.mnn.args = list(),
  reddim.type = "PCA",
  output.name = "MNN",
  delayed.transpose = FALSE
)

Arguments

x

A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

block

Block assignment for each cell, passed to correctMnn.

BNPARAM

Algorithm for the nearest neighbor search, passed to correctMnn.

num.threads

Number of threads, passed to correctMnn.

more.mnn.args

Named list of additional arguments to pass to correctMnn.

reddim.type

String or integer specifying the reducedDim entry on which to perform MNN correction. Alternatively, a named integer or character vector of length 1, where the name specifies an alternative experiment of x and the value is the name/index of a reducedDim entry in that alternative experiment.

output.name

String containing the name of the reducedDim entry in which to store the corrected embedding.

delayed.transpose

Logical scalar indicating whether to delay the transposition when storing coordinates in the reducedDims.

Value

x is returned with the corrected embedding stored as a reducedDim entry.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("pca")
# Treating the tissue of origin as the batch.
sce <- correctMnn.se(sce, sce$tissue)
reducedDimNames(sce)

Default parameters for correctMnn

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the correctMnn function signature.

Usage

correctMnnDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

correctMnnDefaults()

Count cells in groups and blocks

Description

Tabulate the frequency of cells in each combination of groups and blocks. This is typically used to examine the distribution of cells across batches for each cluster - the presence of a batch-specific cluster may be indicative of a batch effect.

Usage

countGroupsByBlock(
  groups,
  block,
  normalize.block = FALSE,
  normalize.groups = FALSE
)

Arguments

groups

Factor specifying the group to which each cell was assigned. This is typically used for clusters.

block

Factor specifying the block to which each cell was assigned. This is typically used for batches or samples.

normalize.block

Boolean indicating whether to normalize the number of cells across blocks. If TRUE, frequencies are divided by the column sums.

normalize.groups

Boolean indicating whether to normalize the number of cells across groups. If TRUE, frequencies are divided by the row sums. This is performed after normalization of the block counts if normalize.block=TRUE.

Value

Matrix of (normalized) frequencies. Each row corresponds to a group and each column corresponds to a block.

Author(s)

Aaron Lun

See Also

table, which is used internally by this function.

Examples

groups <- sample(10, 100, replace=TRUE)
block <- sample(LETTERS[1:6], 100, replace=TRUE)

countGroupsByBlock(groups, block)
countGroupsByBlock(groups, block, normalize.block=TRUE)
countGroupsByBlock(groups, block, normalize.groups=TRUE)
countGroupsByBlock(groups, block, normalize.block=TRUE, normalize.groups=TRUE)

Quality control for CRISPR count data

Description

Compute per-cell QC metrics from an initialized matrix of CRISPR counts, and use the metrics to suggest filter thresholds to retain high-quality cells.

Usage

computeCrisprQcMetrics(x, num.threads = NULL)

suggestCrisprQcThresholds(
  metrics,
  block = NULL,
  num.mads = NULL,
  max.value.num.mads = num.mads
)

filterCrisprQcMetrics(thresholds, metrics, block = NULL)

Arguments

x

A matrix-like object where rows are CRISPRs and columns are cells. Values are expected to be counts.

num.threads

Integer scalar specifying the number of threads to use. If NULL, the default value in computeCrisprQcMetricsDefaults is used.

metrics

DataFrame of per-cell QC metrics. This should have the same structure as the return value of computeCrisprQcMetrics.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. Alternatively NULL if all cells are from the same block.

For filterCrisprQcMetrics, a blocking factor should be provided if block was used to construct thresholds.

num.mads

Number of median from the median, to define the threshold for outliers in each metric.

max.value.num.mads

Number of median from the median, to define the threshold for the maximum count in each cell. If NULL, the default value in suggestCrisprQcThresholdsDefaults is used.

thresholds

List with the same structure as produced by suggestCrisprQcThresholds.

Details

In CRISPR data, a cell is considered to be of low quality if it has a low count for its most abundant guide. However, directly defining a MAD-based outlier threshold on the maximum count is somewhat tricky as unsuccessful transfection can be common. This often results in a large subpopulation with low maximum counts, inflating the MAD and compromising the threshold calculation. Instead, we use the following approach:

  • Compute the proportion of counts in the most abundant guide (i.e., the maximum proportion) in each cell. Cells that were successfully transfected should have high maximum proportions. In contrast, unsuccessfully transfected cells will be dominated by ambient contamination and have low proportions.

  • Subset the dataset to only retain those cells with maximum proportions above the median. This assumes that at least 50 Thus, we remove all of the unsucessful transfections and enrich for mostly-high-quality cells.

  • Define a MAD-based threshold for low outliers on the log-transformed maximum count within the subset (see 'choose_filter_thresholds()' for details). This is now possible as we can assume that most of the remaining cells are of high quality.

Note that the maximum proportion is only used to define the subset for threshold calculation. Once the maximum count threshold is computed, it is applied to all cells regardless of their maximum proportions. This ensures that we correctly remove cells with low coverage, even if the proportion is high. It also allows us to retain cells transfected with multiple guides, as long as the maximum is high enough - such cells are not necessarily uninteresting, e.g., for examining interaction effects, so we will err on the side of caution and leave them in.

Value

For computeCrisprQcMetrics, a DataFrame is returned with one row per cell in x. This contains the following columns:

  • sum, a numeric vector containing the total CRISPR count for each cell. Low counts indicate that the cell was not successfully transfected with a construct or that library preparation and sequencing failed.

  • detected, an integer vector containing the number of detected guides per cell. In theory, this should be 1, as each cell should express no more than one guide construct. However, ambient contamination may introduce non-zero counts for multiple guides, without necessarily interfering with downstream analyses. As such, this metric is less useful for guide data, though we compute it anyway.

  • max.value, a numeric vector containing the count for the most abundant guide in cell. Low values indicate that the cell was not successfully transfected or that library preparation and sequencing failed.

  • max.index, an integer vector containing the row index of the most abundant guide in cell.

Each vector is of length equal to the number of cells.

For suggestCrisprQcThresholds, a named list is returned.

  • If block=NULL, the list contains:

    • max.value, a numeric scalar containing the lower bound on the maximum count. This is defined as num.mads MADs below the median of the log-transformed metrics across cells with high maximum proportions (see Details).

  • Otherwise, if block is supplied, the list contains:

    • max.value, a numeric vector containing the lower bound on the maximum counts for each blocking level. Here, the threshold is computed independently for each block, using the same method as the unblocked case.

    • block.ids, a vector containing the identities of the unique blocks.

    Each vector is of length equal to the number of levels in block and is named accordingly.

For filterCrisprQcMetrics, a logical vector of length ncol(x) is returned indicating which cells are of high quality. High-quality cells are defined as those with maximum counts above the max.value threshold.

Author(s)

Aaron Lun

See Also

The compute_crispr_qc_metrics, compute_crispr_qc_filters and compute_crispr_qc_filters_blocked functions in https://libscran.github.io/scran_qc/.

quickCrisprQc.se, to run all of the CRISPR-related QC functions on a SummarizedExperiment.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(100, 100, 0.1) * 100))

qc <- computeCrisprQcMetrics(x)
qc

filt <- suggestCrisprQcThresholds(qc)
str(filt)

keep <- filterCrisprQcMetrics(filt, qc)
summary(keep)

Default parameters for CRISPR quality control

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in each function's signature.

Usage

computeCrisprQcMetricsDefaults()

suggestCrisprQcThresholdsDefaults()

Value

Named list containing default values for the various function arguments.

Author(s)

Aaron Lun

Examples

computeCrisprQcMetricsDefaults()
suggestCrisprQcThresholdsDefaults()

Fit a mean-variance trend

Description

Fit a trend to the per-gene variances with respect to their means, typically from normalized and log-transformed expression values.

Usage

fitVarianceTrend(
  means,
  variances,
  mean.filter = NULL,
  min.mean = NULL,
  transform = NULL,
  span = NULL,
  use.min.width = NULL,
  min.width = NULL,
  min.window.count = NULL,
  num.threads = NULL
)

Arguments

means

Numeric vector containing the mean (log-)expression for each gene.

variances

Numeric vector containing the variance in the (log-)expression for each gene.

mean.filter

Logical scalar indicating whether to filter on the means before trend fitting. The assumption is that there is a bulk of low-abundance genes that are uninteresting and should be removed to avoid skewing the windows of the LOWESS smoother.

If NULL, the default value in fitVarianceTrendDefaults is used.

min.mean

Numeric scalar specifying the minimum mean of genes to use in trend fitting. Genes with lower means do not participate in the LOWESS fit, to ensure that windows are not skewed towards the majority of low-abundance genes. Instead, the fitted values for these genes are defined by extrapolating the left edge of the fitted trend is extrapolated to the origin.

If NULL, the default value in fitVarianceTrendDefaults is used; this is chosen based on the typical distribution of means of log-expression values across genes.

This argument is ignored if mean.filter=FALSE.

transform

Logical scalar indicating whether a quarter-root transformation should be applied before trend fitting. This transformation is copied from limma::voom and shrinks all values towards 1, flattening any sharp gradients in the trend for an easier fit. For example, variances are computed from log-expression values typically exhibit a strong “hump” in the mean-variance relationship.

If NULL, the default value in fitVarianceTrendDefaults is used;

span

Numeric scalar specifying the span of the LOWESS smoother, as a proportion of the total number of points. Larger values improve stability at the cost of sensitivity to changes in low-density regions.

If NULL, the default value in fitVarianceTrendDefaults is used.

This argument is ignored if use.min.width=TRUE.

use.min.width

Logical scalar indicating whether a minimum width constraint should be applied to the LOWESS smoother. This replaces the proportion-based span for defining each window. Instead, the window for each point must be of a minimum width and is extended until it contains a minimum number of points. Setting this to 'TRUE' ensures that sensitivity is maintained in the trend fit at low-density regions for the distribution of means, e.g., at high abundances. It also avoids overfitting from very small windows in high-density intervals.

If NULL, the default value in fitVarianceTrendDefaults is used.

min.width

Minimum width of the window to use when use.min.width=TRUE.

If NULL, the default value in fitVarianceTrendDefaults is used; this is chosen based on the typical range of means in single-cell RNA-seq data.

min.window.count

Minimum number of observations in each window. This ensures that each window contains at least a given number of observations for a stable fit. If the minimum width window contains fewer observations, it is extended using the standard LOWESS logic until the minimum number is achieved.

If NULL, the default value in fitVarianceTrendDefaults is used.

This argument is ignored if use.min.width = FALSE.

num.threads

Number of threads to use.

If NULL, the default value in fitVarianceTrendDefaults is used.

Value

List containing fitted, a numeric vector containing the fitted values of the trend for each gene; and residuals, a numeric vector containing the residuals from the trend.

Author(s)

Aaron Lun

See Also

modelGeneVariances, to compute the means and variances on which the trend is fitted.

The fit_variance_trend function in https://libscran.github.io/scran_variances/.

Examples

# Setting up some single-cell-like data.
mu <- 2^runif(1000, -10, 10)
counts <- matrix(rpois(20 * length(mu), lambda=mu), ncol=20)

sf <- centerSizeFactors(colSums(counts))
normalized <- normalizeCounts(counts, size.factors=sf)
stats <- modelGeneVariances(normalized)

out <- fitVarianceTrend(stats$statistics$means, stats$statistics$variances)
plot(stats$statistics$means, stats$statistics$variances)
curve(approxfun(stats$statistics$means, out$fitted)(x), col="red", add=TRUE)

Default parameters for fitVarianceTrend

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the fitVarianceTrend function signature.

Usage

fitVarianceTrendDefaults()

Value

Named list containing default values for the various function arguments.

Author(s)

Aaron Lun

Examples

fitVarianceTrendDefaults()

Get datasets for testing

Description

Get single-cell datasets from the scRNAseq package with varying levels of processing. This is primarily intended for testing other scrapper functions, e.g., in their Examples section.

Usage

getTestRnaData.se(at = c("start", "qc", "norm", "hvg", "pca", "cluster"))

getTestAdtData.se(at = c("start", "qc", "norm", "hvg", "pca"))

getTestCrisprData.se(at = c("start", "qc"))

Arguments

at

String specifying the level of processing. For "start", no processing was performed. Otherwise, the dataset is returned after quality control ("qc"), normalization ("norm"), feature selection ("hvg"), PCA ("PCA") or graph-based clustering ("cluster").

Details

For getTestRnaData, this is a scRNA-seq dataset of the mouse brain, where the main experiment contains RNA counts and the alternative experiments contain ERCC and repeat element counts. This is obtained with fetchDataset("zeisel-brain-2015", "2023-12-14").

For getTestAdtData, this is a CITE-seq dataset of human PBMCs, where the main experiment contains RNA counts and the alternative experiment contains ADT counts. This is obtained with fetchDataset("kotliarov-pbmc-2020", "2024-04-18"). Only the first 5000 cells are loaded for speed.

For getTestCrisprData, this is a Perturb-seq dataset of a pancreatic beta cell line, where the main experiment contains RNA counts and the alternative experiment contains CRISPR guide counts. This is obtained with fetchDataset("cao-pancreas-2025", "2025-10-10", "rqc"). Only the first 5000 cells are loaded for speed.

Value

A SingleCellExperiment containing a dataset at the specified level of processing.

Author(s)

Aaron Lun

See Also

fetchDataset, used to obtain each dataset.

Examples

getTestRnaData.se()
getTestAdtData.se()
getTestCrisprData.se()

Delayed log-normalization of a matrix

Description

Delayed calculation of log-normalized expression values, typically returned by normalizeCounts.

Usage

LogNormalizedMatrix(x, size.factors, pseudo.count = 1, log.base = 2)

LogNormalizedMatrixSeed(x, size.factors, pseudo.count = 1, log.base = 2)

Arguments

x

Count matrix to be normalized.

size.factors

Numeric vector of size factors, of length equal to the number of columns of x.

pseudo.count

Number specifying the pseudo-count to add prior to log-transformation.

log.base

Number specifying the base of the log-transformation.

Details

This is based on the DelayedArray framework and

Value

An instance of a LogNormalizedMatrix(Seed).

Author(s)

Aaron Lun

Examples

mat <- matrix(rpois(1000, lambda=2), ncol=10)
sf <- centerSizeFactors(colSums(mat))
norm <- LogNormalizedMatrix(mat, sf)
norm

# Also works with sparse matrices.
library(Matrix)
smat <- abs(rsparsematrix(50, 20, density=0.21))
ssf <- centerSizeFactors(colSums(smat))
snorm <- LogNormalizedMatrix(smat, ssf)
snorm

Model per-gene variances in expression

Description

Model the per-gene variances as a function of the mean in single-cell expression data. Highly variable genes can then be selected for downstream analyses.

Usage

modelGeneVariances(
  x,
  block = NULL,
  block.average.policy = NULL,
  block.weight.policy = NULL,
  variable.block.weight = NULL,
  block.quantile = NULL,
  fit.trend = NULL,
  mean.filter = NULL,
  min.mean = NULL,
  transform = NULL,
  span = NULL,
  use.min.width = NULL,
  min.width = NULL,
  min.window.count = NULL,
  num.threads = NULL
)

Arguments

x

A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., from normalizeCounts.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in x. If provided, calculation of means/variances and trend fitting are performed within each block to ensure that block effects do not confound the estimates. The weighted average of each statistic across all blocks is reported for each gene. Alternatively NULL, if all cells are from the same block.

block.average.policy

String specifying the policy to use for average statistics across blocks. This can either be "mean" (to compute a weighted mean), "quantile" (to compute a quantile) or "none" (no averaging).

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if block is not NULL.

block.weight.policy

String specifying the policy to use for weighting different blocks when computing the average for each statistic. See the argument of the same name in computeBlockWeights for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if block is not NULL and block.average.policy = "mean".

variable.block.weight

Numeric vector of length 2, specifying the parameters for variable block weighting. See the argument of the same name in computeBlockWeights for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if block is not NULL, block.average.policy = "mean" and block.weight.policy = "variable".

block.quantile

Number specifying the probability of the quantile of statistics across blocks.

If NULL, the default value in modelGeneVariancesDefaults is used; this is 0.5, i.e., the median of per-block statistics.

This argument is only used if block is not NULL and block.average.policy = "quantile".

fit.trend

Boolean indicating whether a mean-variance trend should be fitted. If FALSE, only the means and variances are computed. This can occasionally be useful when the trend is computed separately (e.g., spike-ins).

If NULL, the default value in modelGeneVariancesDefaults is used.

mean.filter

Logical scalar indicating whether to filter on the means before trend fitting. See the argument of the same name in fitVarianceTrend for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if fit.trend = TRUE.

min.mean

Number specifying the minimum mean of genes to use in trend fitting. See the argument of the same name in fitVarianceTrend for more details. If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if fit.trend = TRUE.

transform

Logical scalar indicating whether a quarter-root transformation should be applied before trend fitting. See the argument of the same name in fitVarianceTrend for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if fit.trend = TRUE.

span

Numeric scalar specifying the span of the LOWESS smoother, as a proportion of the total number of points. See the argument of the same name in fitVarianceTrend for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if fit.trend = TRUE.

use.min.width

Logical scalar indicating whether a minimum width constraint should be applied to the LOWESS smoother. See the argument of the same name in fitVarianceTrend for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if fit.trend = TRUE.

min.width

Minimum width of the window to use when use.min.width=TRUE. See the argument of the same name in fitVarianceTrend for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if fit.trend = TRUE.

min.window.count

Minimum number of observations in each window. See the argument of the same name in fitVarianceTrend for more details.

If NULL, the default value in modelGeneVariancesDefaults is used.

This argument is only used if fit.trend = TRUE.

num.threads

Integer scalar specifying the number of threads to use.

If NULL, the default value in modelGeneVariancesDefaults is used.

Details

We compute the mean and variance for each gene and fit a trend to the variances with respect to the means using fitVarianceTrend. We assume that most genes at any given abundance are not highly variable, such that the fitted value of the trend is interpreted as the “uninteresting” variance - this is mostly attributed to technical variation like sequencing noise, but can also represent constitutive biological noise like transcriptional bursting. Under this assumption, the residual can be treated as a measure of biologically interesting variation. Genes with large residuals can then be selected for downstream analyses, e.g., with chooseHighlyVariableGenes.

Value

A list containing statistics, a DataFrame with number of rows equal to the number of genes. This contains the columns means, variances, fitted and residuals, each of which is a numeric vector containing the statistic of the same name across all genes.

If block is supplied, each of the column vectors described above contains the average across all blocks. The list will also contain per.block, a list of DataFrames containing the equivalent statistics for each block; and block.ids, a vector containing the identities of the unique blocks in the same order as per.block.

If block is supplied and block.average.policy = "none", the statistics DataFrame will have no columns.

If fit.trend = FALSE, the fitted and residuals columns will be omitted from all DataFrames.

Author(s)

Aaron Lun

See Also

The model_gene_variances function in https://libscran.github.io/scran_variances/.

chooseRnaHvgs.se, which computes the variances and trend from a SummarizedExperiment.

Examples

library(Matrix)
x <- abs(rsparsematrix(1000, 100, 0.1) * 10)
out <- modelGeneVariances(x)
out

# Throwing in some blocking.
block <- sample(letters[1:4], ncol(x), replace=TRUE)
out <- modelGeneVariances(x, block=block)
out

Default parameters for modelGeneVariances

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the modelGeneVariances function signature.

Usage

modelGeneVariancesDefaults()

Value

Named list of default values for various function arguments.

Author(s)

Aaron Lun

Examples

modelGeneVariancesDefaults()

Normalize ADT counts in a SummarizedExperiment

Description

Compute (log-)normalized expression values after performing scaling normalization of an ADT count matrix. This calls computeClrm1Factors on an assay of a SingleCellExperiment, centering the subsequent size factors with centerSizeFactors, and then computing normalized log-expression values with normalizeCounts.

Usage

normalizeAdtCounts.se(
  x,
  size.factors = NULL,
  num.threads = 1,
  center = TRUE,
  block = NULL,
  mode = "lowest",
  log = TRUE,
  pseudo.count = 1,
  more.norm.args = list(),
  assay.type = "counts",
  output.name = "logcounts",
  factor.name = "sizeFactor"
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.

size.factors

Numeric vector of length equal to the number of columns of x, containing the size factor for each cell in x. If NULL, this defaults to the output of computeClrm1Factors.

num.threads

Number of threads, passed to computeClrm1Factors.

center

Logical scalar indicating whether to center the size.factors, see ?centerSizeFactors for more details.

block

Block assignments for each cell, passed to centerSizeFactors.

mode

How to center size factors in different blocks, see ?centerSizeFactors for more details.

log

Whether to log-transform the normalized expression values, see ?normalizeCounts for more details.

pseudo.count

The pseudo-count for log-transformation, see ?normalizeCounts for more details.

more.norm.args

Named list of additional arguments to pass to normalizeCounts.

assay.type

Integer or string specifying the assay of x with the count matrix.

output.name

String containing the name of the assay to store the normalized matrix.

factor.name

String containing the name of the colData column in which to store the size factors in the output object. If NULL, the size factors are not stored.

Value

x is returned with a new assay containing the (log-)normalized matrix. Size factors are also stored in the colData.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- altExp(getTestAdtData.se("qc"), "ADT")
sce <- normalizeAdtCounts.se(sce)
assayNames(sce)
summary(sizeFactors(sce))

Normalize the count matrix

Description

Apply scaling normalization and log-transformation to a count matrix. This yields normalized expression values that can be used in downstream procedures like PCA.

Usage

normalizeCounts(
  x,
  size.factors,
  log = NULL,
  pseudo.count = NULL,
  log.base = NULL,
  preserve.sparsity = NULL,
  delayed = TRUE
)

Arguments

x

A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are expected to be non-negative counts. Alternatively, an external pointer created by initializeCpp.

size.factors

A numeric vector of length equal to the number of cells in x, containing positive size factors for all cells. Any invalid values should be replaced with sanitizeSizeFactors. For most applications, these size factors should also be centered with centerSizeFactors.

log

Logical scalar indicating whether log-transformation should be performed. This ensures that downstream analyses like t-tests and distance calculations focus on relative fold-changes rather than absolute differences. The log-transformation also provides some measure of variance stabilization so that the downstream analyses are not dominated by sampling noise at large counts.

If NULL, the default value in normalizeCountsDefaults is used.

pseudo.count

Numeric scalar specifying the positive pseudo-count to add before any log-transformation. Larger values shrink the differences between cells towards zero, reducing spurious differences (but also signal) at low counts - see choosePseudoCount for comments.

If NULL, the default value in normalizeCountsDefaults is used.

This argument is ignored if log = FALSE.

log.base

Numeric scalar specifying the base of the log-transformation. This is typically set to 2 or 10 for convenient interpretation of the log-values.

If NULL, the default value in normalizeCountsDefaults is used.

This argument is ignored if log = FALSE.

preserve.sparsity

Logical scalar indicating whether to preserve sparsity for pseudo.count != 1. This improves memory and compute efficiency in downstream steps but changes the interpretation of the returned matrix. Specifically, if TRUE, users should manually add log(pseudo.count, log.base) to the returned matrix to obtain the desired log-transformed expression values.

If NULL, the default value in normalizeCountsDefaults is used.

This argument is ignored if log = FALSE or pseudo.count = 1.

delayed

Logical scalar indicating whether operations on a matrix-like x should be delayed. This improves memory efficiency at the cost of some speed in downstream operations.

Value

A normalized expression matrix in varying forms:

  • If x is a matrix-like object, a matrix-like object is returned containing the (log-transformed) normalized expression matrix. If delayed=TRUE, the returned object is a DelayedArray, otherwise the return type depends on the type of x and the operations involved.

  • If x is an external pointer produced by initializeCpp, a new external pointer is returned containing the normalized expression matrix.

Author(s)

Aaron Lun

See Also

The normalize_counts function in https://libscran.github.io/scran_norm/.

normalizeRnaCounts.se and related functions, which compute normalized expression values from a SummarizedExperiment.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))
sf <- centerSizeFactors(colSums(x))
normed <- normalizeCounts(x, size.factors=sf)
normed

# Passing a pointer.
ptr <- beachmat::initializeCpp(x)
optr <- normalizeCounts(ptr, sf)
optr

Default parameters for normalizeCounts

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the normalizeCounts function signature.

Usage

normalizeCountsDefaults()

Value

Named list of default values for various function arguments.

Author(s)

Aaron Lun

Examples

normalizeCountsDefaults()

Normalize CRISPR counts in a SummarizedExperiment

Description

Compute (log-)normalized expression values after performing scaling normalization of an CRISPR count matrix. This calls normalizeCounts on an assay of a SummarizedExperiment, after centering the size factors with centerSizeFactors.

Usage

normalizeCrisprCounts.se(
  x,
  size.factors = NULL,
  center = TRUE,
  block = NULL,
  mode = "lowest",
  log = TRUE,
  pseudo.count = 1,
  more.norm.args = list(),
  assay.type = "counts",
  output.name = "logcounts",
  factor.name = "sizeFactor",
  num.threads = 1
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to CRISPR guides and columns correspond to cells.

size.factors

Numeric vector of length equal to the number of columns of x, containing the size factor for each cell in x. If NULL, this defaults to the column sums of the count matrix in x.

center

Logical scalar indicating whether to center the size.factors, see ?centerSizeFactors for more details.

block

Block assignments for each cell, passed to centerSizeFactors.

mode

How to center size factors in different blocks, see ?centerSizeFactors for more details.

log

Whether to log-transform the normalized expression values, see ?normalizeCounts for more details.

pseudo.count

The pseudo-count for log-transformation, see ?normalizeCounts for more details.

more.norm.args

Named list of additional arguments to pass to normalizeCounts.

assay.type

Integer or string specifying the assay of x with the count matrix.

output.name

String containing the name of the assay to store the normalized matrix.

factor.name

String containing the name of the colData column in which to store the size factors in the output object. If NULL, the size factors are not stored.

num.threads

Integer specifying the number of threads for computing column sums. Only used if size.factors=NULL.

Value

x is returned with a new assay containing the (log-)normalized matrix. Size factors are also stored in the colData.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- altExp(getTestCrisprData.se("qc"), "CRISPR Guide Capture")
sce <- normalizeCrisprCounts.se(sce, size.factors=sce$sum)
assayNames(sce)
summary(sizeFactors(sce))

Normalize RNA counts in a SummarizedExperiment

Description

Compute (log-)normalized expression values for a RNA count matrix after scaling normalization. This calls normalizeCounts on an assay of a SummarizedExperiment, after centering the size factors with centerSizeFactors.

Usage

normalizeRnaCounts.se(
  x,
  size.factors = NULL,
  center = TRUE,
  block = NULL,
  mode = "lowest",
  log = TRUE,
  pseudo.count = 1,
  more.norm.args = list(),
  assay.type = "counts",
  output.name = "logcounts",
  factor.name = "sizeFactor",
  num.threads = 1
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

size.factors

Numeric vector of length equal to the number of columns of x, containing the size factor for each cell in x. If NULL, this defaults to the column sums of the count matrix in x.

center

Logical scalar indicating whether to center the size.factors, see ?centerSizeFactors for more details.

block

Block assignments for each cell, passed to centerSizeFactors.

mode

How to center size factors in different blocks, see ?centerSizeFactors for more details.

log

Whether to log-transform the normalized expression values, see ?normalizeCounts for more details.

pseudo.count

The pseudo-count for log-transformation, see ?normalizeCounts for more details.

more.norm.args

Named list of additional arguments to pass to normalizeCounts.

assay.type

Integer or string specifying the assay of x with the count matrix.

output.name

String containing the name of the assay to store the normalized matrix.

factor.name

String containing the name of the colData column in which to store the size factors in the output object. If NULL, the size factors are not stored.

num.threads

Integer specifying the number of threads for computing column sums. Only used if size.factors=NULL.

Value

x is returned with a new assay containing the (log-)normalized matrix. Size factors are also stored in the colData.

Author(s)

Aaron Lun

See Also

normalizeRnaCountsWithSpikeIns.se, to normalize RNA data with spike-in transcripts.

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("qc")
sce <- normalizeRnaCounts.se(sce, size.factors=sce$sum)
assayNames(sce)
summary(sizeFactors(sce))

Normalize RNA and spike-in counts

Description

Compute (log-)normalized expression values for endogenous genes and spike-in transcripts after scaling normalization. All sets of size factors are centered with centerSpikeInFactors prior to calling normalizeCounts on the relevant assays.

Usage

normalizeRnaCountsWithSpikeIns.se(
  x,
  spike.altexps,
  endogenous.factors = NULL,
  spike.factors = NULL,
  use.spike.ins.for.endogenous = FALSE,
  block = NULL,
  mode = "lowest",
  log = TRUE,
  pseudo.count = 1,
  more.norm.args = list(),
  assay.type = "counts",
  output.name = "logcounts",
  factor.name = "sizeFactor",
  num.threads = 1
)

Arguments

x

A SingleCellExperiment or one of its subclasses. The main experiment should contain count data for endogenous genes, where rows correspond to genes and columns correspond to cells. There should also be one or more alternative experiments containing spike-in data, see spike.altexps.

spike.altexps

Alternative experiments containing spike-in data. This should be an unnamed integer or character vector containing the names/indices of the alternative experiments of interest. The assay to use from each alternative experiment is determined by assay.type.

Alternatively, spike.altexps may be a named integer or character vector. Each name specifies an alternative experiment while each value is the index/name of the assay to use from that experiment.

endogenous.factors

Numeric vector of length equal to ncol(x), containing the size factors for the endogenous genes. If NULL, this defaults to the column sums of the relevant count matrix in x. Ignored if use.spike.ins.for.endogenous.factors is not FALSE.

spike.factors

Named list of numeric vectors containing the size factors for each spike-in set. Each entry corresponds to a spike-in set ad be named after its alternative experiment in spike.altexps. Each vector should be of length equal to ncol(x). If NULL or a spike-in set is not present in spike.factors, the column sums of the relevant count matrix is used instead.

use.spike.ins.for.endogenous

Boolean indicating whether to use spike-in factors for normalization of endogenous genes. By default, each set of spike-in factors is only used on its corresponding count matrix, while the endogenous factors are used on the endogenous count matrix.

If this is set to TRUE, the size factors from the first spike-in set in spike.altexps are used to normalize the endogenous counts. This is useful for removing technical differences in scaling while preserving differences in total RNA content between cells.

Alternatively, this may be a string specifying the name of the spike-in set from which to obtain size factors for normalization of the endogenous counts. This should refer to one of the alternative experiments in spike.altexps.

block

Block assignments for each cell, passed to centerSizeFactors.

mode

How to center size factors in different blocks, see ?centerSizeFactors for more details.

log

Whether to log-transform the normalized expression values, see ?normalizeCounts for more details.

pseudo.count

The pseudo-count for log-transformation, see ?normalizeCounts for more details.

more.norm.args

Named list of additional arguments to pass to normalizeCounts.

assay.type

Integer or string specifying the assay of x with the count matrix.

output.name

String containing the name of the assay to store the normalized matrix.

factor.name

String containing the name of the colData column in which to store the size factors in the output object. If NULL, the size factors are not stored.

num.threads

Integer specifying the number of threads for computing column sums. Only used if precomputed factors are not available in endogenous.factors or spike.factors.

Value

x is returned with new assays containing (log-)normalized matrices for endogenous genes and spike-in transcripts. Size factors are also stored in the colData for each experiment.

Author(s)

Aaron Lun

See Also

normalizeRnaCounts.se, for simpler normalization when spike-ins are not present.

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("qc")
sce <- normalizeRnaCountsWithSpikeIns.se(sce, "ERCC")

summary(sce$sizeFactor)
assayNames(sce)

summary(altExp(sce, "ERCC")$sizeFactor)
assayNames(altExp(sce, "ERCC"))

Quick quality control for ADT data in a SummarizedExperiment

Description

Quickly compute quality control (QC) metrics, thresholds and filters from ADT data in a SummarizedExperiment. This calls computeAdtQcMetrics on an assay in a SummarizedExperiment, followed by suggestAdtQcThresholds and filterAdtQcMetrics to identify high-quality cells.

Usage

quickAdtQc.se(
  x,
  subsets,
  num.threads = 1,
  thresholds = NULL,
  block = NULL,
  more.suggest.args = list(),
  assay.type = "counts",
  output.prefix = NULL,
  meta.name = "qc",
  flatten = TRUE
)

formatComputeAdtQcMetricsResult(compute.res, flatten = TRUE)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.

subsets

List of subsets of control tags, see ?computeAdtQcMetrics for more details.

num.threads

Number of threads, to pass to computeAdtQcMetrics.

thresholds

List containing pre-defined thresholds for each QC metric, see the return value of suggestAdtQcThresholds for the expected format.

block

Block assignment for each cell, to pass to suggestAdtQcThresholds and filterAdtQcMetrics.

more.suggest.args

Named list of additional arguments to pass to suggestAdtQcThresholds.

assay.type

Integer or string specifying the assay of x containing the ADT count matrix.

output.prefix

String containing a prefix to add to the names of the link[SummarizedExperiment]{colData} columns containing the output statistics.

meta.name

String containing the name of the metadata entry containing additional outputs like the filtering thresholds. If NULL, additional outputs are not reported.

flatten

Logical scalar indicating whether to flatten the subset proportions into separate columns of the link[SummarizedExperiment]{colData}. If FALSE, the subset proportions are stored in a nested DataFrame.

compute.res

DataFrame returned by computeAdtQcMetrics.

Value

For quickAdtQc.se, x is returned with additional columns added to its colData. Each column contains per-cell values for one of the QC metrics, see computeAdtQcMetrics for details. The suggested thresholds are stored as a list in metadata. The colData also contains a keep column, specifying which cells are to be retained.

For formatComputeAdtQcMetricsResult, a DataFrame is returned with the per-cell QC metrics.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- altExp(getTestAdtData.se(), "ADT")
sce <- quickAdtQc.se(sce, subsets=list(igg=grepl("IgG", rownames(sce))))
colData(sce)[,c("sum", "detected", "subset.sum.igg")]
metadata(sce)$qc$thresholds
summary(sce$keep)

Quick quality control for CRISPR data in a SummarizedExperiment

Description

Quickly compute quality control (QC) metrics, thresholds and filters from CRISPR data in a SummarizedExperiment. This calls computeCrisprQcMetrics on an assay in a SummarizedExperiment, followed by suggestCrisprQcThresholds and filterCrisprQcMetrics to identify high-quality cells.

Usage

quickCrisprQc.se(
  x,
  num.threads = 1,
  thresholds = NULL,
  block = NULL,
  more.suggest.args = list(),
  assay.type = "counts",
  output.prefix = NULL,
  meta.name = "qc"
)

formatComputeCrisprQcMetricsResult(compute.res)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to CRISPR guides and columns correspond to cells.

num.threads

Number of threads, to pass to computeCrisprQcMetrics.

thresholds

List containing pre-defined thresholds for each QC metric, see the return value of suggestRnaQcThresholds for the expected format.

block

Block assignment for each cell, to pass to suggestCrisprQcThresholds and filterCrisprQcMetrics.

more.suggest.args

Named list of additional arguments to pass to suggestCrisprQcThresholds.

assay.type

Integer or string specifying the assay of x containing the CRISPR count matrix.

output.prefix

String containing a prefix to add to the names of the link[SummarizedExperiment]{colData} columns containing the output statistics.

meta.name

String containing the name of the metadata entry containing additional outputs like the filtering thresholds. If NULL, additional outputs are not reported.

compute.res

DataFrame returned by computeCrisprQcMetrics.

Value

For quickCrisprQc.se, x is returned with additional columns added to its colData. Each column contains per-cell values for one of the QC metrics, see computeCrisprQcMetrics for details. The suggested thresholds are stored as a list in metadata. The colData also contains a keep column, specifying which cells are to be retained.

For formatComputeCrisprQcMetricsResult, a DataFrame is returned with the per-cell QC metrics.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- altExp(getTestCrisprData.se(), "CRISPR Guide Capture")
sce <- quickCrisprQc.se(sce)
colData(sce)[,c("sum", "detected", "max.value", "max.index")]
metadata(sce)$qc$thresholds
summary(sce$keep)

Quick quality control for RNA data in a SummarizedExperiment

Description

Quickly compute quality control (QC) metrics, thresholds and filters from RNA data in a SummarizedExperiment. This calls computeRnaQcMetrics on an assay in a SummarizedExperiment, followed by suggestRnaQcThresholds and filterRnaQcMetrics to identify high-quality cells.

Usage

quickRnaQc.se(
  x,
  subsets,
  num.threads = 1,
  thresholds = NULL,
  block = NULL,
  more.suggest.args = list(),
  altexp.proportions = NULL,
  assay.type = "counts",
  output.prefix = NULL,
  meta.name = "qc",
  flatten = TRUE
)

computeRnaQcMetricsWithAltExps(
  x,
  subsets,
  altexp.proportions,
  num.threads = 1,
  assay.type = "counts"
)

formatComputeRnaQcMetricsResult(compute.res, flatten = TRUE)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

subsets

List of subsets of control genes, see ?computeRnaQcMetrics for more details.

num.threads

Number of threads, to pass to computeRnaQcMetrics.

thresholds

List containing pre-defined thresholds for each QC metric, see the return value of suggestRnaQcThresholds for the expected format.

block

Block assignment for each cell, to pass to suggestRnaQcThresholds and filterRnaQcMetrics.

more.suggest.args

Named list of additional arguments to pass to suggestRnaQcThresholds.

altexp.proportions

Alternative experiments for which to compute QC metrics. This is typically used to refer to alternative experiments holding spike-in data. For each alternative experiment, the proportion is defined as X/(X+Y)X/(X+Y) where XX is the alternative experiment's total and YY is the RNA total. These proportions will be used for filtering in the same manner as the proportions computed from subsets.

More specifically, altexp.proportions should be an unnamed integer or character vector containing the names/indices of the alternative experiments of interest. The assay to use from each alternative experiment is determined by assay.type.

Alternatively, altexp.proportions may be a named integer or character vector. Each name specifies an alternative experiment while each value is the index/name of the assay to use from that experiment.

Only relevant if x is a SingleCellExperiment.

assay.type

Integer or string specifying the assay of x containing the RNA count matrix.

output.prefix

String containing a prefix to add to the names of the link[SummarizedExperiment]{colData} columns containing the output statistics.

meta.name

String containing the name of the metadata entry containing the additional outputs such as the filtering thresholds. If NULL, additional outputs are not reported.

flatten

Logical scalar indicating whether to flatten the subset proportions into separate columns of the link[SummarizedExperiment]{colData}. If FALSE, the subset proportions are stored in a nested DataFrame.

compute.res

DataFrame returned by computeRnaQcMetrics.

Value

For quickRnaQc.se, x is returned with additional columns added to its colData. Each column contains per-cell values for one of the QC metrics, see computeRnaQcMetrics for details. The suggested thresholds are stored as a list in metadata. The colData also contains a keep column, specifying which cells are to be retained. If altexp.proportions is provided, QC metrics are added to the colData of the specified alternative experiments in the output object.

For computeRnaQcMetricsWithAltExps, a list is returned containing:

  • main, the result of calling computeRnaQcMetrics on the RNA count matrix in x. The proportion of counts in each alternative experiment is added to the subsets.

  • altexp, a named list of length equal to altexp.proportions. Each inner list is the result of calling computeRnaQcMetrics on the RNA count matrix of the corresponding alternative experiment of x.

For formatComputeRnaQcMetricsResult, a DataFrame is returned containing the per-cell QC metrics.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se()
sce <- quickRnaQc.se(sce, subsets=list(mito=grepl("^mt", rownames(sce))))
colData(sce)[,c("sum", "detected", "subset.proportion.mito")]
metadata(sce)$qc$thresholds
summary(sce$keep)

# Computing spike-in proportions, if available.
sce <- getTestRnaData.se()
sce <- quickRnaQc.se(
   sce,
   subsets=list(mito=grepl("^mt", rownames(sce))),
   altexp.proportions="ERCC"
)
colData(sce)[,c("sum", "detected", "subset.proportion.mito", "subset.proportion.ERCC")]
colData(altExp(sce, "ERCC"))[,c("sum", "detected")]

Report marker statistics for a single group

Description

Combine all marker statistics for a single group into a data frame for easy inspection. Users can pick one of the columns for sorting potential marker genes.

Usage

reportGroupMarkerStatistics(
  results,
  group,
  effect.sizes = NULL,
  summaries = NULL,
  include.mean = TRUE,
  include.detected = TRUE
)

Arguments

results

Named list of marker statistics, typically generated by scoreMarkers with all.pairwise=FALSE.

group

String or integer specifying the group of interest.

effect.sizes

Character vector specifying the effect sizes of interest. If NULL, all effect sizes are reported in the returned data frame.

summaries

Character vector specifying the summary statistics of interest. If NULL, all summaries are reported in the returned data frame.

include.mean

Boolean indicating whether the mean expression should be reported in the returned data frame.

include.detected

Boolean indicating whether the proportion of detected cells should be reported in the returned data frame.

Value

Data frame where each row corresponds to a gene. Each column contains the requested statistics for group. Effect size summary columns are named as <EFFECT>.<SUMMARY>.

Author(s)

Aaron Lun

See Also

scoreMarkers, to generate results.

summarizeEffects, for the trade-offs between effect size summaries.


Quality control for RNA count data

Description

Compute per-cell QC metrics from an initialized matrix of RNA counts, and use the metrics to suggest filter thresholds to retain high-quality cells.

Usage

computeRnaQcMetrics(x, subsets, num.threads = NULL)

suggestRnaQcThresholds(
  metrics,
  block = NULL,
  num.mads = NULL,
  sum.num.mads = num.mads,
  detected.num.mads = num.mads,
  subset.proportion.num.mads = num.mads
)

filterRnaQcMetrics(thresholds, metrics, block = NULL)

Arguments

x

A matrix-like object where rows are genes and columns are cells. Values are expected to be counts.

subsets

Named list of vectors specifying gene subsets of interest, typically for control-like features like mitochondrial genes or spike-in transcripts. Each vector may be logical (whether to keep each row), integer (row indices) or character (row names). For character vectors, strings not present in rownames(x) are ignored.

num.threads

Integer scalar specifying the number of threads to use.

If NULL, the default value in computeRnaQcMetricsDefaults is used.

metrics

DataFrame of per-cell QC metrics. This should have the same structure as the return value of computeRnaQcMetrics.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. Alternatively NULL if all cells are from the same block.

For filterRnaQcMetrics, a blocking factor should be provided if block was used to construct thresholds.

num.mads

Number of median absolute deviations (MADs) from the median to define the threshold for outliers in each metric.

sum.num.mads

Number of MADS from the median to define the threshold for outliers in the total sum of counts.

If NULL, the default value in suggestRnaQcThresholdsDefaults is used.

detected.num.mads

Number of MADS from the median to define the threshold for outliers in the number of detected genes.

If NULL, the default value in suggestRnaQcThresholdsDefaults is used.

subset.proportion.num.mads

Number of MADS from the median to define the threshold for outliers in the subset proportions.

If NULL, the default value in suggestRnaQcThresholdsDefaults is used.

thresholds

List with the same structure as produced by suggestRnaQcThresholds.

Value

For computeRnaQcMetrics, a DataFrame is returned with one row per cell in x. This contains the following columns:

  • sum, a numeric vector containing the total RNA count for each cell. This represents the efficiency of library preparation and sequencing. Low totals indicate that the library was not successfully captured.

  • detected, an integer vector containing the number of detected genes per cell. This also quantifies library preparation efficiency but with greater focus on capturing transcriptional complexity.

  • subsets, a nested DataFrame where each column corresponds to a feature subset and is a numeric vector containing the proportion of counts in that subset. The exact interpretation of which depends on the nature of the subset. For example, if one subset contains all genes on the mitochondrial chromosome, higher proportions are representative of cell damage; the assumption is that cytoplasmic transcripts leak through tears in the cell membrane while the mitochondria are still trapped inside. The proportion of spike-in transcripts can be interpreted in a similar manner, where the loss of endogenous transcripts results in higher spike-in proportions.

Each vector is of length equal to the number of cells.

For suggestRnaQcThresholds, a named list is returned.

  • If block=NULL, the list contains:

    • sum, a numeric scalar containing the lower bound on the sum. This is defined as num.mads MADs below the median of the log-transformed metrics across all cells.

    • detected, a numeric scalar containing the lower bound on the number of detected genes. This is defined as num.mads MADs below the median of the log-transformed metrics across all cells.

    • subsets, a numeric vector containing the upper bound on the sum of counts in each feature subset. This is defined as num.mads MADs above the median across all cells.

  • Otherwise, if block is supplied, the list contains:

    • sum, a numeric vector containing the lower bound on the sum for each blocking level. Here, the threshold is computed independently for each block, using the same method as the unblocked case.

    • detected, a numeric vector containing the lower bound on the number of detected genes for each blocking level. Here, the threshold is computed independently for each block, using the same method as the unblocked case.

    • subsets, a list of numeric vectors containing the upper bound on the sum of counts in each feature subset for each blocking level. Here, the threshold is computed independently for each block, using the same method as the unblocked case.

    • block.ids, a vector containing the identities of the unique blocks.

    Each vector is of length equal to the number of levels in block and is named accordingly.

For filterRnaQcMetrics, a logical vector of length ncol(x) is returned indicating which cells are of high quality. High-quality cells are defined as those with sums and detected genes above their respective thresholds and subset proportions below the subsets threshold.

Author(s)

Aaron Lun

See Also

The compute_rna_qc_metrics, compute_rna_qc_filters and compute_rna_qc_filters_blocked functions in https://libscran.github.io/scran_qc/.

quickRnaQc.se, to run all of the RNA-related QC functions on a SummarizedExperiment.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))

# Mocking up a control set.
sub <- list(mito=rbinom(nrow(x), 1, 0.1) > 0)

qc <- computeRnaQcMetrics(x, sub)
qc

filt <- suggestRnaQcThresholds(qc)
str(filt)

keep <- filterRnaQcMetrics(filt, qc)
summary(keep)

Default parameters for RNA quality control

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in each function's signature.

Usage

computeRnaQcMetricsDefaults()

suggestRnaQcThresholdsDefaults()

Value

Named list containing default values for the various function arguments.

Author(s)

Aaron Lun

Examples

computeRnaQcMetricsDefaults()
suggestRnaQcThresholdsDefaults()

Run all neighbor-related steps

Description

Run all steps that require a nearest-neighbor search. This includs runUmap, runTsne and buildSnnGraph with clusterGraph. The idea is to build the index once, perform the neighbor search, and run each task in parallel to save time.

Usage

runAllNeighborSteps(
  x,
  runUmap.args = list(),
  runTsne.args = list(),
  buildSnnGraph.args = list(),
  clusterGraph.args = list(),
  BNPARAM = AnnoyParam(),
  return.graph = FALSE,
  collapse.search = TRUE,
  num.threads = 3
)

Arguments

x

Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., runPca.

Alternatively, an index constructed by buildIndex.

runUmap.args

Named list of further arguments to pass to runUmap. This can be set to NULL to omit the UMAP.

runTsne.args

Named list of further arguments to pass to runTsne. This can be set to NULL to omit the t-SNE.

buildSnnGraph.args

Named list of further arguments to pass to buildSnnGraph. Ignored if clusterGraph.args=NULL.

clusterGraph.args

Named list of further arguments to pass to clusterGraph. This can be set to NULL to omit the clustering.

BNPARAM

A BiocNeighborParam instance specifying the nearest-neighbor search algorithm to use.

return.graph

Logical scalar indicating whether to return the output of buildSnnGraph. By default, only the output of clusterGraph is returned.

collapse.search

Logical scalar indicating whether to collapse the nearest-neighbor search for each step into a single search. Steps that need fewer neighbors will take a subset of the neighbors from the collapsed search. Setting this to TRUE is faster but may not give the same results as separate searches for some nearest-neighbor algorithms (e.g., approximate methods).

num.threads

Integer scalar specifying the number of threads to use. At least one thread should be available for each step.

Value

A named list containing the results of each step. See each individual function for the format of the results.

Author(s)

Aaron Lun

See Also

runAllNeighborSteps.se, to run each neighbor-related step on a SingleCellExperiment.

Examples

x <- t(as.matrix(iris[,1:4]))
# (Turning down the number of threads so that R CMD check is happy.)
res <- runAllNeighborSteps(x, num.threads=2)
str(res)

Run all nearest neighbor steps on a SummarizedExperiment

Description

Concurrently run all steps involving a nearest-neighbor search (t-SNE, UMAP and graph-based clustering) using the same nearest-neighbor index, by calling runAllNeighborSteps on a reduced dimension entry of a SingleCellExperiment.

Usage

runAllNeighborSteps.se(
  x,
  umap.output.name = "UMAP",
  umap.dim.prefix = "UMAP",
  more.umap.args = list(),
  tsne.output.name = "TSNE",
  tsne.dim.prefix = "TSNE",
  more.tsne.args = list(),
  build.graph.name = NULL,
  more.build.graph.args = list(),
  cluster.output.name = "clusters",
  cluster.meta.name = NULL,
  more.cluster.graph.args = list(),
  BNPARAM = AnnoyParam(),
  num.threads = 3,
  more.neighbor.args = list(),
  reddim.type = "PCA"
)

Arguments

x

A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

umap.output.name

String containing the name of the reducedDim entry to store the UMAP coordinates. If NULL, the UMAP is not computed.

umap.dim.prefix

String containing a prefix for the column names of the UMAP coordinate matrix, see the dim.prefix argument in runUmap.se for details.

more.umap.args

Named list of additional arguments to pass to runAllNeighborSteps as runUmap.args.

tsne.output.name

String containing the name of the reducedDim entry to store the t-SNE coordinates. If NULL, the t-SNE is not computed.

tsne.dim.prefix

String containing a prefix for the column names of the t-SNE coordinate matrix, see the dim.prefix argument in runTsne.se for details.

more.tsne.args

Named list of additional arguments to pass to runAllNeighborSteps as runTsne.args.

build.graph.name

String containing the name of the metadata entry in which to store the nearest neighbor graph. If NULL, the graph is not stored.

more.build.graph.args

Named list of additional arguments to pass to runAllNeighborSteps as buildSnnGraph.args.

cluster.output.name

String containing the name of the colData column in which to store the cluster assignments. If NULL, graph-based clustering is not performed.

cluster.meta.name

String containing the name of the metadata entry in which to store additional clustering outputs. If NULL, these additional outputs are not stored.

more.cluster.graph.args

Named list of additional arguments to pass to runAllNeighborSteps as clusterGraph.args.

BNPARAM, num.threads

Arguments to pass to runAllNeighborSteps.

more.neighbor.args

Named list of additional arguments to pass to runAllNeighborSteps.

reddim.type

String or integer specifying the reducedDim entry on which to perform a nearest neighbor search. Alternatively, a named integer or character vector of length 1, where the name specifies an alternative experiment of x and the value is the name/index of a reducedDim entry in that alternative experiment.

Value

x is returned with additional coordinates stored in its reducedDims and clustering output in its colData. Additional information may also be stored in its metadata.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("pca")
sce <- runAllNeighborSteps.se(
   sce,
   more.tsne.args=list(max.iterations=50),
   more.umap.args=list(num.epochs=50),
   num.threads=2 # to keep R CMD check happy
)
reducedDimNames(sce)
table(sce$clusters)

Principal components analysis

Description

Run a PCA on the gene-by-cell log-expression matrix and extract the top principal components (PCs). This yields a low-dimensional representation that reduces noise and compute time in downstream analyses. For efficiency, the PCA itself is approximated using IRLBA.

Usage

runPca(
  x,
  number = NULL,
  scale = NULL,
  block = NULL,
  block.weight.policy = NULL,
  variable.block.weight = NULL,
  components.from.residuals = NULL,
  center.scores.by.block = components.from.residuals,
  subset = NULL,
  extra.work = NULL,
  iterations = NULL,
  tolerance = NULL,
  seed = NULL,
  realized = NULL,
  warn = TRUE,
  num.threads = NULL
)

Arguments

x

A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Typically, the matrix is expected to contain log-expression values (see normalizeCounts) for “interesting” genes (see chooseHighlyVariableGenes).

number

Integer scalar specifying the number of top PCs to retain. More PCs will capture more biological signal at the cost of increasing noise and compute time. If this is greater than the maximum number of PCs (i.e., the smaller dimension of x), only the maximum number of PCs will be reported in the results.

If NULL, the default value in runPcaDefaults is used.

scale

Logical scalar indicating whether to scale all genes to have the same variance. This ensures that each gene contributes equally to the PCA, favoring consistent variation across many genes rather than large variation in a few genes. If block is specified, each gene's variance is calculated as a weighted sum of the variances from each block. Genes with zero variance are ignored.

If NULL, the default value in runPcaDefaults is used.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in x. The PCA will be performed on the residuals after regressing out the block effect, ensuring that differences between block do not dominate the variation in the dataset. Alternatively NULL if all cells are from the same block.

block.weight.policy

String specifying the policy to use for weighting the contribution of different blocks to the PCA. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in runPcaDefaults is used.

This argument is only used if block is not NULL.

variable.block.weight

Numeric vector of length 2, specifying the parameters for variable block weighting. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in runPcaDefaults is used.

This argument is only used if block is not NULL and block.weight.policy = "variable".

components.from.residuals

Deprecated, use center.scores.by.block instead.

center.scores.by.block

Boolean indicating whether to center the PC scores within each block, see Details.

If NULL, the default value in runPcaDefaults is used.

This argument is only used if block is not NULL.

subset

Integer, logical or character vector specifying the rows of x to use for the PCA. This yields the same results as runPca on x[subset,], except that entries of the rotation matrix will also be computed for rows outside of the subset. If NULL, all rows of x are used.

extra.work

Integer scalar specifying the number of extra dimensions for the IRLBA workspace. Larger values improve accuracy at the cost of compute time.

If NULL, this defaults to the larger of 7 and number.

iterations

Integer scalar specifying the maximum number of restart iterations for IRLBA. Larger values improve accuracy at the cost of compute time.

If NULL, the default value in runPcaDefaults is used.

tolerance

Number specifying the tolerance on the approximation error of the singular triplets, to determine IRLBA convergence. Lower values improve accuracy at the cost of compute time.

If NULL, the default value in runPcaDefaults is used.

seed

Integer scalar specifying the seed for the initial random vector in IRLBA.

If NULL, the default value in runPcaDefaults is used.

realized

Logical scalar indicating whether to realize x into an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.

If NULL, the default value in runPcaDefaults is used.

warn

Boolean specifying whether a warning should be emitted if IRLBA failed to converge.

num.threads

Number of threads to use.

If NULL, the default value in runPcaDefaults is used.

Details

When block is specified, the interpretation of the reported PC scores in components depends on the choice of center.scores.by.block:

  • If FALSE (the default), the centroid of all cells is shifted to the origin. This does not attempt to explicitly remove any differences between blocks. Any differences in expression that are not orthogonal to the rotation vectors will still manifest in the PC scores. In this mode, blocking only reduces the impact of inter-block differences on the identification of the rotation vectors.

  • If TRUE, the scores are centered at the origin within each block. This yields a low-dimensional space where inter-block differences have been removed, assuming that all blocks have the same subpopulation composition and the inter-block differences are consistent for all cell subpopulations. Under these assumptions, we could use these components for downstream analysis without any concern for block-wise effects.

In complex datasets, the assumptions mentioned for TRUE not hold and more sophisticated batch correction methods like MNN correction are required. Functions like correctMnn will accept a low-dimensional embedding of cells that can be created as described above with FALSE.

Value

List containing:

  • components, a matrix of PC scores. Rows are dimensions (i.e., PCs) and columns are cells.

  • rotation, the rotation matrix. Rows are genes and columns are dimensions.

  • variance.explained, the vector of variances explained by each PC.

  • total.variance, the total variance in the dataset. This can be used to divide variance.explained to obtain the proportion of variance explained by each PC.

  • center, a numeric vector containing the mean for each gene. If block is provided, this is instead a matrix containing the mean for each gene (column) in each block (row).

  • block.ids, a vector containing the identities of the unique blocks in the same order as the rows of center. Only reported if block is provided.

  • scale, a numeric vector containing the scaling for each gene. Only reported if scale=TRUE.

  • metrics, a named list containing IRLBA metrics. This includes converged, whether the algorithm converged successfully; iterations, the number of IRLBA restart iterations; and multiplications, the number of matrix multiplications.

Author(s)

Aaron Lun

See Also

The simple_pca and blocked_pca functions for https://libscran.github.io/scran_pca/.

runPca.se, to run a PCA on a SummarizedExperiment.

Examples

library(Matrix)
x <- abs(rsparsematrix(1000, 100, 0.1) * 10)
y <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x)))

# A simple PCA:
out <- runPca(y)
str(out)

# Blocking on uninteresting factors:
block <- sample(LETTERS[1:3], ncol(y), replace=TRUE)
bout <- runPca(y, block=block)
str(bout)

Principal components analysis of a Summarizedexperiment

Description

Compact and denoise the dataset by performing PCA on the (log-)normalized expression matrix, by calling runPca on an assay of a SummarizedExperiment.

Usage

runPca.se(
  x,
  features,
  number = 25,
  block = NULL,
  num.threads = 1,
  more.pca.args = list(),
  assay.type = "logcounts",
  output.name = "PCA",
  meta.name = "PCA",
  dim.prefix = "PC",
  delayed.transpose = FALSE
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

features

Integer, logical or character vector containing the features of interest to use in the PCA. For RNA data, this is typically the hvg vector added by chooseRnaHvgs.se. If NULL, all available features are used.

number

Number of PCs to retain, passed to runPca.

block

Block assignment for each cell, passed to runPca.

num.threads

Number of threads for the PCA, passed to runPca.

more.pca.args

Named list of additional arguments to pass to runPca.

assay.type

Integer or string specifying the assay of x to be used for PCA. This is typically the log-normalized expression matrix created by normalizeRnaCounts.se.

output.name

String containing the name of the reducedDim entry in which to store the PC scores.

meta.name

String containing the name of the link[S4Vectors]{metadata} entry in which to store other PCA statistics.

dim.prefix

String containing a prefix for the column names of the score and rotation matrices. Each column is named as dim.prefix followed by its column number. If NULL, no column names are added.

delayed.transpose

Logical scalar indicating whether to delay the transposition when storing coordinates in the reducedDims.

Value

x is returned with the principal component scores in the reducedDim. (This is converted to a SingleCellExperiment if it wasn't one already.) Additional outputs (e.g., rotation matrix, variance explained) are stored in the metadata.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("hvg")
sce <- runPca.se(sce, rowData(sce)$hvg)
dim(reducedDim(sce, "PCA"))
plot(metadata(sce)$PCA$variance.explained / metadata(sce)$PCA$total.variance)

Default parameters for runPca

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the runPca function signature.

Usage

runPcaDefaults(block = NULL, subset = NULL)

Arguments

block

See the argument of the same name in runPca.

subset

See the argument of the same name in runPca.

Value

Named list containing default values for various function arguments. These defaults may change depending on the values for subset and block.

Author(s)

Aaron Lun

Examples

runPcaDefaults()

t-stochastic neighbor embedding

Description

Compute t-SNE coordinates to visualize similarities between cells.

Usage

runTsne(
  x,
  perplexity = NULL,
  num.neighbors = NULL,
  theta = NULL,
  early.exaggeration.iterations = NULL,
  exaggeration.factor = NULL,
  momentum.switch.iterations = NULL,
  start.momentum = NULL,
  final.momentum = NULL,
  eta = NULL,
  max.depth = 7,
  leaf.approximation = NULL,
  max.iterations = NULL,
  seed = 42,
  num.threads = NULL,
  BNPARAM = AnnoyParam()
)

tsnePerplexityToNeighbors(perplexity)

Arguments

x

Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., runPca.

Alternatively, a named list of nearest-neighbor search results like that returned by findKNN. This should contain index, an integer matrix where rows are neighbors and columns are cells; and distance, a numeric matrix of the same dimensions containing the distances to each neighbor. Each column contains 1-based indices for the nearest neighbors of the corresponding cell, ordered by increasing distance. The number of neighbors should be the same as tsnePerplexityToNeighbors(perplexity).

Alternatively, an index constructed by buildIndex.

perplexity

Numeric scalar specifying the perplexity to use in the t-SNE algorithm. Higher perplexities will focus on global structure, at the cost of increased runtime and decreased local resolution.

If NULL, the default value in runTsneDefaults is used.

num.neighbors

Deprecated and ignored. The number of neighbors is now computed from perplexity via tsnePerplexityToNeighbors.

theta

Number specifying the approximation level for the Barnes-Hut calculation of repulsive forces. Lower values increase accuracy at the cost of increased compute time. Any value should be non-negative.

If NULL, the default value in runTsneDefaults is used.

early.exaggeration.iterations

Integer scalar specifying the number of iterations of the early exaggeration phase, where clusters are artificially compacted to leave more empty space so that cells can easily relocate to find a good global organization. Larger values improve convergence within this phase at the cost of reducing the remaining iterations in max.iterations.

If NULL, the default value in runTsneDefaults is used.

exaggeration.factor

Numeric scalar containing the exaggeration factor for the early exaggeration phase (see early.exaggeration.iterations). Larger values increase the attraction between nearest neighbors to favor local structure.

If NULL, the default value in runTsneDefaults is used.

momentum.switch.iterations

Integer scalar specifying the number of iterations to perform before switching from the starting momentum to the final momentum. Higher momentums can improve convergence by increasing the step size and smoothing over local oscillations, at the risk of potentially skipping over relevant minima.

If NULL, the default value in runTsneDefaults is used.

start.momentum

Numeric scalar containing the starting momentum, to be used in the iterations before the momentum switch at momentum.switch.iterations. This is usually lower than final.momentum to avoid skipping over suitable local minima.

If NULL, the default value in runTsneDefaults is used.

final.momentum

Numeric scalar containing the final momentum, to be used in the iterations after the momentum switch at momentum.switch.iterations. This is usually higher than start.momentum to accelerate convergence to the local minima once the observations are moderately well-organized.

If NULL, the default value in runTsneDefaults is used.

eta

Numeric scalar containing the learning rate, used to scale the updates for each cell. Larger values can speed up convergence at the cost of skipping over local minima.

If NULL, the default value in runTsneDefaults is used.

max.depth

Integer scalar specifying the maximum depth of the Barnes-Hut quadtree. If neighboring cells cannot be separated before the maximum depth is reached, they will be assigned to the same leaf node of the quadtree. Smaller values (7-10) improve speed by bounding the recursion depth at the cost of accuracy.

If NULL, the default value in runTsneDefaults is used.

leaf.approximation

Logical scalar indicating whether to use the “leaf approximation”. If TRUE, repulsive forces are computed between leaf nodes and re-used across all cells assigned to that leaf node. This sacrifices some accuracy for greater speed, assuming that max.depth is small enough for multiple cells to be assigned to the same leaf.

If NULL, the default value in runTsneDefaults is used.

max.iterations

Integer scalar specifying the maximum number of iterations to perform. Larger values improve convergence at the cost of compute time.

If NULL, the default value in runTsneDefaults is used.

seed

Integer scalar specifying the seed to use for generating the initial coordinates.

num.threads

Integer scalar specifying the number of threads to use.

If NULL, the default value in runTsneDefaults is used.

BNPARAM

A BiocNeighborParam object specifying the algorithm to use. Only used if x is not a prebuilt index or a list of existing nearest-neighbor search results.

Value

For runTsne, a numeric matrix where rows are cells and columns are the two dimensions of the embedding.

For tsnePerplexityToNeighbors, an integer scalar specifying the number of neighbors to use for a given perplexity.

Author(s)

Aaron Lun

References

van der Maaten LJP and Hinton GE (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research_ 9, 2579-2605.

van der Maaten LJP (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15, 3221-3245.

See Also

https://libscran.github.io/qdtsne/, for an explanation of the approximations.

runTsne.se, to run t-SNE on a SingleCellExperiment.

Examples

x <- t(as.matrix(iris[,1:4]))
embedding <- runTsne(x)
plot(embedding[,1], embedding[,2], col=iris[,5])

t-SNE on a SummarizedExperiment

Description

Generate a t-SNE visualization from an existing embedding, by calling runUmap on a reduced dimension entry in SingleCellExperiment.

Usage

runTsne.se(
  x,
  perplexity = 30,
  num.threads = 1,
  more.tsne.args = list(),
  reddim.type = "PCA",
  output.name = "TSNE",
  dim.prefix = "TSNE"
)

Arguments

x

A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

perplexity

Perplexity to use in the t-SNE algorithm, passed to runTsne.

num.threads

Number of threads for the neighbor search and optimization, passed to runTsne.

more.tsne.args

Named list of further arguments to pass to runTsne.

reddim.type

Integer or string specifying the existing embedding in the reducedDim of x. Alternatively, a named integer or character vector of length 1, where the name specifies an alternative experiment of x and the value is the name/index of a reducedDim entry in that alternative experiment.

output.name

String containing the name of the output reducedDim.

dim.prefix

String containing a prefix for the column names of the t-SNE coordinate matrix. Each column is named as dim.prefix followed by its column number. If NULL, no column names are added.

Value

x is returned with the t-SNE coordinates stored in the reducedDim.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("pca")
# Using fewer iterations for a faster-running example.
sce <- runTsne.se(sce, more.tsne.args=list(max.iterations=50))
head(reducedDim(sce, "TSNE"))

Default parameters for runTsne

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the runTsne function signature.

Usage

runTsneDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

runTsneDefaults()

Uniform manifold approximation and projection

Description

Compute UMAP coordinates to visualize similarities between cells.

Usage

runUmap(
  x,
  num.dim = 2,
  num.neighbors = NULL,
  local.connectivity = NULL,
  bandwidth = NULL,
  mix.ratio = NULL,
  spread = NULL,
  min.dist = NULL,
  a = NULL,
  b = NULL,
  repulsion.strength = NULL,
  initialize.method = NULL,
  initial.coordinates = NULL,
  initialize.random.on.spectral.fail = NULL,
  initialize.spectral.scale = NULL,
  initialize.spectral.jitter = NULL,
  initialize.spectral.jitter.sd = NULL,
  initialize.random.scale = NULL,
  initialize.seed = NULL,
  num.epochs = NULL,
  learning.rate = NULL,
  negative.sample.rate = NULL,
  optimize.seed = NULL,
  num.threads = NULL,
  num.threads.optimize = NULL,
  parallel.optimization = FALSE,
  BNPARAM = AnnoyParam()
)

Arguments

x

Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., runPca.

Alternatively, a named list of nearest-neighbor search results like that returned by findKNN. This should contain index, an integer matrix where rows are neighbors and columns are cells; and distance, a numeric matrix of the same dimensions containing the distances to each neighbor. Each column contains 1-based indices for the nearest neighbors of the corresponding cell, ordered by increasing distance.

Alternatively, an index constructed by buildIndex.

num.dim

Integer scalar specifying the number of dimensions of the output embedding.

num.neighbors

Integer scalar specifying the number of neighbors to use to define the fuzzy sets. Larger values improve connectivity and favor preservation of global structure, at the cost of increased compute time.

If NULL, the default value from runUmapDefaults is used.

This argument is ignored if x contains pre-computed neighbor search results, in which case the number of neighbors is determined from x.

local.connectivity

Numeric scalar specifying the number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure. This may be a fractional number of neighbors, in which case interpolation is performed when computing the membership confidence.

If NULL, the default value from runUmapDefaults is used.

bandwidth

Numeric scalar specifying the effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.

If NULL, the default value from runUmapDefaults is used.

mix.ratio

Numeric scalar between 0 and 1 specifying the mixing ratio when combining fuzzy sets. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values favor connectivity and more global structure.

If NULL, the default value from runUmapDefaults is used.

spread

Numeric scalar specifying the scale of the coordinates of the final low-dimensional embedding.

If NULL, the default value from runUmapDefaults is used.

This argument is ignored if a and b are provided.

min.dist

Numeric scalar specifying the minimum distance between observations in the final low-dimensional embedding. Smaller values will increase local clustering while larger values favor a more even distribution of observations throughout the low-dimensional space. This is interpreted relative to spread.

If NULL, the default value from runUmapDefaults is used.

This argument is ignored if a and b are provided.

a

Numeric scalar specifying the aa parameter for the fuzzy set membership strength calculations. Larger values yield a sharper decay in membership strength with increasing distance between observations.

If this or b is NULL, a suitable value for this parameter is automatically determined from spread and min.dist.

b

Numeric scalar specifying the bb parameter for the fuzzy set membership strength calculations. Larger values yield an earlier decay in membership strength with increasing distance between observations.

If this or a is NULL, a suitable value for this parameter is automatically determined from spread and min.dist.

repulsion.strength

Numeric scalar specifying the modifier for the repulsive force. Larger values increase repulsion and favor local structure.

If NULL, the default value from runUmapDefaults is used.

initialize.method

String specifying how to initialize the embedding. This should be one of:

  • spectral: spectral decomposition of the normalized graph Laplacian. Specifically, the initial coordinates are defined from the eigenvectors corresponding to the smallest non-zero eigenvalues. This fails in the presence of multiple graph components or if the approximate SVD fails to converge.

  • random: fills the embedding with random draws from a normal distribution.

  • none: uses initial values from initial.coordinates.

If NULL, the default value from runUmapDefaults is used.

initial.coordinates

Numeric matrix of initial coordinates, with number of rows equal to the number of observations and number of columns equal to num.dim.

This argument is required if initialize.method = "none"; or initialize.method = "spectral" and initialize.random.on.spectral.fail = FALSE, where it is used if spectral initialization fails.

initialize.random.on.spectral.fail

Logical scalar indicating whether to fall back to random sampling (i.e., same as random) if spectral initialization fails due to the presence of multiple components in the graph. If FALSE, the values in initial.coordinates will be used instead, i.e., same as none.

If NULL, the default value from runUmapDefaults is used.

This argument is only used if initialize.method = "spectral" and spectral initialization fails.

initialize.spectral.scale

Numeric scalar specifying the maximum absolute magnitude of the coordinates after spectral initialization. All initial coordinates are scaled such that the maximum of the absolute values is equal to initialize.spectral.scale. This ensures that outlier observations will not have large absolute distances that may interfere with optimization.

If NULL, the default value from runUmapDefaults is used.

This argument is only used if initialize.method = "spectral" and spectral initialization does not fail.

initialize.spectral.jitter

Logical scalar indicating whether to jitter coordinates after spectral initialization to separate duplicate observations (e.g., to avoid overplotting). This is done using normally-distributed noise of mean zero and standard deviation of initialize.spectral.jitter.sd.

If NULL, the default value from runUmapDefaults is used.

This argument is only used if initialize.method = "spectral" and spectral initialization does not fail.

initialize.spectral.jitter.sd

Numeric scalar specifying the standard deviation of the jitter to apply after spectral initialization. Only relevant if initialize.method = "spectral" and spectral initialization does not fail and initialize.spectral.jitter = TRUE.

initialize.random.scale

Numeric scalar specifying the scale of the randomly generated initial coordinates. Coordinates are sampled from a uniform distribution from [x,x)[-x, x) where xx is initialize.random.scale.

If NULL, the default value from runUmapDefaults is used.

This argument is only used if initialize.method = "random", or initialize.method = "spectral" and spectral initialization fails and initialize.random.on.spectral.fail = TRUE.

initialize.seed

Numeric scalar specifying the seed for the random number generation during initialization.

If NULL, the default value from runUmapDefaults is used.

This argument is only used if initialize.method = "random", or initialize.method = "spectral" and initialize.spectral.jitter = TRUE; or initialize.method = "spectral" and spectral initialization fails and initialize.random.on.spectral.fail = TRUE.

num.epochs

Integer scalar specifying the number of epochs for the gradient descent, i.e., optimization iterations. Larger values improve accuracy at the cost of increased compute time.

If NULL, the default value from runUmapDefaults is used. This depends on the size of the dataset:

  • For datasets with no more than 10000 observations, the default number of epochs is set to 500.

  • For larger datasets, the number of epochs is inversely proportional to the number of cells, starting from 500 and decreasing asymptotically to a lower limit of 200. This choice aims to reduce computational work for very large datasets.

learning.rate

Numeric scalar specifying the initial learning rate used in the gradient descent. Larger values can accelerate convergence but at the risk of skipping over suitable local optima.

If NULL, the default value from runUmapDefaults is used.

negative.sample.rate

Numeric scalar specifying the rate of sampling negative observations to compute repulsive forces. Greater values will improve accuracy but increase compute time.

If NULL, the default value from runUmapDefaults is used.

optimize.seed

Numeric scalar specifying the seed to use for the optimization epochs.

If NULL, the default value from runUmapDefaults is used.

num.threads

Integer scalar specifying the number of threads to use for most UMAP steps.

If NULL, the default value from runUmapDefaults is used.

num.threads.optimize

Integer scalar specifying the number of threads to use during UMAP layout optimization. Parallel optimization uses spinlocks so the number of threads should not be greater than the number of available CPUs.

If NULL, the default value from runUmapDefaults is used.

parallel.optimization

Deprecated, use num.threads.optimize instead.

BNPARAM

A BiocNeighborParam object specifying the algorithm to use. Only used if x is not a prebuilt index or a list of existing nearest-neighbor search results.

Value

A numeric matrix where rows are cells and columns are the two dimensions of the embedding.

Author(s)

Aaron Lun

References

McInnes L, Healy J, Melville J (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv, https://arxiv.org/abs/1802.03426

See Also

https://libscran.github.io/umappp/, for details on the underlying implementation.

runUmap.se, to run UMAP on a SingleCellExperiment.

Examples

x <- t(as.matrix(iris[,1:4]))
embedding <- runUmap(x)
plot(embedding[,1], embedding[,2], col=iris[,5])

UMAP on a SummarizedExperiment

Description

Generate a UMAP visualization from an existing embedding, by calling runUmap on a reduced dimension entry in SingleCellExperiment.

Usage

runUmap.se(
  x,
  num.dim = 2,
  min.dist = 0.1,
  num.neighbors = 15,
  num.threads = 1,
  more.umap.args = list(),
  reddim.type = "PCA",
  output.name = "UMAP",
  dim.prefix = "UMAP"
)

Arguments

x

A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

num.dim

Number of dimensions in the output embedding, passed to runUmap.

min.dist

Minimum distance between observations, passed to runUmap.

num.neighbors

Number of neighbors for constructing the fuzzy sets, passed to runUmap.

num.threads

Number of threads for the UMAP, passed to runUmap.

more.umap.args

Named list of further arguments to pass to runUmap.

reddim.type

Integer or string specifying the existing embedding in the reducedDim of x. Alternatively, a named integer or character vector of length 1, where the name specifies an alternative experiment of x and the value is the name/index of a reducedDim entry in that alternative experiment.

output.name

String containing the name of the output reducedDim.

dim.prefix

String containing a prefix for the column names of the t-SNE coordinate matrix. Each column is named as dim.prefix followed by its column number. If NULL, no column names are added.

Value

x is returned with the UMAP coordinates stored in the reducedDim.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestRnaData.se("pca")
# Using fewer epochs for a faster-running example.
sce <- runUmap.se(sce, more.umap.args=list(num.epochs=50))
head(reducedDim(sce, "UMAP"))

Default parameters for runUmap

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the runUmap function signature.

Usage

runUmapDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

runUmapDefaults()

Sanitize size factors

Description

Replace invalid size factors, i.e., zero, negative, infinite or NaN values. Such size factors can occasionally arise if, e.g., insufficient quality control was performed upstream. Removing them ensures that the normalized values from normalizeCounts remain finite for sensible downstream processing.

Usage

sanitizeSizeFactors(
  size.factors,
  replace.zero = NULL,
  replace.negative = NULL,
  replace.infinite = NULL,
  replace.nan = NULL,
  handle.zero = NULL,
  handle.negative = NULL,
  handle.infinite = NULL,
  handle.nan = NULL
)

Arguments

size.factors

Numeric vector of size factors across cells.

replace.zero

Deprecated, use handle.zero instead.

replace.negative

Deprecated, use handle.negative instead.

replace.infinite

Deprecated, use handle.infinite instead.

replace.nan

Deprecated, use handle.nan instead.

handle.zero

String specifying how to handle replace size factors of zero. If "sanitize", zero size factors are replaced with the lowest positive factor in size.factors. This ensures that the normalized values will be large to reflect the extremity of the scaling, but still finite for sensible downstream processing. If "error", an error is raised upon encountering a size factor of zero. If "ignore", no action is taken.

If NULL, the default value in sanitizeSizeFactorsDefaults is used.

handle.negative

String specifying how to handle negative size factors. If "sanitize", negative size factors are replaced with the lowest positive factor in size.factors. This ensures that the normalized values will be large to reflect the extremity of the scaling, but still finite for sensible downstream processing. If "error", an error is raised upon encountering a negative size factor. If "ignore", no action is taken.

If NULL, the default value in sanitizeSizeFactorsDefaults is used.

handle.infinite

String specifying how to handle infinite size factors. If "sanitize", infinite size factors are replaced with the largest positive factor in size.factors. This ensures that any normalized values will be, at least, finite; the choice of a relatively large replacement value reflects the extremity of the scaling. If "error", an error is raised upon encountering an infinite size factor. If "ignore", no action is taken.

If NULL, the default value in sanitizeSizeFactorsDefaults is used.

handle.nan

String specifying how to handle NaN size factors. If "sanitize", NaN size factors are replaced with unity, e.g., scaling normalization is a no-op. If "error", an error is raised upon encountering an NaN size factor. If "ignore", no action is taken.

If NULL, the default value in sanitizeSizeFactorsDefaults is used.

Value

Numeric vector of length equal to size.factors, containing the sanitized size factors.

Author(s)

Aaron Lun

See Also

The sanitize_size_factors function in https://libscran.github.io/scran_norm/.

Examples

sf <- 2^rnorm(100)
sf[1] <- 0
sf[2] <- -1
sf[3] <- Inf
sf[4] <- NaN
sanitizeSizeFactors(sf)

Default parameters for sanitizeSizeFactors

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the sanitizeSizeFactors function signature.

Usage

sanitizeSizeFactorsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

sanitizeSizeFactorsDefaults()

Scale and combine multiple embeddings

Description

Scale multiple embeddings (usually derived from different modalities for the same cells) so that their within-population variances are comparable, and then combine them into a single embedding matrix for further analyses like clustering, t-SNE, etc. The aim is to equalize uninteresting variance across modalities so that high technical variance in one modality does not drown out interesting biology in another modality.

Usage

scaleByNeighbors(
  x,
  num.neighbors = NULL,
  block = NULL,
  block.weight.policy = NULL,
  variable.block.weight = NULL,
  num.threads = NULL,
  weights = NULL,
  BNPARAM = AnnoyParam()
)

Arguments

x

List of numeric matrices of principal components or other embeddings, one for each modality. For each entry, rows are dimensions and columns are cells. All entries should have the same number of columns but may have different numbers of rows.

num.neighbors

Integer scalar specifying the number of neighbors to use to define the scaling factor.

If NULL, the default value in scaleByNeighborsDefaults is used.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in x. If provided, the scaling factor is computed as a weighted average across blocks to ensure that block effects do not inflate the within-population variance. Alternatively NULL, if all cells are from the same block.

block.weight.policy

String specifying the policy to use for weighting different blocks when computing the average scaling factor. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in scaleByNeighborsDefaults is used.

This argument is only used if block is not NULL.

variable.block.weight

Numeric vector of length 2, specifying the parameters for variable block weighting. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in scaleByNeighborsDefaults is used.

This argument is only used if block is not NULL and block.weight.policy = "variable".

num.threads

Integer scalar specifying the number of threads to use.

If NULL, the default value in scaleByNeighborsDefaults is used.

weights

Numeric vector of length equal to that of x, specifying the weights to apply to each modality. Each value represents a multiplier of the within-population variance of its modality, i.e., larger values increase the contribution of that modality in the combined output matrix. NULL is equivalent to an all-1 vector, i.e., all modalities are scaled to have the same within-population variance.

BNPARAM

A BiocNeighborParam object specifying how to perform the neighbor search.

Value

List containing scaling, a vector of scaling factors to be aplied to each embedding; and combined, a numeric matrix creating by scaling each entry of x by scaling and then rbinding them together.

Author(s)

Aaron Lun

See Also

https://libscran.github.io/mumosa/, for the basis and caveats of this approach.

scaleByNeighbors.se, to combine embeddings in a SingleCellExperiment.

Examples

pcs <- list(
    gene = matrix(rnorm(10000), ncol=200),
    protein = matrix(rnorm(1000, sd=3), ncol=200),
    guide = matrix(rnorm(2000, sd=5), ncol=200)
)

out <- scaleByNeighbors(pcs)
out$scaling
dim(out$combined)

Scale and combine multiple embeddings in a SingleCellExperiment

Description

Scale embeddings for different modalities to equalize their intra-population variance, and combine them into a single embedding for downstream analysis. This calls scaleByNeighbors on the reduced dimensions of the main/alternative experiments in a SingleCellExperiment.

Usage

scaleByNeighbors.se(
  x,
  altexp.reddims,
  main.reddims = "PCA",
  num.neighbors = 20,
  block = NULL,
  BNPARAM = AnnoyParam(),
  num.threads = 1,
  more.scale.args = list(),
  output.name = "combined",
  meta.name = "combined",
  delayed.transpose = FALSE
)

Arguments

x

A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

altexp.reddims

Named list of character or integer vectors. Each entry is named after an alternative experiment. Each vector contains the names/indices of the reducedDim embeddings from that experiment to be combined.

main.reddims

Character or integer vector specifying the names/indices of the reducedDim entries from x to be combined.

num.neighbors

Number of neighbors used to define the scaling factor, passed to scaleByNeighbors.

block

Block assignment for each cell, passed to scaleByNeighbors.

BNPARAM

Algorithm for the nearest neighbor search, passed to scaleByNeighbors.

num.threads

Number of threads for the neighbor search, passed to scaleByNeighbors.

more.scale.args

Named list of additional arguments to pass to scaleByNeighbors.

output.name

String containing the name of the reducedDim entry in which to store the combined embeddings.

meta.name

String containing the name of the metadata entry in which to store additional metrics. If NULL, additional metrics are not stored.

delayed.transpose

Logical scalar indicating whether to delay the transposition when storing coordinates in the reducedDims.

Value

x is returned with the combined embeddings stored in its reducedDims. The scaling factors for all embeddings are stored in the metadata.

Author(s)

Aaron Lun

Examples

library(SingleCellExperiment)
sce <- getTestAdtData.se("pca")
sce <- scaleByNeighbors.se(sce, altexp.reddims=list(ADT="PCA"))
reducedDimNames(sce) 
metadata(sce)$combined

Default parameters for scaleByNeighbors

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the scaleByNeighbors function signature.

Usage

scaleByNeighborsDefaults(block = NULL)

Arguments

block

See the argument of the same name in scaleByNeighbors.

Value

Named list containing default values for various function arguments. These values may change depending on whether block is supplied.

Author(s)

Aaron Lun

Examples

scaleByNeighborsDefaults()

Score gene set activity for each cell

Description

Compute per-cell scores for a gene set, defined as the column sums of a rank-1 approximation to the submatrix for the gene set. This uses the same approach as the GSDecon package by Jason Hackney, adapted to use an approximate PCA (via IRLBA) and to support blocking.

Usage

scoreGeneSet(
  x,
  set,
  rank = NULL,
  scale = NULL,
  block = NULL,
  block.weight.policy = NULL,
  variable.block.weight = NULL,
  extra.work = NULL,
  iterations = NULL,
  tolerance = NULL,
  seed = NULL,
  realized = NULL,
  warn = TRUE,
  num.threads = NULL
)

Arguments

x

A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Typically, the matrix is expected to contain log-expression values.

set

Vector specifying the rows of x that belong to the gene set. This may be an integer vector of row indices, a logical vector of length equal to the number of rows, or a character vector of row names. For integer and character vectors, duplicate entries are ignored. For a character vector, any string not present in rownames(x) is ignored.

rank

Integer scalar specifying the rank of the approximation. The default value of 1 assumes that each gene set only describes a single coordinated biological function.

If NULL, the default value in scoreGeneSetDefaults is used.

scale

Logical scalar indicating whether to scale all genes to have the same variance. This ensures that each gene contributes equally to the PCA, favoring consistent variation across many genes rather than large variation in a few genes. If block is specified, each gene's variance is calculated as a weighted sum of the variances from each block. Genes with zero variance are ignored.

If NULL, the default value in scoreGeneSetDefaults is used.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in x. The PCA will be performed on the residuals after regressing out the block effect, ensuring that differences between block do not dominate the variation in the dataset. Alternatively NULL if all cells are from the same block.

block.weight.policy

String specifying the policy to use for weighting the contribution of different blocks to the PCA. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in scoreGeneSetDefaults is used.

This argument is only used if block is not NULL.

variable.block.weight

Numeric vector of length 2, specifying the parameters for variable block weighting. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in scoreGeneSetDefaults is used.

This argument is only used if block is not NULL and block.weight.policy = "variable".

extra.work

Integer scalar specifying the number of extra dimensions for the IRLBA workspace. Larger values improve accuracy at the cost of compute time.

If NULL, this defaults to the larger of 7 and number.

iterations

Integer scalar specifying the maximum number of restart iterations for IRLBA. Larger values improve accuracy at the cost of compute time.

If NULL, the default value in scoreGeneSetDefaults is used.

tolerance

Number specifying the tolerance on the approximation error of the singular triplets, to determine IRLBA convergence. Lower values improve accuracy at the cost of compute time.

If NULL, the default value in scoreGeneSetDefaults is used.

seed

Integer scalar specifying the seed for the initial random vector in IRLBA.

If NULL, the default value in scoreGeneSetDefaults is used.

realized

Logical scalar indicating whether to realize x into an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.

If NULL, the default value in scoreGeneSetDefaults is used.

warn

Boolean specifying whether a warning should be emitted if IRLBA failed to converge.

num.threads

Number of threads to use.

If NULL, the default value in scoreGeneSetDefaults is used.

Value

List containing:

  • scores, a numeric vector of per-cell scores for each column in x.

  • weights, a DataFrame containing row, an integer vector of ordered and unique row indices corresponding to the genes in set; and weight, a numeric vector of per-gene weights for each gene in row.

Author(s)

Aaron Lun

See Also

The compute and compute_blocked functions in https://libscran.github.io/gsdecon/.

scoreGeneSet.se, to compute gene set scores from a SummarizedExperiment.

Examples

library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))
normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x)))
scoreGeneSet(normed, set=c(1,3,5,10,20,100))

Score a gene set in a SummarizedExperiment

Description

Compute a gene set activity score for each cell based on the expression values of the genes in the set, by calling scoreGeneSet on an assay of a SummarizedExperiment.

Usage

scoreGeneSet.se(
  x,
  set,
  block = NULL,
  num.threads = 1,
  more.score.args = list(),
  assay.type = "logcounts"
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

set

Vector containing the gene set, see ?scoreGeneSet for details.

block

Block assignment for each cell, passed to scoreGeneSet.

num.threads

Number of threads for scoreGeneSet.

more.score.args

Named list of further arguments to pass to scoreGeneSet.

assay.type

Integer or string specifying the relevant assay in x, usually containing log-normalized expression values.

Value

List containing scores, a numeric vector of the gene set scores across all cells in x; and weights, a numeric vector of weights for all genes in set.

Author(s)

Aaron Lun

Examples

# Defining a gene set of oligodendrocyte genes.
library(org.Mm.eg.db)
oligo.set <- select(org.Mm.eg.db, keytype="GO", keys="GO:0048709", columns="SYMBOL")
oligo.set <- unique(oligo.set$SYMBOL)

sce <- getTestRnaData.se("norm")
oligo.scores <- scoreGeneSet.se(sce, oligo.set)
summary(oligo.scores$scores)

Default parameters for scoreGeneSet

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the scoreGeneSet function signature.

Usage

scoreGeneSetDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

scoreGeneSetDefaults()

Score marker genes

Description

Score marker genes for each group using a variety of effect sizes from pairwise comparisons between groups. This includes Cohen's d, the area under the curve (AUC), the difference in the means (delta-mean) and the difference in the proportion of detected cells (delta-detected). For each group, the strongest markers are those genes with the largest effect sizes (i.e., upregulated) when compared to all other groups.

Usage

scoreMarkers(
  x,
  groups,
  block = NULL,
  block.average.policy = NULL,
  block.weight.policy = NULL,
  variable.block.weight = NULL,
  block.quantile = NULL,
  compute.group.mean = NULL,
  compute.group.detected = NULL,
  compute.delta.mean = NULL,
  compute.delta.detected = NULL,
  compute.cohens.d = NULL,
  compute.auc = NULL,
  compute.summary.min = NULL,
  compute.summary.mean = NULL,
  compute.summary.median = NULL,
  compute.summary.max = NULL,
  compute.summary.quantiles = NULL,
  compute.summary.min.rank = NULL,
  threshold = NULL,
  all.pairwise = FALSE,
  top.index.only = FALSE,
  min.rank.limit = NULL,
  num.threads = NULL
)

Arguments

x

A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., from normalizeCounts.

groups

A vector specifying the group assignment for each cell in x.

block

Factor specifying the block of origin (e.g., batch, sample) for each cell in x. If provided, comparisons are performed within each block to ensure that block effects do not confound the estimates. The weighted average of the effect sizes across all blocks is reported for each gene. Alternatively NULL, if all cells are from the same block.

block.average.policy

String specifying the policy to use for average statistics across blocks. This can either be a (weighted) "mean" or a "quantile".

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if block is not NULL.

block.weight.policy

String specifying the policy to use for weighting different blocks when computing the average for each statistic. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if block is not NULL.

variable.block.weight

Numeric vector of length 2, specifying the parameters for variable block weighting. See the argument of the same name in computeBlockWeights for more detail.

If NULL, the default value in scoreMarkersDefaults is used.

THis arugment is only used if block is not NULL and block.weight.policy = "variable".

block.quantile

Number specifying the probability of the quantile of statistics across blocks. Defaults to 0.5, i.e., the median of per-block statistics.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if block is not NULL and block.average.policy="quantile".

compute.group.mean

Boolean indicating whether to compute the group-wise mean expression for each gene.

If NULL, the default value in scoreMarkersDefaults is used.

compute.group.detected

Boolean indicating whether to compute the group-wise proportion of detected cells for each gene.

If NULL, the default value in scoreMarkersDefaults is used.

compute.delta.mean

Boolean indicating whether to compute the delta-means, i.e., the log-fold change when x contains log-expression values.

If NULL, the default value in scoreMarkersDefaults is used.

compute.delta.detected

Boolean indicating whether to compute the delta-detected, i.e., differences in the proportion of cells with detected expression.

If NULL, the default value in scoreMarkersDefaults is used.

compute.cohens.d

Boolean indicating whether to compute Cohen's d.

If NULL, the default value in scoreMarkersDefaults is used.

compute.auc

Boolean indicating whether to compute the AUC. Setting this to FALSE can improve speed and memory efficiency.

If NULL, the default value in scoreMarkersDefaults is used.

compute.summary.min

Boolean specifying whether to compute the minimum as a summary statistic for each effect size.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if all.pairwise=FALSE.

compute.summary.mean

Boolean specifying whether to compute the mean as a summary statistic for each effect size.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if all.pairwise=FALSE.

compute.summary.median

Boolean specifying whether to compute the median as a summary statistic for each effect size.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if all.pairwise=FALSE.

compute.summary.max

Boolean specifying whether to compute the maximum as a summary statistic for each effect size.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if all.pairwise=FALSE.

compute.summary.quantiles

Numeric vector containing the probabilities of quantiles to compute as summary statistics for each effect size.

If NULL, no summary quantiles are computed.

This argument is only used if all.pairwise=FALSE.

compute.summary.min.rank

Boolean specifying whether to compute the mininum rank as a summary statistic for each effect size.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if all.pairwise=FALSE.

threshold

Non-negative number specifying the minimum threshold on the differences in means (i.e., the log-fold change, if x contains log-expression values). This is incorporated into the effect sizes for Cohen's d and the AUC. Larger thresholds will favor genes with large differences at the expense of genes with low variance that would otherwise have comparable effect sizes.

If NULL, the default value in scoreMarkersDefaults is used.

all.pairwise

Boolean indicating whether to report the effect sizes for every pairwise comparison between groups. Alternatively, an integer indicating the number of top markers to report from each pairwise comparison between groups. If FALSE, only the summary statistics are reported.

top.index.only

Boolean indicating whether to only report the indices of the top genes when all.pairwise is an integer. This is more efficient when the effect sizes of the top genes are not required.

This argument is ignored for all other values of all.pairwise.

min.rank.limit

Integer specifying the maximum value of the min-rank to report. Lower values improve memory efficiency at the cost of discarding information about lower-ranked genes.

If NULL, the default value in scoreMarkersDefaults is used.

This argument is only used if all.pairwise=FALSE.

num.threads

Integer specifying the number of threads to use.

If NULL, the default value in scoreMarkersDefaults is used.

Value

A named list containing:

  • nrow, integer specifying the number of rows in x.

  • row.names, character vector or NULL containing the row names of x.

  • group.ids, vector contaning the identities of the unique groups.

  • mean, a numeric matrix containing the mean expression for each group. Each row is a gene and each column is a group in group.ids. Omitted if compute.group.mean=FALSE.

  • detected, a numeric matrix containing the proportion of detected cells in each group. Each row is a gene and each column is a group in group.ids. Omitted if compute.group.detected=FALSE.

If all.pairwise=FALSE, the list also contains:

  • cohens.d, a list of DataFrames where each DataFrame corresponds to a group in group.ids. Each row of a DataFrame represents a gene, while each column contains a summary of Cohen's d from pairwise comparisons to all other groups. This includes min, mean, median, max, quantile.* and min.rank - check out ?summarizeEffects for details. Omitted if compute.cohens.d=FALSE.

  • auc, a list like cohens.d but containing the summaries of the AUCs from each pairwise comparison. Omitted if compute.auc=FALSE.

  • delta.mean, a list like cohens.d but containing the summaries of the delta-mean from each pairwise comparison. Omitted if compute.delta.mean=FALSE.

  • delta.detected, a list like cohens.d but containing the summaries of the delta-detected from each pairwise comparison. Omitted if compute.delta.detected=FALSE.

If all.pairwise=TRUE, the list also contains:

  • cohens.d, a 3-dimensional numeric array containing the Cohen's d from each pairwise comparison between groups. The extents of the first two dimensions are equal to the number of groups in group.ids, while the extent of the final dimension is equal to the number of genes. The entry cohens.d[i, j, k] represents Cohen's d from the comparison of group group.ids[j] over group group.ids[i] for gene k. Omitted if compute.cohens.d=FALSE.

  • auc, an array like cohens.d but containing the AUCs from each pairwise comparison. Omitted if compute.auc=FALSE.

  • delta.mean, an array like cohens.d but containing the delta-mean from each pairwise comparison. Omitted if compute.delta.mean=FALSE.

  • delta.detected, an array like cohens.d but containing the delta-detected from each pairwise comparison. Omitted if compute.delta.detected=FALSE.

If all.pairwise is an integer and top.index.only=FALSE, the list also contains:

  • cohens.d, a list of list of DataFrames containing the top genes with the largest Cohen's d for each pairwise comparison. Specifically, cohens.d[[i]][[j]] is a DataFrame that contains the top all.pairwise genes from the comparison of group group.ids[i] over group group.ids[j]. Each DataFrame contains an index column, the row index of the gene; and an effect column, the Cohen's d for that gene. Omitted if compute.cohens.d=FALSE.

  • auc, a list of list of DataFrames like cohens.d but containing the AUCs from each pairwise comparison. Omitted if compute.auc=FALSE.

  • delta.mean, a list of list of DataFrames like cohens.d but containing the delta-mean from each pairwise comparison. Omitted if compute.delta.mean=FALSE.

  • delta.detected, a list of list of DataFrames like cohens.d but containing the delta-detected from each pairwise comparison. Omitted if compute.delta.detected=FALSE.

If all.pairwise is an integer and top.index.only=TRUE, the list also contains:

  • cohens.d, a list of list of integer vectors containing the row indices of the top genes with the largest Cohen's d for each pairwise comparison. Specifically, cohens.d[[i]][[j]] is a vector that contains the top all.pairwise genes from the comparison of group group.ids[i] over group group.ids[j]. Omitted if compute.cohens.d=FALSE.

  • auc, a list of list of DataFrames like cohens.d but containing the AUCs from each pairwise comparison. Omitted if compute.auc=FALSE.

  • delta.mean, a list of list of DataFrames like cohens.d but containing the delta-mean from each pairwise comparison. Omitted if compute.delta.mean=FALSE.

  • delta.detected, a list of list of DataFrames like cohens.d but containing the delta-detected from each pairwise comparison. Omitted if compute.delta.detected=FALSE.

Choice of effect size

The delta-mean is the difference in the mean expression between groups. This is fairly straightforward to interpret - a positive delta-mean corresponds to increased expression in the first group compared to the second. The delta-mean can also be treated as the log-fold change if the input matrix contains log-transformed normalized expression values.

The delta-detected is the difference in the proportion of cells with detected expression between groups. This lies between 1 and -1, with the extremes occurring when a gene is silent in one group and detected in all cells of the other group. For this interpretation, we assume that the input matrix contains non-negative expression values, where a value of zero corresponds to lack of detectable expression.

Cohen's d is the standardized difference between two groups. This is defined as the difference in the mean for each group scaled by the average standard deviation across the two groups. (Technically, we should use the pooled variance; however, this introduces some unintuitive asymmetry depending on the variance of the larger group, so we take a simple average instead.) A positive value indicates that the gene has increased expression in the first group compared to the second. Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups. We can also interpret Cohen's d as the number of standard deviations between the two group means.

The area under the curve (AUC) is the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group. Values greater than 0.5 indicate that a gene is upregulated in the first group. The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test. The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group, e.g., if two distributions exhibit no overlap, the AUC is the same regardless of the variance of each distribution. This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a fine-grained ranking.

With a minimum change threshold

Setting a minimum change threshold (i.e., threshold) prioritizes genes with large shifts in expression instead of those with low variances. Currently, only positive thresholds are supported, which focuses on genes that are upregulated in the first group compared to the second. The effect size definitions are generalized when testing against a non-zero threshold:

  • Cohen's d is redefined as the standardized difference between the difference in means and the specified threshold, analogous to the TREAT method from the limma package. Large positive values are only obtained when the observed difference in means is significantly greater than the threshold. For example, if we had a threshold of 2 and we obtained a Cohen's d of 3, this means that the observed difference in means was 3 standard deviations greater than 2. Note that a negative Cohen's d cannot be intepreted as downregulation, as the difference in means may still be positive but less than the threshold.

  • The AUC is generalized to the probability of obtaining a random observation in one group that is greater than a random observation plus the threshold in the other group. For example, if we had a threshold of 2 and we obtained an AUC of 0.8, this means that, 80 the random observation from the first group would be greater than a random observation from the second group by 2 or more. Again, AUCs below 0.5 cannot be interpreted as downregulation, as it may be caused by a positive shift that is less than the threshold.

See Also

The score_markers_summary, score_markers_pairwise and score_markers_best functions in https://libscran.github.io/scran_markers/. See their blocked equivalents (e.g., score_markers_summary_blocked) when block is specified.

summarizeEffects, to summarize the pairwise effects returned when all.pairwise=TRUE.

reportGroupMarkerStatistics, to consolidate the statistics for a single group into its own data frame.

scoreMarkers.se, to score markers from a SummarizedExperiment.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))
normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x)))

# Compute marker summaries for each group:
g <- sample(letters[1:4], ncol(x), replace=TRUE)
scores <- scoreMarkers(normed, g)
names(scores)
head(scores$mean)
head(scores$cohens.d[["a"]])

# Report marker statistics for a single group:
reportGroupMarkerStatistics(scores, "b")

Score marker genes in a SummarizedExperiment

Description

Identify candidate marker genes based on effect sizes from pairwise comparisons between groups of cells, by calling scoreMarkers on an assay of a SummarizedExperiment.

Usage

scoreMarkers.se(
  x,
  groups,
  block = NULL,
  num.threads = 1,
  more.marker.args = list(),
  assay.type = "logcounts",
  extra.columns = NULL,
  order.by = TRUE
)

formatScoreMarkersResult(marker.res, extra.columns = NULL, order.by = TRUE)

previewMarkers(
  marker.df,
  columns = c("mean", "detected", lfc = "delta.mean.mean"),
  pre.columns = NULL,
  post.columns = NULL,
  rows = 10,
  order.by = NULL,
  include.order.by = !is.null(order.by)
)

Arguments

x

A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

groups

Group assignment for each cell, passed to scoreMarkers.

block

Block assignment for each cell, passed to scoreMarkers.

num.threads

Number of threads for marker scoring, passed to scoreMarkers.

more.marker.args

Named list of additional arguments to pass to scoreMarkers.

assay.type

Integer or string specifying the assay to use for differential comparisons, usually containing log-normalized expression values.

extra.columns

DataFrame containing extra columns to add each DataFrame. This should have the same number of rows as x. For scoreMarkers.se, this may also be a character vector specifying the columns of rowData to be added.

order.by

String specifying the column to order each DataFrame by. Alternatively TRUE, a column is automatically chosen from the effect size summaries. If NULL or FALSE, no ordering is performed.

marker.res

List containing the result of scoreMarkers.

marker.df

DataFrame containing the marker statistics for a single group.

columns

Character vector of the names of columns to retain in the preview. This may be named, in which the names are used as the column names.

pre.columns, post.columns

Character vector of the names of additional columns to retain in the preview. These are added before or after the columns in columns, for pre.columns and post.columns respectively.

rows

Integer specifying the number of rows to show. If NULL, all rows are returned.

include.order.by

Boolean indicating whether the column specified by order.by should be included in the output DataFrame. A string may also be supplied and will be treated as TRUE; the value of the string will be used as the column name in the output DataFrame.

Value

For scoreMarkers.se and formatScoreMarkersResult, a List of DataFrames is returned. Each DataFrame corresponds to a unique group in groups. Each row contains statistics for a gene in x, with the following columns:

  • mean, the mean expression in the current group.

  • detected, the proportion of cells with detected expression in the current group.

  • <effect>.<summary>, a summary statistic for an effect size, e.g., cohens.d.mean contains the mean Cohen's d across comparisons involving the current group.

For previewMarkers, a DataFrame is returned containing the specified columns and rows.

Author(s)

Aaron Lun

Examples

sce <- getTestRnaData.se("cluster")
markers <- scoreMarkers.se(sce, sce$clusters)
previewMarkers(markers[["1"]])

Default parameters for scoreMarkers

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the scoreMarkers function signature.

Usage

scoreMarkersDefaults(all.pairwise = FALSE)

Arguments

all.pairwise

See the argument of the same name in scoreMarkers.

Value

Named list containing default values for various function arguments. These defaults may change depending on the values for all.pairwise.

Examples

scoreMarkersDefaults()

Subsample cells based on their neighbors

Description

Subsample a dataset by selecting cells to represent all of their nearest neighbors. The aim is to preserve the relative density of the original dataset while guaranteeing representation of low-frequency subpopulations.

Usage

subsampleByNeighbors(
  x,
  num.neighbors = NULL,
  min.remaining = NULL,
  num.threads = NULL,
  BNPARAM = AnnoyParam()
)

Arguments

x

A numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., runPca.

Alternatively, an index constructed by buildIndex.

Alternatively, a list containing existing nearest-neighbor search results. This should contain:

  • index, an integer matrix where rows are neighbors and columns are cells. Each column contains 1-based indices for the nearest neighbors of the corresponding cell, ordered by increasing distance.

  • distance, a numeric matrix of the same dimensions as index, containing the distances to each of the nearest neighbors.

The number of neighbors should be equal to num.neighbors, otherwise a warning is raised.

num.neighbors

Integer scalar specifying the number of neighbors to use. Larger values result in stronger downsampling.

If NULL, the default value in subsampleByNeighborsDefaults is used.

This argument is ignored if x contains pre-computed neighbor search results.

min.remaining

Integer specifying the minimum number of remaining neighbors that a cell must have in order to be considered for selection. This should be less than or equal to num.neighbors. Larger values result in stronger downsampling.

If NULL, the default value in subsampleByNeighborsDefaults is used.

num.threads

Integer scalar specifying the number of threads to use for the nearest-neighbor search.

If NULL, the default value in subsampleByNeighborsDefaults is used.

This argument is only used if x does not contain existing nearest-neighbor results.

BNPARAM

A BiocNeighborParam object specifying the algorithm to use.

This argument is only used if x does not contain existing nearest-neighbor results.

Details

Starting from the densest region in the high-dimensional space, we select an observation for inclusion into the subsampled dataset. Every time we select an observation, we remove it and all of its nearest neighbors from the dataset. We then select the next observation with the most remaining neighbors, with ties broken by density; this is repeated until there are no more observations.

The premise is that each selected observation serves as a representative for its nearest neighbors. This ensures that the subsampled points are well-distributed across the original dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. We also preserve the relative density of the original dataset as more representatives will be generated from high-density regions.

Value

Integer vector with the indices of the selected cells in the subsample.

Author(s)

Aaron Lun

See Also

https://libscran.github.io/nenesub/, for more details on the underlying algorithm.

Examples

x <- matrix(rnorm(10000), nrow=2)
keep <- subsampleByNeighbors(x, 10)
plot(x[1,], x[2,])
points(x[1,keep], x[2,keep], col="red")
legend('topright', col=c('black', 'red'), legend=c('all', 'subsample'), pch=1)

Default parameters for subsampleByNeighbors

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the subsampleByNeighbors function signature.

Usage

subsampleByNeighborsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

subsampleByNeighborsDefaults()

Subsample by partition

Description

Subsample from a partitioning of cells, preserving the frequency of cells across partitions. The subsample is typically used as a smaller representative dataset that still exhibits the same distribution of cells as the full dataset.

Usage

subsampleByPartition(partitions, number, seed = NULL, force.non.empty = NULL)

Arguments

partitions

Factor, integer vector or character vector of length equal to the number of cells, containing the assignment of each cell to a partition (e.g., clustering).

number

Integer specifying the number of cells to retain in the subsample. If greater than the length of partitions, all cells are retained.

seed

Integer specifying the seed for the random number generator.

If NULL, the default value in subsampleByPartitionDefaults is used.

force.non.empty

Boolean indicating whether each partition should have at least one cell in the subsample. If FALSE, partitions may not be represented if the number of cells is less than the ratio of number to length(partitions).

If NULL, the default value in subsampleByPartitionDefaults is used.

Value

Integer vector containing the indices of the cells retained in the subsample.

Author(s)

Aaron Lun

See Also

subsampleByNeighbors, for an alternative downsampling strategy.

Examples

x <- matrix(rnorm(10000), nrow=5)
partitions <- clusterKmeans(x, k=20)
keep <- subsampleByPartition(partitions$clusters, 100)
table(partitions$clusters[keep])
sub.x <- x[,keep] # subsample to be used for expensive downstream steps.

# Rare partitions are retained by default:
grouping <- c(LETTERS[1:10], sample(LETTERS[11:26], 990, replace=TRUE))
keep <- subsampleByPartition(grouping, 100)
table(grouping[keep])
keep <- subsampleByPartition(grouping, 100, force.non.empty=FALSE)
table(grouping[keep])

Default parameters for subsampleByPartition

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the subsampleByPartition function signature.

Usage

subsampleByPartitionDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

subsampleByPartitionDefaults()

Summarize pairwise effect sizes for each group

Description

For each group, summarize the effect sizes for all pairwise comparisons to other groups. This yields a set of summary statistics that can be used to rank marker genes for each group.

Usage

summarizeEffects(
  effects,
  compute.summary.min = NULL,
  compute.summary.mean = NULL,
  compute.summary.median = NULL,
  compute.summary.max = NULL,
  compute.summary.quantiles = NULL,
  compute.summary.min.rank = NULL,
  num.threads = NULL
)

Arguments

effects

A 3-dimensional numeric containing the effect sizes from each pairwise comparison between groups. The extents of the first two dimensions are equal to the number of groups, while the extent of the final dimension is equal to the number of genes. The entry [i, j, k] represents the effect size from the comparison of group j against group i for gene k. See also the output of scoreMarkers with all.pairwise=TRUE.

compute.summary.min

Boolean specifying whether to compute the minimum as a summary statistic.

If NULL, the default value in summarizeEffectsDefaults is used.

compute.summary.mean

Boolean specifying whether to compute the mean as a summary statistic.

If NULL, the default value in summarizeEffectsDefaults is used.

compute.summary.median

Boolean specifying whether to compute the median as a summary statistic.

If NULL, the default value in summarizeEffectsDefaults is used.

compute.summary.max

Boolean specifying whether to compute the maximum as a summary statistic.

If NULL, the default value in summarizeEffectsDefaults is used.

compute.summary.quantiles

Numeric scalars containing the probabilities of quantiles to compute as summary statistics.

If NULL, no quantiles are computed.

compute.summary.min.rank

Boolean specifying whether to compute the mininum rank as a summary statistic.

If NULL, the default value in summarizeEffectsDefaults is used.

num.threads

Integer scalar specifying the number of threads to use.

If NULL, the default value in summarizeEffectsDefaults is used.

Details

Each summary statistic can be used to prioritize different sets of marker genes for the group of interest, by ranking them in decreasing order according to said statistic:

  • min contains the minimum effect size across all comparisons involving the group of interest. Genes with large values are upregulated in all comparisons. As such, it is the most stringent summary as markers will only have large values if they are uniquely upregulated in the group of interest compared to every other group.

  • mean contains the mean effect size across all comparisons involving the group of interest. Genes with large values are upregulated on average compared to the other groups. This is a good general-purpose summary statistic.

  • median contains the median effect size across all comparisons involving the group of interest. Genes with large values are upregulated compared to most (i.e., at least 50 Compared to the mean, this is more robust to outlier effects but less sensitive to strong effects in a minority of comparisons.

  • max contains the maximum effect size across all comparisons involving the group of interest. Using this to define markers will focus on genes that are upregulated in at least one comparison. As such, it is the least stringent summary as markers can achieve large values if they are upregulated in the group of interest compared to any one other group.

  • quantile[[P]] contains the quantile P across all comparisons involving the group of interest. This is a generalization of the minimum, median and maximum for arbitrary quantile probabilities. For example, a large quantile[["20"]] would mean that the gene is upregulated in the group of interest compared to 80

The exact definition of “large” depends on the choice of effect size. For signed effects like Cohen's d, delta-mean and delta-detected, the value must be positive to be considered “large”. For the AUC, a value greater than 0.5 is considered “large”. This interpretation is also affected by the choice of threshold used to compute each effect size in scoreMarkers, e.g., a negative Cohen's d cannot be interpreted as downregulation when the threshold is positive.

The min.rank is a more exotic summary statistic, containing the minimum rank for each gene across all comparisons involving the group of interest. This is defined by ranking the effect sizes across genes within each comparison, and then taking the minimum of these ranks across comparisons. Taking all genes with min.rank <= T will yield a set containing the top T genes from each comparison. The idea is to ensure that there are at least T genes that can distinguish the group of interest from any other group.

NaN effect sizes are allowed, e.g., if two groups do not exist in the same block for a blocked analysis in scoreMarkers with block. This function will ignore NaN values when computing each summary. If all effects are NaN for a particular group, the summary statistic will also be NaN.

Value

List of DataFrames containing summary statistics for the effect sizes. Each DataFrame corresponds to a group, each row corresponds to a gene, and each column contains a summary statistic. If compute.summary.quantiles is provided, the "quantile" column is a nested DataFrame where each column coresponds to a probability in compute.summary.quantiles.

Author(s)

Aaron Lun

See Also

The summarize_effects function in https://libscran.github.io/scran_markers/.

scoreMarkers, to compute the pairwise effects in the first place.

Examples

# Mocking a matrix:
library(Matrix)
x <- round(abs(rsparsematrix(1000, 100, 0.1) * 100))
normed <- normalizeCounts(x, size.factors=centerSizeFactors(colSums(x)))

g <- sample(letters[1:4], ncol(x), replace=TRUE)
effects <- scoreMarkers(normed, g, all.pairwise=TRUE)

summarized <- summarizeEffects(effects$cohens.d)
str(summarized)

Default parameters for summarizeEffectsDefaults

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the summarizeEffects function signature.

Usage

summarizeEffectsDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

summarizeEffectsDefaults()

Test for gene set enrichment

Description

Perform a hypergeometric test for enrichment of gene sets in a list of interesting genes (e.g., markers).

Usage

testEnrichment(x, sets, universe = NULL, log = NULL, num.threads = 1)

Arguments

x

Vector of identifiers for some interesting genes, e.g., symbols or Ensembl IDs. This is usually derived from a selection of top markers, e.g., from scoreMarkers.

sets

List of vectors of identifiers for the pre-defined gene sets. Each inner vector corresponds to a gene set and should contain the same type of identifiers as x.

universe

Vector of identifiers for the universe of genes in the dataset. x and each vector in sets will be subsetted to only include those genes in universe. If NULL, the universe is defined as the union of all genes in x and sets.

Alternatively, an integer scalar specifying the number of genes in the universe. This is assumed to be greater than or equal to the number of unique genes in x and sets.

log

Logical scalar indicating whether to report log-transformed p-values. This may be desirable to avoid underflow at near-zero p-values.

If NULL, the default value in testEnrichmentDefaults is used.

num.threads

Integer scalar specifying the number of threads to use.

Value

DataFrame with one row per gene set and the following columns:

  • overlap, the overlap between x and each entry of sets, i.e., the number of genes in the intersection.

  • size, the set of each entry of sets.

  • p.value, the (possibly log-transformed) p-value for overrepresentation of the gene set in x.

Author(s)

Aaron Lun

See Also

phyper and https://libscran.github.io/phyper/, which is the basis for the underlying calculation.

Examples

testEnrichment(
    x=LETTERS[1:5], 
    sets=list(
        first=LETTERS[1:10],
        second=LETTERS[1:5 * 2],
        third=LETTERS[10:20]
    ),
    universe=LETTERS
)

Default parameters for testEnrichment

Description

Default parameters from the underlying C++ library. These may be overridden by defaults in the testEnrichment function signature.

Usage

testEnrichmentDefaults()

Value

Named list containing default values for various function arguments.

Author(s)

Aaron Lun

Examples

testEnrichmentDefaults()