| Title: | Signature scoring for single cell analysis |
|---|---|
| Description: | This package provides functions used in Seqtometry (Kousnetsov et al. 2024), a method for analyzing single cell (scRNA-seq or scATAC-seq) data via signature (gene set) enrichment scores. The Seqtometry scores may be useful for annotating or characterizing cells, either in a flow cytometry like workflow (where scores are standalone features used for progressive partitoning as described in the Seqtometry publication) or in a cluster-based workflow (as features of clusters). The exported impute function (a port of Python's MAGIC-impute, van Dijk et al. 2018), may also be useful for single cell analysis on its own. |
| Authors: | Robert Kousnetsov [aut, cre], Daniel Hawiger [cph, fnd] |
| Maintainer: | Robert Kousnetsov <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-05-30 09:48:39 UTC |
| Source: | https://github.com/bioc/Seqtometry |
This package provides functions used in Seqtometry (Kousnetsov et al. 2024), a method for analyzing single cell (scRNA-seq or scATAC-seq) data via signature (gene set) enrichment scores. The Seqtometry scores may be useful for annotating or characterizing cells, either in a flow cytometry like workflow (where scores are standalone features used for progressive partitoning as demonstrated in the Seqtometry publication) or in a cluster-based workflow (as features of clusters). The exported impute function (a port of Python's MAGIC-impute, van Dijk et al. 2018), may also be useful for single cell analysis on its own.
Robert Kousnetsov <[email protected]>
Maintainer: Robert Kousnetsov <[email protected]>
Apply diffusion operator to data in order to perform imputation.
.apply_diff_op(gex, pcs, aff, dft, t_max, tol, exact_solver).apply_diff_op(gex, pcs, aff, dft, t_max, tol, exact_solver)
gex |
matrix or Matrix Gene expression matrix |
pcs |
matrix Principal components matrix |
aff |
dgCMatrix Markov affinity matrix |
dft |
NULL or integer(1) Diffusion time |
t_max |
integer(1) Maximum diffusion time |
tol |
numeric(1) Tolerance for Procrustes disparity |
exact_solver |
logical(1) Perform imputation in gene space |
list(matrix, integer(1)) Imputed matrix and diffusion time used
Calculate graph diffusion operator (Markov affinity matrix).
.calc_diff_op(pcs, knn, ka, dist_metric).calc_diff_op(pcs, knn, ka, dist_metric)
pcs |
matrix Principal components matrix (used for kNN search) |
knn |
integer(1) Number of nearest neighbors to search for |
ka |
integer(1) Number of nearest neighbors to use for adaptive kernel |
dist_metric |
character(1) Type of metric to use for distance calculations during kNN search |
dgCMatrix Markov affinity matrix
Calculate leading principal components (via truncated singular value decomposition).
.calc_pca(gex, npc, scale).calc_pca(gex, npc, scale)
gex |
matrix or Matrix Gene expression matrix (without any zero variance genes) |
npc |
numeric(1) Number of leading principal components to compute |
scale |
logical(1) Whether to scale genes to unit variance |
list PC loading/rotation matrices as well as centering/scaling vectors
Checks that all parameters used in impute are valid.
.check_params(args).check_params(args)
args |
list The arguments to the magic_impute function |
NULL (but stops execution for invalid parameters)
For converting from character to integer based indexing
.gene_indices(mat, gss).gene_indices(mat, gss)
mat |
matrix-like Gene expression data (genes x cells) |
gss |
named list of character Signature genes (with same nomenclature system as |
integer 0-based indices (for passing to Rcpp function) of signature genes
Reverses operations done for PCA: back-rotation, unscaling, and uncentering.
.invert_pca(pcs, rot, ctr, sdv, low_mem).invert_pca(pcs, rot, ctr, sdv, low_mem)
pcs |
matrix The principal components (scaled left singular vectors) |
rot |
matrix The rotation matrix (right singular vectors) |
ctr |
integer The centering vector |
sdv |
integer or NULL The scaling vector (or NULL if no scaling was applied) |
low_mem |
logical(1) Whether to use delayed operations to reduce memory usage |
matrix or DelayedMatrix rot %*% t(pcs) * sdv + ctr
Scales input vector to unit range
.minmax_scale(x).minmax_scale(x)
x |
numeric Values to be scaled |
numeric Minmax transformed values
Simple normalization method for scRNA-seq data.
.normalize(gex).normalize(gex)
gex |
matrix or Matrix Gene expression matrix (cells x genes) |
matrix or Matrix Transformed (normalized) matrix
Calculates symmetric Procrustes distance (adapted from MATLAB procrustes).
.procrustes(x, y).procrustes(x, y)
x |
matrix |
y |
matrix |
numeric(1) Procrustes disparity between input matrices
Calculates a graph diffusion operator for the given input matrix and applies it to produce an imputed matrix.
impute( gex, transpose = TRUE, do_norm = FALSE, pca = NULL, npc = 100L, scale = TRUE, knn = 16L, ka = 6L, dist_metric = "euclidean", dft = NULL, t_max = 16L, tol = 0.001, exact_solver = TRUE, conserve_memory = FALSE, env_ret = FALSE, verbose = FALSE )impute( gex, transpose = TRUE, do_norm = FALSE, pca = NULL, npc = 100L, scale = TRUE, knn = 16L, ka = 6L, dist_metric = "euclidean", dft = NULL, t_max = 16L, tol = 0.001, exact_solver = TRUE, conserve_memory = FALSE, env_ret = FALSE, verbose = FALSE )
gex |
matrix or Matrix Gene expression values (that has passed quality control). |
transpose |
logical(1) Whether to transpose gex (make it cells x genes) prior to downstream operations. |
do_norm |
logical(1) Whether to perform LogCP10K normalization on gex. |
pca |
matrix (cells x PCs) or NULL Precomputed principal component matrix (or NULL to derive it from gex). |
npc |
integer(1) Number of principal components (min = 1) to calculate. |
scale |
logical(1) Whether to scale columns of input matrix to unit variance prior to PCA. |
knn |
integer(1) Number of nearest neighbors (min = 2) to consider during distance calculation. |
ka |
integer(1) Number of nearest neighbors (min = 2, max <= knn) to use for the adaptive kernel. |
dist_metric |
character(1) Type of metric to use for distance calculations during kNN search. |
dft |
NULL or integer(1) Automatic (NULL) or user-defined (integer) diffusion time (min = 1, max = 16). |
t_max |
integer(1) Maximum diffusion time to test when using automatic diffusion time (min = 1, max = 16). |
tol |
numeric(1) Threshold for Procrustes disparity (min = 0, max = 1) between successive diffusion times. |
exact_solver |
logical(1) Whether to perform imputation in gene space (TRUE) or PCA space (FALSE). |
conserve_memory |
logical(1)
Whether to avoid allocating a large dense matrix when |
env_ret |
logical(1) Return all variables in the environment (TRUE) or just the imputed matrix (FALSE). |
verbose |
logical(1) Whether to print messages at different major parts of the algorithm. |
matrix-like or list
If env_ret = FALSE, then just the imputed matrix.
Otherwise the function environment as a list containing all parameters (possibly modified) as well as
imp matrix or DelayedMatrix Imputed matrix.
aff dgCMatrix Markov affinity matrix (graph diffusion operator).
pca list Possibly computed (if pca was NULL), yielding a four element list, where:
x matrix (cells x PCs) The principal components matrix (scaled left singular vectors).
v matrix (genes x PCs) The rotation matrix (right singular vectors).
center integer (cells) The centering vector.
scale integer (cells) or NULL The scaling vector (or NULL if no scaling was applied).
box::use( TENxPBMCData[TENxPBMCData], SingleCellExperiment[rowData, logcounts], scuttle[quickPerCellQC, logNormCounts], scater[runUMAP, plotReducedDim], patchwork[wrap_plots]) # PBMC data, basic processing pipeline dat <- TENxPBMCData(dataset = "pbmc3k") dimnames(dat) <- list( rowData(dat)[["Symbol_TENx"]], dat[["Barcode"]]) dat <- dat |> quickPerCellQC() |> logNormCounts() |> runUMAP() # MAGIC imputation imp <- logcounts(dat) |> as("dgCMatrix") |> impute() # Visualize unimputed versus imputed expression # on UMAP plots for a gene of interest (GOI) goi <- "CD19" dat[["Imputed_GOI"]] <- imp[goi, ] p1 <- plotReducedDim(dat, "UMAP", color_by = goi) p2 <- plotReducedDim(dat, "UMAP", color_by = "Imputed_GOI") wrap_plots(p1, p2, ncol = 2)box::use( TENxPBMCData[TENxPBMCData], SingleCellExperiment[rowData, logcounts], scuttle[quickPerCellQC, logNormCounts], scater[runUMAP, plotReducedDim], patchwork[wrap_plots]) # PBMC data, basic processing pipeline dat <- TENxPBMCData(dataset = "pbmc3k") dimnames(dat) <- list( rowData(dat)[["Symbol_TENx"]], dat[["Barcode"]]) dat <- dat |> quickPerCellQC() |> logNormCounts() |> runUMAP() # MAGIC imputation imp <- logcounts(dat) |> as("dgCMatrix") |> impute() # Visualize unimputed versus imputed expression # on UMAP plots for a gene of interest (GOI) goi <- "CD19" dat[["Imputed_GOI"]] <- imp[goi, ] p1 <- plotReducedDim(dat, "UMAP", color_by = goi) p2 <- plotReducedDim(dat, "UMAP", color_by = "Imputed_GOI") wrap_plots(p1, p2, ncol = 2)
Computes signature scores (a weighted KS-like statistic) for single cell expression data
score(mat, signatures, minmax = TRUE)score(mat, signatures, minmax = TRUE)
mat |
matrix, Matrix, or DelayedMatrix Gene expression data (genes x cells) |
signatures |
named list of character
Signature genes (with same nomenclature system used in |
minmax |
logical(1) Whether to perform minmax transform on scoring results (default: TRUE) |
data.table Single cell scores (cells x signatures) for each signature, where cell barcodes are stored in the "id" column
box::use( TENxPBMCData[TENxPBMCData], SingleCellExperiment[rowData, logcounts], scuttle[quickPerCellQC, logNormCounts], scater[runUMAP, plotReducedDim], patchwork[wrap_plots]) # PBMC data, basic processing pipeline dat <- TENxPBMCData(dataset = "pbmc3k") dimnames(dat) <- list( rowData(dat)[["Symbol_TENx"]], dat[["Barcode"]]) dat <- dat |> quickPerCellQC() |> logNormCounts() |> runUMAP() # MAGIC imputation imp <- logcounts(dat) |> as("dgCMatrix") |> impute() # Score with a B cell signature (gene set) options(future.globals.maxSize = 1024^3) b_cell_sig <- list("B_cell" = c("CD19", "MS4A1", "CD79A", "CD79B")) dat[["B cell signature"]] <- Seqtometry::score(imp, b_cell_sig)[["B_cell"]] # Visualize a hallmark B cell gene versus a B cell signature score p1 <- plotReducedDim(dat, "UMAP", color_by = "CD19") p2 <- plotReducedDim(dat, "UMAP", color_by = "B cell signature") wrap_plots(p1, p2, ncol = 2)box::use( TENxPBMCData[TENxPBMCData], SingleCellExperiment[rowData, logcounts], scuttle[quickPerCellQC, logNormCounts], scater[runUMAP, plotReducedDim], patchwork[wrap_plots]) # PBMC data, basic processing pipeline dat <- TENxPBMCData(dataset = "pbmc3k") dimnames(dat) <- list( rowData(dat)[["Symbol_TENx"]], dat[["Barcode"]]) dat <- dat |> quickPerCellQC() |> logNormCounts() |> runUMAP() # MAGIC imputation imp <- logcounts(dat) |> as("dgCMatrix") |> impute() # Score with a B cell signature (gene set) options(future.globals.maxSize = 1024^3) b_cell_sig <- list("B_cell" = c("CD19", "MS4A1", "CD79A", "CD79B")) dat[["B cell signature"]] <- Seqtometry::score(imp, b_cell_sig)[["B_cell"]] # Visualize a hallmark B cell gene versus a B cell signature score p1 <- plotReducedDim(dat, "UMAP", color_by = "CD19") p2 <- plotReducedDim(dat, "UMAP", color_by = "B cell signature") wrap_plots(p1, p2, ncol = 2)
Helper function for performing a weighted Kolmogorov-Smirnov-like procedure.
wks(gex, gss, mus, sds)wks(gex, gss, mus, sds)
gex |
numeric: normalized gene expression values |
gss |
list: indices of genes in each gene set |
mus |
numeric: means of all genes |
sds |
numeric: standard deviations of all genes |
Modified Kuiper statistic (sum of minimal and maximal deviations during running sum)