| Title: | Rank-based reference mapping for fast and robust cell type annotation in spatial and single-cell transcriptomics |
|---|---|
| Description: | RankMap is a fast and scalable tool for reference-based cell type annotation of single-cell and spatial transcriptomics data. It uses ranked gene expression and multinomial regression to achieve robust predictions, even with partial gene coverage. Compatible with Seurat, SingleCellExperiment, and SpatialExperiment objects, RankMap offers flexible preprocessing and significantly faster runtime than tools like SingleR, Azimuth, and RCTD. |
| Authors: | Jinming Cheng [aut, cre] (ORCID: <https://orcid.org/0000-0003-3806-4694>) |
| Maintainer: | Jinming Cheng <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.1.1 |
| Built: | 2026-05-11 10:39:25 UTC |
| Source: | https://github.com/bioc/RankMap |
This function transforms a gene expression matrix by applying top-k filtering, optional rank transformation, binning, scaling, and/or expression weighting.
computeRankedMatrix( data, weight_by_expr = TRUE, rank_zeros = FALSE, bin_rank = TRUE, scale_rank = TRUE, k = 20, use_data = FALSE )computeRankedMatrix( data, weight_by_expr = TRUE, rank_zeros = FALSE, bin_rank = TRUE, scale_rank = TRUE, k = 20, use_data = FALSE )
data |
A gene expression matrix with genes as rows and cells
(or spatial spots) as columns.
Can be a dense numeric matrix, or a sparse |
weight_by_expr |
Logical. Whether to weight the ranks by
log-transformed expression values.
Default is |
rank_zeros |
Logical. Whether to include zero values in the ranking.
Default is |
bin_rank |
Logical. Whether to discretize ranks into bins.
Default is |
scale_rank |
Logical. Whether to z-score normalize the ranks
across columns. Default is |
k |
Integer. Number of top expressed genes to retain per column.
Default is |
use_data |
Logical. If |
If use_data = TRUE, the function bypasses all transformation
steps and returns the full input expression matrix.
This is useful for benchmarking performance
of rank-based versus raw log-normalized expression models.
A numeric matrix of ranked expression values
(or the original expression matrix if use_data = TRUE).
Output is always returned as a dense matrix.
maskTopKGenes,
trainRankModel,
predictRankModel
mat <- matrix(runif(1000), nrow = 100) ranked <- computeRankedMatrix(mat, k = 10) raw_expr <- computeRankedMatrix(mat, use_data = TRUE)mat <- matrix(runif(1000), nrow = 100) ranked <- computeRankedMatrix(mat, k = 10) raw_expr <- computeRankedMatrix(mat, use_data = TRUE)
Computes overall accuracy, per-class accuracy, and confusion matrix given predicted vs. true labels.
evaluatePredictionPerformance( prediction_df = NULL, truth = NULL, low_conf_label = "unknown" )evaluatePredictionPerformance( prediction_df = NULL, truth = NULL, low_conf_label = "unknown" )
prediction_df |
Output from |
truth |
A vector of true labels matching |
low_conf_label |
Character. Label used for low-confidence predictions
(e.g., |
A list with:
overall_accuracy |
Proportion of correct predictions |
per_class_accuracy |
Accuracy per true cell type |
confusion_matrix |
Contingency table (true × predicted) |
# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Read in Xenium spatial data seu_xen <- readRDS(system.file("extdata", "seu_xen.rds", package = "RankMap" )) # Predict cell type for Xenium data prediction_df <- RankMap( ref_data = seu_sc, ref_labels = seu_sc$cell_type, new_data = seu_xen ) performance <- evaluatePredictionPerformance( prediction_df = prediction_df, truth = seu_xen$cell_type_SingleR ) performance# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Read in Xenium spatial data seu_xen <- readRDS(system.file("extdata", "seu_xen.rds", package = "RankMap" )) # Predict cell type for Xenium data prediction_df <- RankMap( ref_data = seu_sc, ref_labels = seu_sc$cell_type, new_data = seu_xen ) performance <- evaluatePredictionPerformance( prediction_df = prediction_df, truth = seu_xen$cell_type_SingleR ) performance
Extracts a gene expression matrix from a Seurat object
(default assay and "data" slot),
a SummarizedExperiment object (assay named "logcounts"),
or returns the input matrix if it is already a dense or sparse matrix.
extractData(data)extractData(data)
data |
A |
A numeric gene expression matrix (dense or sparse), with genes as rows and cells or spots as columns.
seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # From Seurat object: mat <- extractData(seu_sc)seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # From Seurat object: mat <- extractData(seu_sc)
Takes a vector, converts it to a factor, and reorders its levels by their frequency of occurrence (most to least frequent by default).
factorSorted(x, decreasing = TRUE)factorSorted(x, decreasing = TRUE)
x |
A vector (character or factor) to be reordered by level frequency. |
decreasing |
Logical. If TRUE (default), levels are ordered from most to least frequent; if FALSE, from least to most frequent. |
Frequencies are computed with table(x), which ignores NA values
by default. NA entries in x are preserved in the output but are
not considered a level.
A factor with levels ordered by frequency.
factorSorted(c("a", "b", "a", "c", "b", "a")) factorSorted(c("a", "b", "a", "c", "b", "a"), decreasing = FALSE) factorSorted(factor(c("x", "y", "x", NA)))factorSorted(c("a", "b", "a", "c", "b", "a")) factorSorted(c("a", "b", "a", "c", "b", "a"), decreasing = FALSE) factorSorted(factor(c("x", "y", "x", NA)))
Replaces low-confidence cell type predictions with a placeholder label (e.g., "unknown") and adds a confidence status column ("confident" or "uncertain").
filterLowConfidenceCells( prediction_df, threshold = 0.5, low_conf_label = "unknown", keep_confidence = TRUE )filterLowConfidenceCells( prediction_df, threshold = 0.5, low_conf_label = "unknown", keep_confidence = TRUE )
prediction_df |
A data frame from |
threshold |
Numeric. Confidence threshold below which predictions
are flagged. Default is |
low_conf_label |
Character. Replacement for low-confidence predictions.
Default is |
keep_confidence |
Logical. Whether to retain the original
|
A data frame with columns:
cell_id,
predicted_cell_type,
status,
nd optionally confidence.
# Simulated predictions with confidence pred_df <- data.frame( cell_id = paste0("cell", 1:5), predicted_cell_type = c("B", "T", "B", "Myeloid", "T"), confidence = c(0.92, 0.47, 0.88, 0.33, 0.76) ) # Apply threshold of 0.5 to flag low-confidence cells result <- filterLowConfidenceCells(pred_df, threshold = 0.5) # Show result print(result) # Remove confidence column and use custom label # for low-confidence predictions result2 <- filterLowConfidenceCells(pred_df, threshold = 0.6, low_conf_label = "low_conf", keep_confidence = FALSE )# Simulated predictions with confidence pred_df <- data.frame( cell_id = paste0("cell", 1:5), predicted_cell_type = c("B", "T", "B", "Myeloid", "T"), confidence = c(0.92, 0.47, 0.88, 0.33, 0.76) ) # Apply threshold of 0.5 to flag low-confidence cells result <- filterLowConfidenceCells(pred_df, threshold = 0.5) # Show result print(result) # Remove confidence column and use custom label # for low-confidence predictions result2 <- filterLowConfidenceCells(pred_df, threshold = 0.6, low_conf_label = "low_conf", keep_confidence = FALSE )
This function zeroes out all but the top-k most highly expressed genes in each column of a gene expression matrix. It supports both dense and sparse inputs.
maskTopKGenes(data, k = 20)maskTopKGenes(data, k = 20)
data |
A gene expression matrix with genes as rows and cells
(or spatial spots) as columns.
Can be a dense numeric matrix, or a sparse |
k |
Integer. Number of top expressed genes to retain per column.
Default is |
A dense numeric matrix with the same dimensions as data,
with only the top-k values retained per column.
mat <- matrix(runif(1000), nrow = 100) masked <- maskTopKGenes(mat, k = 10)mat <- matrix(runif(1000), nrow = 100) masked <- maskTopKGenes(mat, k = 10)
Finds the confidence threshold that best balances prediction accuracy and retention rate. Requires ground truth labels for comparison.
optimizeConfidenceThreshold( prediction_df, truth, thresholds = seq(0.1, 0.9, by = 0.05), plot = TRUE )optimizeConfidenceThreshold( prediction_df, truth, thresholds = seq(0.1, 0.9, by = 0.05), plot = TRUE )
prediction_df |
A data frame from |
truth |
A character or factor vector of true labels,
aligned with rows of |
thresholds |
Numeric vector of thresholds to test. Default is seq(0.1, 0.9, by = 0.05). |
plot |
Logical. If |
A data frame summarizing accuracy and retained cell count at each threshold.
# Simulated prediction data pred_df <- data.frame( cell_id = paste0("cell", 1:10), predicted_cell_type = c( "A", "B", "A", "B", "A", "B", "A", "A", "B", "B" ), confidence = c( 0.95, 0.87, 0.65, 0.48, 0.92, 0.55, 0.73, 0.33, 0.99, 0.60 ) ) # Ground truth labels truth <- c("A", "B", "A", "B", "B", "B", "A", "A", "B", "B") # Evaluate how accuracy and coverage change with threshold summary_df <- optimizeConfidenceThreshold(pred_df, truth, plot = TRUE) # View result print(summary_df)# Simulated prediction data pred_df <- data.frame( cell_id = paste0("cell", 1:10), predicted_cell_type = c( "A", "B", "A", "B", "A", "B", "A", "A", "B", "B" ), confidence = c( 0.95, 0.87, 0.65, 0.48, 0.92, 0.55, 0.73, 0.33, 0.99, 0.60 ) ) # Ground truth labels truth <- c("A", "B", "A", "B", "B", "B", "A", "A", "B", "B") # Evaluate how accuracy and coverage change with threshold summary_df <- optimizeConfidenceThreshold(pred_df, truth, plot = TRUE) # View result print(summary_df)
Predicts cell type labels or class probabilities using a trained
RankMap model. Supports both glmnet and cv.glmnet.
Optionally returns a data frame with predicted cell types
and confidence scores.
predictRankModel( model, new_data, lambda = NULL, return_probs = FALSE, return_confidence = FALSE, ... )predictRankModel( model, new_data, lambda = NULL, return_probs = FALSE, return_confidence = FALSE, ... )
model |
A fitted model from |
new_data |
A gene expression matrix with genes as rows and cells
(or spatial spots) as columns.
Can be a |
lambda |
Optional lambda value. If |
return_probs |
Logical. If |
return_confidence |
Logical. If |
... |
Additional arguments passed to
|
A character vector (default),
a matrix (if return_probs = TRUE),
or a data frame (if return_confidence = TRUE).
# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Read in Xenium spatial data seu_xen <- readRDS(system.file("extdata", "seu_xen.rds", package = "RankMap" )) # Extract normalized expression data common_genes <- intersect(rownames(seu_sc), rownames(seu_xen)) mat <- extractData(seu_sc)[common_genes, ] new_mat <- extractData(seu_xen)[common_genes, ] # Train a model set.seed(42) model <- trainRankModel(mat, seu_sc$cell_type) # Predict cell type pred <- predictRankModel(model, new_mat) table(predict = pred, truth = seu_xen$cell_type_SingleR)# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Read in Xenium spatial data seu_xen <- readRDS(system.file("extdata", "seu_xen.rds", package = "RankMap" )) # Extract normalized expression data common_genes <- intersect(rownames(seu_sc), rownames(seu_xen)) mat <- extractData(seu_sc)[common_genes, ] new_mat <- extractData(seu_xen)[common_genes, ] # Train a model set.seed(42) model <- trainRankModel(mat, seu_sc$cell_type) # Predict cell type pred <- predictRankModel(model, new_mat) table(predict = pred, truth = seu_xen$cell_type_SingleR)
A unified function that trains a multinomial classifier on reference expression data and predicts cell types for a new dataset using top-k ranked genes. Automatically matches genes between datasets, optionally performs cross-validation, and returns predictions with confidence scores or probabilities.
RankMap( ref_data = NULL, ref_labels = NULL, new_data = NULL, n_feature_max = 500, k = 20, alpha = 0.1, cv = FALSE, nfolds = 5, lambda = NULL, return_probs = FALSE, return_confidence = TRUE, threshold = NULL, return_model = FALSE, ... )RankMap( ref_data = NULL, ref_labels = NULL, new_data = NULL, n_feature_max = 500, k = 20, alpha = 0.1, cv = FALSE, nfolds = 5, lambda = NULL, return_probs = FALSE, return_confidence = TRUE, threshold = NULL, return_model = FALSE, ... )
ref_data |
Reference gene expression matrix (genes x cells),
a |
ref_labels |
A character or factor vector of cell type labels for
columns of |
new_data |
New data to annotate. Same format as |
n_feature_max |
Maximum number of genes to use when more than
500 genes are shared. Default is |
k |
Number of top expressed genes to retain per cell (ranking).
Default is |
alpha |
Elastic net mixing parameter for |
cv |
Logical. Whether to use |
nfolds |
Number of folds for cross-validation. Default is |
lambda |
Optional lambda value for prediction. If |
return_probs |
Logical. If |
return_confidence |
Logical. If |
threshold |
Optional numeric threshold.
If set and |
return_model |
Logical. If |
... |
Additional arguments passed to
|
A data frame of predictions (by default),
or a list with elements predictions and
model if return_model = TRUE.
# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Read in Xenium spatial data seu_xen <- readRDS(system.file("extdata", "seu_xen.rds", package = "RankMap" )) # Predict cell type for spatial data using single-cell data as reference pred_df <- RankMap( ref_data = seu_sc, ref_labels = seu_sc$cell_type, new_data = seu_xen ) head(pred_df)# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Read in Xenium spatial data seu_xen <- readRDS(system.file("extdata", "seu_xen.rds", package = "RankMap" )) # Predict cell type for spatial data using single-cell data as reference pred_df <- RankMap( ref_data = seu_sc, ref_labels = seu_sc$cell_type, new_data = seu_xen ) head(pred_df)
This function samples a specified total number of cells from a metadata table, ensuring that each cell type is represented by at least a minimum number of cells.
sampleCellsByType(cell_metadata, n_total_cells = 5000, min_per_type = 50)sampleCellsByType(cell_metadata, n_total_cells = 5000, min_per_type = 50)
cell_metadata |
A data frame containing at least two columns:
|
n_total_cells |
Total number of cells to sample.
Default is |
min_per_type |
Minimum number of cells to sample from each
|
A data frame of sampled cells with at least
min_per_type cells per cell_type.
meta <- data.frame( cell_id = paste0("cell", 1:10000), cell_type = sample(c("T", "B", "Mac"), 10000, replace = TRUE) ) set.seed(42) sampled <- sampleCellsByType(meta, n_total_cells = 1000, min_per_type = 50 )meta <- data.frame( cell_id = paste0("cell", 1:10000), cell_type = sample(c("T", "B", "Mac"), 10000, replace = TRUE) ) set.seed(42) sampled <- sampleCellsByType(meta, n_total_cells = 1000, min_per_type = 50 )
Identify the top n highly expressed genes for each cell type
based on average expression across cells. This function is useful for
summarizing representative genes from a reference dataset and can be
optionally used to restrict features for downstream analysis.
topCellTypeGenes(data, cell_type, n = 50, genes = NULL, return_list = FALSE)topCellTypeGenes(data, cell_type, n = 50, genes = NULL, return_list = FALSE)
data |
A gene expression object. Supported inputs include
|
cell_type |
A vector of cell type labels corresponding to columns
of |
n |
Integer. Number of top expressed genes to return for each
cell type. Default is |
genes |
Optional character vector of gene names to restrict the
analysis. Only genes present in |
return_list |
Logical. If |
For each cell type, genes are ranked by their average expression
(computed using Matrix::rowMeans) across cells of that type.
Cells with missing (NA) cell type labels are removed prior to
computation.
If return_list = TRUE, a named list where each element contains
the top n genes for a cell type. Otherwise, a character vector
of unique top genes across all cell types.
# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Get top genes across all cell types top_genes <- topCellTypeGenes( data = seu_sc, cell_type = seu_sc$cell_type, n = 50 ) # Get top genes per cell type top_list <- topCellTypeGenes( data = seu_sc, cell_type = seu_sc$cell_type, n = 50, return_list = TRUE )# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Get top genes across all cell types top_genes <- topCellTypeGenes( data = seu_sc, cell_type = seu_sc$cell_type, n = 50 ) # Get top genes per cell type top_list <- topCellTypeGenes( data = seu_sc, cell_type = seu_sc$cell_type, n = 50, return_list = TRUE )
This function trains a multinomial logistic regression model using ranked gene expression. Optionally performs cross-validation to select the optimal regularization parameter.
trainRankModel( data = NULL, labels = NULL, alpha = 0.1, cv = FALSE, nfolds = 5, ... )trainRankModel( data = NULL, labels = NULL, alpha = 0.1, cv = FALSE, nfolds = 5, ... )
data |
A gene expression matrix with genes as rows and cells
(or spatial spots) as columns.
Can be a |
labels |
A vector of cell type labels (one per column of |
alpha |
Elastic net mixing parameter. 1 = LASSO, 0 = Ridge.
Default is |
cv |
Logical. If |
nfolds |
Number of folds for cross-validation
(only used if |
... |
Additional arguments passed to |
A fitted glmnet or cv.glmnet model object.
computeRankedMatrix, predictRankModel
# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Extract normalized expression data mat <- extractData(seu_sc) # Train a model set.seed(42) model <- trainRankModel(mat, seu_sc$cell_type) # Train a model with cross-validation model <- trainRankModel( data = mat, labels = seu_sc$cell_type, cv = TRUE, nfolds = 3 )# Read in single-cell reference data seu_sc <- readRDS(system.file("extdata", "seu_sc.rds", package = "RankMap" )) # Extract normalized expression data mat <- extractData(seu_sc) # Train a model set.seed(42) model <- trainRankModel(mat, seu_sc$cell_type) # Train a model with cross-validation model <- trainRankModel( data = mat, labels = seu_sc$cell_type, cv = TRUE, nfolds = 3 )