Title: | Rank-based signature enrichment analysis for single-cell data |
---|---|
Description: | UCell is a package for evaluating gene signatures in single-cell datasets. UCell signature scores, based on the Mann-Whitney U statistic, are robust to dataset size and heterogeneity, and their calculation demands less computing time and memory than other available methods, enabling the processing of large datasets in a few minutes even on machines with limited computing power. UCell can be applied to any single-cell data matrix, and includes functions to directly interact with SingleCellExperiment and Seurat objects. |
Authors: | Massimo Andreatta [aut, cre] , Santiago Carmona [aut] |
Maintainer: | Massimo Andreatta <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 2.11.1 |
Built: | 2024-11-01 03:40:13 UTC |
Source: | https://github.com/bioc/UCell |
Given a Seurat object, calculates module/signature enrichment scores at single-cell level using the Mann-Whitney U statistic. UCell scores are normalized U statistics (between 0 and 1), and they are mathematically related to the Area under the ROC curve (see Mason and Graham)
AddModuleScore_UCell( obj, features, maxRank = 1500, chunk.size = 100, BPPARAM = NULL, ncores = 1, storeRanks = FALSE, w_neg = 1, assay = NULL, slot = "counts", ties.method = "average", force.gc = FALSE, name = "_UCell" )
AddModuleScore_UCell( obj, features, maxRank = 1500, chunk.size = 100, BPPARAM = NULL, ncores = 1, storeRanks = FALSE, w_neg = 1, assay = NULL, slot = "counts", ties.method = "average", force.gc = FALSE, name = "_UCell" )
obj |
Seurat object |
features |
A list of signatures, for example:
|
maxRank |
Maximum number of genes to rank per cell; above this rank, a given gene is considered as not expressed. |
chunk.size |
Number of cells to be processed simultaneously (lower size requires slightly more computation but reduces memory demands) |
BPPARAM |
A |
ncores |
Number of processors to parallelize computation. If
|
storeRanks |
Store ranks matrix in Seurat object ('UCellRanks' assay) for fast subsequent computations. This option may demand large amounts of RAM. |
w_neg |
Weight on negative genes in signature. e.g. |
assay |
Pull out data from this assay of the Seurat object
(if NULL, use |
slot |
Pull out data from this slot (layer in v5) of the Seurat object |
ties.method |
How ranking ties should be resolved - passed on to data.table::frank |
force.gc |
Explicitly call garbage collector to reduce memory footprint |
name |
Name tag that will be appended at the end of each signature name, "_UCell" by default (e.g. signature score in meta data will be named: Myeloid_signature_UCell) |
In contrast to Seurat's AddModuleScore, which is normalized by binning genes of similar expression at the population level, UCell scores depend only on the gene expression ranks of individual cell, and therefore they are robust across datasets regardless of dataset composition.
Returns a Seurat object with module/signature enrichment scores
added to object meta data; each score is stored as the corresponding
signature name provided in features
followed by the tag given
in name
(or "_UCell" by default )
library(UCell) gene.sets <- list(Tcell = c("CD2","CD3E","CD3D"), Myeloid = c("SPI1","FCER1G","CSF1R")) data(sample.matrix) obj <- Seurat::CreateSeuratObject(sample.matrix) obj <- AddModuleScore_UCell(obj,features = gene.sets) head(obj[[]]) ## Using positive and negative gene sets gene.sets <- list() gene.sets$Tcell_gd <- c("TRDC+","TRGC1+","TRGC2+","TRDV1+", "TRAC-","TRBC1-","TRBC2-") gene.sets$NKcell <- c("FGFBP2+", "SPON2+", "KLRF1+", "FCGR3A+", "CD3E-","CD3G-") obj <- AddModuleScore_UCell(obj, features = gene.sets, name=NULL) head(obj$NKcell)
library(UCell) gene.sets <- list(Tcell = c("CD2","CD3E","CD3D"), Myeloid = c("SPI1","FCER1G","CSF1R")) data(sample.matrix) obj <- Seurat::CreateSeuratObject(sample.matrix) obj <- AddModuleScore_UCell(obj,features = gene.sets) head(obj[[]]) ## Using positive and negative gene sets gene.sets <- list() gene.sets$Tcell_gd <- c("TRDC+","TRGC1+","TRGC2+","TRDV1+", "TRAC-","TRBC1-","TRBC2-") gene.sets$NKcell <- c("FGFBP2+", "SPON2+", "KLRF1+", "FCGR3A+", "CD3E-","CD3G-") obj <- AddModuleScore_UCell(obj, features = gene.sets, name=NULL) head(obj$NKcell)
Calculate rankings and scores for query data and given signature set
calculate_Uscore( matrix, features, maxRank = 1500, chunk.size = 100, BPPARAM = NULL, ncores = 1, w_neg = 1, ties.method = "average", storeRanks = FALSE, force.gc = FALSE, name = "_UCell" )
calculate_Uscore( matrix, features, maxRank = 1500, chunk.size = 100, BPPARAM = NULL, ncores = 1, w_neg = 1, ties.method = "average", storeRanks = FALSE, force.gc = FALSE, name = "_UCell" )
matrix |
Input data matrix |
features |
List of signatures |
maxRank |
Rank cutoff (1500) |
chunk.size |
Cells per sub-matrix (100) |
BPPARAM |
A BioParallel object to instruct UCell how to parallelize |
ncores |
Number of cores to use for parallelization |
w_neg |
Weight on negative signatures |
ties.method |
How to break ties, for data.table::frankv method ("average") |
storeRanks |
Store ranks? (FALSE) |
force.gc |
Force garbage collection? (FALSE) |
name |
Suffix for metadata columns ("_UCell") |
A list of signature scores
Check if all genes in signatures are found in data matrix - otherwise add zero counts in data-matrix to complete it
check_genes(matrix, features)
check_genes(matrix, features)
matrix |
Input data matrix |
features |
List of genes that must be present (otherwise they are added) |
Same input matrix, extended to comprise any missing genes
Check signature names and add standard names is missing
check_signature_names(features)
check_signature_names(features)
features |
List of signatures for scoring |
The input list of signatures, with standard names if provided un-named
Calculate per-cell feature rankings
data_to_ranks_data_table(data, ties.method = "average")
data_to_ranks_data_table(data, ties.method = "average")
data |
Expression data matrix |
ties.method |
How to break ties (passed on to data.table::frankv) |
A data.table of ranks
Smoothing scores by KNN
knn_smooth_scores(matrix = NULL, nn = NULL, decay = 0.1, up.only = FALSE)
knn_smooth_scores(matrix = NULL, nn = NULL, decay = 0.1, up.only = FALSE)
matrix |
Input data matrix |
nn |
A nearest neighbor object returned by findKNN |
decay |
Exponential decay for nearest neighbor weight: (1-decay)^n |
up.only |
If set to TRUE, smoothed scores will only be allowed to increase by smoothing |
A dataframe of knn-smoothed scores
Get signature scores from pre-computed rank matrix
rankings2Uscore( ranks_matrix, features, chunk.size = 100, w_neg = 1, BPPARAM = NULL, ncores = 1, force.gc = FALSE, name = "_UCell" )
rankings2Uscore( ranks_matrix, features, chunk.size = 100, w_neg = 1, BPPARAM = NULL, ncores = 1, force.gc = FALSE, name = "_UCell" )
ranks_matrix |
A rank matrix |
features |
List of signatures |
chunk.size |
How many cells per matrix chunk |
w_neg |
Weight on negative signatures |
BPPARAM |
A BioParallel object to instruct UCell how to parallelize |
ncores |
How many cores to use for parallelization? |
force.gc |
Force garbage collection to recover RAM? (FALSE) |
name |
Name suffix for metadata columns ("_UCell") |
A list of signature scores
A sparse matrix (class "dgCMatrix") of single-cell transcriptomes
(scRNA-seq) for 30 cells and 20729 genes.
Single-cell UMI counts were normalized using a standard log-normalization:
counts for each cell were divided by the total counts for that cell and
multiplied by 10,000, then natural-log transformed using log1p
.
This a subsample of T cells from the
large scRNA-seq PBMC dataset published
by Hao et al. and
available as UMI counts at
https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat
sample.matrix
sample.matrix
A sparse matrix of 30 cells and 20729 genes.
https://doi.org/10.1016/j.cell.2021.04.048
Given a gene vs. cell matrix, calculates module/signature enrichment scores on single-cell level using Mann-Whitney U statistic. UCell scores are normalized U statistics (between 0 and 1), and they are mathematically related to the Area under the ROC curve (see Mason and Graham) These scores only depend on the gene expression ranks of individual cell, and therefore they are robust across datasets regardless of dataset composition.
ScoreSignatures_UCell( matrix = NULL, features, precalc.ranks = NULL, maxRank = 1500, w_neg = 1, name = "_UCell", assay = "counts", chunk.size = 100, BPPARAM = NULL, ncores = 1, ties.method = "average", force.gc = FALSE )
ScoreSignatures_UCell( matrix = NULL, features, precalc.ranks = NULL, maxRank = 1500, w_neg = 1, name = "_UCell", assay = "counts", chunk.size = 100, BPPARAM = NULL, ncores = 1, ties.method = "average", force.gc = FALSE )
matrix |
Input matrix, either stored in a SingleCellExperiment object
or as a raw matrix. |
features |
A list of signatures, for example:
|
precalc.ranks |
If you have pre-calculated ranks using
|
maxRank |
Maximum number of genes to rank per cell; above this rank, a
given gene is considered as not expressed. Note: this parameter is
ignored if |
w_neg |
Weight on negative genes in signature. e.g. |
name |
Name suffix appended to signature names |
assay |
The sce object assay where the data is to be found |
chunk.size |
Number of cells to be processed simultaneously (lower size requires slightly more computation but reduces memory demands) |
BPPARAM |
A |
ncores |
Number of processors to parallelize computation. If
|
ties.method |
How ranking ties should be resolved - passed on to data.table::frank |
force.gc |
Explicitly call garbage collector to reduce memory footprint |
Returns input SingleCellExperiment object with UCell scores added to altExp
library(UCell) # Using sparse matrix data(sample.matrix) gene.sets <- list( Tcell_signature = c("CD2","CD3E","CD3D"), Myeloid_signature = c("SPI1","FCER1G","CSF1R")) scores <- ScoreSignatures_UCell(sample.matrix, features=gene.sets) head(scores) # Using sce object library(SingleCellExperiment) data(sample.matrix) my.sce <- SingleCellExperiment(list(counts=sample.matrix)) gene.sets <- list( Tcell_signature = c("CD2","CD3E","CD3D"), Myeloid_signature = c("SPI1","FCER1G","CSF1R")) my.sce <- ScoreSignatures_UCell(my.sce, features=gene.sets) altExp(my.sce, 'UCell')
library(UCell) # Using sparse matrix data(sample.matrix) gene.sets <- list( Tcell_signature = c("CD2","CD3E","CD3D"), Myeloid_signature = c("SPI1","FCER1G","CSF1R")) scores <- ScoreSignatures_UCell(sample.matrix, features=gene.sets) head(scores) # Using sce object library(SingleCellExperiment) data(sample.matrix) my.sce <- SingleCellExperiment(list(counts=sample.matrix)) gene.sets <- list( Tcell_signature = c("CD2","CD3E","CD3D"), Myeloid_signature = c("SPI1","FCER1G","CSF1R")) my.sce <- ScoreSignatures_UCell(my.sce, features=gene.sets) altExp(my.sce, 'UCell')
This function performs smoothing of single-cell scores by weighted average of the k-nearest neighbors. It can be useful to 'impute' scores by neighboring cells and partially correct data sparsity. While this function has been designed to smooth UCell scores, it can be applied to any numerical metadata contained in SingleCellExperiment or Seurat objects
## S3 method for class 'Seurat' SmoothKNN( obj = NULL, signature.names = NULL, reduction = "pca", k = 10, decay = 0.1, up.only = FALSE, BNPARAM = AnnoyParam(), BPPARAM = SerialParam(), suffix = "_kNN", assay = NULL, slot = "data", sce.expname = NULL, sce.assay = NULL ) ## S3 method for class 'SingleCellExperiment' SmoothKNN( obj = NULL, signature.names = NULL, reduction = "PCA", k = 10, decay = 0.1, up.only = FALSE, BNPARAM = AnnoyParam(), BPPARAM = SerialParam(), suffix = "_kNN", assay = NULL, slot = "data", sce.expname = c("UCell", "main"), sce.assay = NULL ) SmoothKNN( obj = NULL, signature.names = NULL, reduction = "pca", k = 10, decay = 0.1, up.only = FALSE, BNPARAM = AnnoyParam(), BPPARAM = SerialParam(), suffix = "_kNN", assay = NULL, slot = "data", sce.expname = c("UCell", "main"), sce.assay = NULL )
## S3 method for class 'Seurat' SmoothKNN( obj = NULL, signature.names = NULL, reduction = "pca", k = 10, decay = 0.1, up.only = FALSE, BNPARAM = AnnoyParam(), BPPARAM = SerialParam(), suffix = "_kNN", assay = NULL, slot = "data", sce.expname = NULL, sce.assay = NULL ) ## S3 method for class 'SingleCellExperiment' SmoothKNN( obj = NULL, signature.names = NULL, reduction = "PCA", k = 10, decay = 0.1, up.only = FALSE, BNPARAM = AnnoyParam(), BPPARAM = SerialParam(), suffix = "_kNN", assay = NULL, slot = "data", sce.expname = c("UCell", "main"), sce.assay = NULL ) SmoothKNN( obj = NULL, signature.names = NULL, reduction = "pca", k = 10, decay = 0.1, up.only = FALSE, BNPARAM = AnnoyParam(), BPPARAM = SerialParam(), suffix = "_kNN", assay = NULL, slot = "data", sce.expname = c("UCell", "main"), sce.assay = NULL )
obj |
Input object - either a SingleCellExperiment object or a Seurat object. |
signature.names |
The names of the signatures (or any numeric metadata column) for which to calculate kNN-smoothed scores |
reduction |
Which dimensionality reduction to use for kNN smoothing. It must be already present in the input object. |
k |
Number of neighbors for kNN smoothing |
decay |
Exponential decay for nearest neighbor weight: (1-decay)^n |
up.only |
If set to TRUE, smoothed scores will only be allowed to increase by smoothing |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use for kNN calculation. |
BPPARAM |
A |
suffix |
Suffix to append to metadata columns for the new knn-smoothed scores |
assay |
For Seurat objects only - do smoothing on expression data from this assay. When NULL, only looks in metadata |
slot |
For Seurat objects only - do smoothing on expression data from this slot |
sce.expname |
For sce objects only - which experiment stores the signatures to be smoothed. Set to 'main' for smoothing gene expression stored in the main sce experiment. |
sce.assay |
For sce objects only - pull data from this assay |
An augmented obj
with the smoothed signatures. If obj
is a Seurat object, smoothed signatures are added to metadata; if
obj
is a SingleCellExperiment object, smoothed signatures are
returned in a new altExp. See the examples below.
#### Using Seurat #### library(Seurat) gene.sets <- list(Tcell = c("CD2","CD3E","CD3D"), Myeloid = c("SPI1","FCER1G","CSF1R")) data(sample.matrix) obj <- Seurat::CreateSeuratObject(sample.matrix) # Calculate UCell scores obj <- AddModuleScore_UCell(obj,features = gene.sets, name=NULL) # Run PCA obj <- FindVariableFeatures(obj) |> NormalizeData() |> ScaleData() |> RunPCA(npcs=5) # Smooth signatures obj <- SmoothKNN(obj, k=3, signature.names=names(gene.sets)) head(obj[[]]) #### Using SingleCellExperiment #### library(SingleCellExperiment) library(scater) data(sample.matrix) sce <- SingleCellExperiment(list(counts=sample.matrix)) gene.sets <- list( Tcell = c("CD2","CD3E","CD3D"), Myeloid = c("SPI1","FCER1G","CSF1R")) # Calculate UCell scores sce <- ScoreSignatures_UCell(sce, features=gene.sets, name=NULL) # Run PCA sce <- logNormCounts(sce) sce <- runPCA(sce, scale=TRUE, ncomponents=5) # Smooth signatures sce <- SmoothKNN(sce, k=3, signature.names=names(gene.sets)) # See results altExp(sce, 'UCell') assays(altExp(sce, 'UCell')) # Plot on UMAP sce <- runUMAP(sce, dimred="PCA") plotUMAP(sce, colour_by = "Tcell_kNN", by_exprs_values = "UCell_kNN")
#### Using Seurat #### library(Seurat) gene.sets <- list(Tcell = c("CD2","CD3E","CD3D"), Myeloid = c("SPI1","FCER1G","CSF1R")) data(sample.matrix) obj <- Seurat::CreateSeuratObject(sample.matrix) # Calculate UCell scores obj <- AddModuleScore_UCell(obj,features = gene.sets, name=NULL) # Run PCA obj <- FindVariableFeatures(obj) |> NormalizeData() |> ScaleData() |> RunPCA(npcs=5) # Smooth signatures obj <- SmoothKNN(obj, k=3, signature.names=names(gene.sets)) head(obj[[]]) #### Using SingleCellExperiment #### library(SingleCellExperiment) library(scater) data(sample.matrix) sce <- SingleCellExperiment(list(counts=sample.matrix)) gene.sets <- list( Tcell = c("CD2","CD3E","CD3D"), Myeloid = c("SPI1","FCER1G","CSF1R")) # Calculate UCell scores sce <- ScoreSignatures_UCell(sce, features=gene.sets, name=NULL) # Run PCA sce <- logNormCounts(sce) sce <- runPCA(sce, scale=TRUE, ncomponents=5) # Smooth signatures sce <- SmoothKNN(sce, k=3, signature.names=names(gene.sets)) # See results altExp(sce, 'UCell') assays(altExp(sce, 'UCell')) # Plot on UMAP sce <- runUMAP(sce, dimred="PCA") plotUMAP(sce, colour_by = "Tcell_kNN", by_exprs_values = "UCell_kNN")
Split data matrix into smaller sub-matrices ('chunks')
split_data.matrix(matrix, chunk.size = 100)
split_data.matrix(matrix, chunk.size = 100)
matrix |
Input data matrix |
chunk.size |
How many cells to include in each sub-matrix |
A list of sub-matrices, each with size (n_features x chunk_size)
Given a gene vs. cell matrix, calculates the rankings of expression for all genes in each cell.
StoreRankings_UCell( matrix, maxRank = 1500, chunk.size = 100, BPPARAM = NULL, ncores = 1, assay = "counts", ties.method = "average", force.gc = FALSE )
StoreRankings_UCell( matrix, maxRank = 1500, chunk.size = 100, BPPARAM = NULL, ncores = 1, assay = "counts", ties.method = "average", force.gc = FALSE )
matrix |
Input matrix, either stored in a SingleCellExperiment object
or as a raw matrix. |
maxRank |
Maximum number of genes to rank per cell; above this rank, a given gene is considered as not expressed |
chunk.size |
Number of cells to be processed simultaneously (lower size requires slightly more computation but reduces memory demands) |
BPPARAM |
A |
ncores |
Number of processors to parallelize computation. If
|
assay |
Assay where the data is to be found (for input in 'sce' format) |
ties.method |
How ranking ties should be resolved - passed on to data.table::frank |
force.gc |
Explicitly call garbage collector to reduce memory footprint |
While ScoreSignatures_UCell
can be used 'on the fly' to
evaluate signatures in a query dataset, it requires recalculating gene
ranks at every execution. If you have a large dataset and plan to experiment
with multiple signatures, evaluating the same dataset multiple times,
this function allows you to store pre-calculated ranks so they do not have to
be recomputed every time. Pre-calculated ranks can then be applied to the
function ScoreSignatures_UCell
to evaluate gene signatures in a
significantly faster way on successive iterations.
Returns a sparse matrix of pre-calculated ranks that can be used multiple times to evaluate different signatures
library(UCell) data(sample.matrix) ranks <- StoreRankings_UCell(sample.matrix) ranks[1:5,1:5] gene.sets <- list( Tcell_signature = c("CD2","CD3E","CD3D"), Myeloid_signature = c("SPI1","FCER1G","CSF1R")) scores <- ScoreSignatures_UCell(features=gene.sets, precalc.ranks=ranks) head(scores)
library(UCell) data(sample.matrix) ranks <- StoreRankings_UCell(sample.matrix) ranks[1:5,1:5] gene.sets <- list( Tcell_signature = c("CD2","CD3E","CD3D"), Myeloid_signature = c("SPI1","FCER1G","CSF1R")) scores <- ScoreSignatures_UCell(features=gene.sets, precalc.ranks=ranks) head(scores)
Maximum sum of ranks, rank_sum_max: len_sig * max_Rank Minimum sum of ranks, rank_sum_min: len_sig * (len_sig + 1)/2 Maximum U statistic, Umax: Maximum sum of ranks - Minimum sum of ranks Minimum U statistic, Umin: 0 Normalized U statistic (0 to 1), Unorm: (U - Umin)/(Umax- Umin) = U/Umax UCell score (0 to 1): 1 - Unorm
u_stat(rank_value, maxRank = 1000, sparse = FALSE)
u_stat(rank_value, maxRank = 1000, sparse = FALSE)
rank_value |
A vector of ranks |
maxRank |
Max number of features to include in ranking |
sparse |
Whether the vector of ranks is in sparse format |
Any rank > maxRank is set to maxRank
Normalized U statistic for the vector of ranks
Calculate U scores for a list of signatures, given a rank matrix
u_stat_signature_list( sig_list, ranks_matrix, maxRank = 1000, sparse = FALSE, w_neg = 1 )
u_stat_signature_list( sig_list, ranks_matrix, maxRank = 1000, sparse = FALSE, w_neg = 1 )
sig_list |
A list of signatures |
ranks_matrix |
Matrix of pre-computed ranks |
maxRank |
Max number of features to include in ranking, for u_stat function |
sparse |
Whether the vector of ranks is in sparse format |
w_neg |
Weight on negative signatures |
A matrix of U scores
UCell is an R package for scoring gene signatures in single-cell datasets.
UCell scores, based on the Mann-Whitney U statistic, are robust to dataset
size and heterogeneity, and their calculation demands relatively less
computing time and memory than most other methods, enabling the processing
of large datasets (> cells). UCell can be applied to any cell
vs. gene data matrix, and includes functions to directly interact with
Seurat and SingleCellExperiment objects.
ScoreSignatures_UCell
Calculate module enrichment scores
from single-cell data. Given a gene vs. cell matrix (either as sparse
matrix or stored in a SingleCellExperiment object), it calculates
module/signature enrichment scores. This score depends only on the
gene activity ranks of individual cell, and therefore is robust across
datasets.
AddModuleScore_UCell
A wrapper for UCell to interact
directly with Seurat objects. Given a Seurat object and a set of
signatures, it calculates enrichment scores on single-cell level
and returns them into the meta.data of the input Seurat object.
StoreRankings_UCell
Calculates and stores gene rankings
for a single-cell dataset. Given a gene vs. cell matrix and a set of
signatures, it calculates the rankings of expression for all genes in
each cell. It can then be applied to the function ScoreSignatures_UCell
to evaluate gene signatures on the gene expression ranks of individual
cells.
SmoothKNN
Perform signature score smoothing using a
weighted average of the scores of the first k nearest neighbors (kNN).
It can be useful to 'impute' scores by neighboring cells and partially
correct data sparsity. While this function has been designed to smooth
UCell scores, it can be applied to any numerical metadata contained in
SingleCellExperiment or Seurat objects
UCell evaluates the strength of gene signatures (or gene sets) in individual cells of your dataset. You may specify positive and negative (up- or down-regulated) genes in signatures. See the examples below:
markers <- list() markers$Tcell_CD4 <- c("CD4","CD40LG") markers$Tcell_CD8 <- c("CD8A","CD8B") markers$Tcell_Treg <- c("FOXP3","IL2RA") markers$Tcell_gd <- c("TRDC+", "TRGC1+", "TRGC2+", "TRDV1+","TRAC-","TRBC1-","TRBC2-") markers$Tcell_NK <- c("FGFBP2+", "SPON2+", "KLRF1+", "FCGR3A+", "CD3E-","CD3G-")
If you don't specify +/- for genes, they are assumed to be all as a positive set. The UCell score is calculated as:
where and
are respectively the UCell scores for the
positive and negative set, and
is a weight on the negative set.
When no negative set of genes is present,
Maintainer: Massimo Andreatta [email protected] (ORCID)
Authors:
Santiago Carmona [email protected] (ORCID)
UCell: robust and scalable single-cell gene signature scoring. Massimo Andreatta & Santiago J Carmona (2021) CSBJ https://doi.org/10.1016/j.csbj.2021.06.043
Useful links: