Gene signature scoring with UCell

Introduction

In single-cell RNA-seq analysis, gene signature (or “module”) scoring is a simple yet powerful approach to evaluate the strength of biological signals, typically associated with a specific cell type or biological process, in a transcriptome.

UCell is an R package for evaluating gene signatures in single-cell datasets. UCell signature scores, based on the Mann-Whitney U statistic, are robust to dataset size and heterogeneity, and their calculation demands less computing time and memory than other available methods, enabling the processing of large datasets in a few minutes even on machines with limited computing power. UCell can be applied to any single-cell data matrix, and includes functions to directly interact with Seurat objects.
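
To make the idea of a rank-based score concrete, here is a minimal base-R sketch for a single cell. It assumes the score takes the form U' = 1 - U / (n * maxRank), with U = sum(ranks) - n(n+1)/2 and ranks capped at maxRank + 1, as described in the UCell publication. The helper u_score_one_cell and the toy expression values are hypothetical and are not part of the UCell API; the package implementation may handle edge cases differently.

```r
# Toy illustration of a U-statistic signature score for one cell.
# Hypothetical helper for illustration only; not part of the UCell API.
u_score_one_cell <- function(expr, signature, max_rank = 1500) {
    # Rank genes by decreasing expression (rank 1 = most highly expressed)
    ranks <- rank(-expr, ties.method = "average")
    # Genes ranked beyond max_rank are all assigned max_rank + 1
    ranks[ranks > max_rank] <- max_rank + 1
    r <- ranks[intersect(signature, names(expr))]
    n <- length(r)
    u <- sum(r) - n * (n + 1) / 2
    1 - u / (n * max_rank)
}

# Made-up expression values for six genes in one cell
expr <- c(CD3D = 10, CD3E = 8, CD2 = 6, LYZ = 0, SPI1 = 0, GAPDH = 4)

t.score <- u_score_one_cell(expr, c("CD3D", "CD3E", "CD2"), max_rank = 4)
m.score <- u_score_one_cell(expr, c("LYZ", "SPI1"), max_rank = 4)
t.score  # 1: the signature genes occupy the top three ranks
m.score  # 0.125: both genes fall beyond max_rank
```

Because only the ranks within each cell enter the calculation, a score of this kind does not depend on the other cells in the dataset.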

Quick start

To test your installation, load a small sample dataset and run UCell:

library(UCell)

data(sample.matrix)
gene.sets <- list(Tcell_signature = c("CD2","CD3E","CD3D"),
                  Myeloid_signature = c("SPI1","FCER1G","CSF1R"))

scores <- ScoreSignatures_UCell(sample.matrix, features=gene.sets)
head(scores)
##                       Tcell_signature_UCell Myeloid_signature_UCell
## L5_ATTTCTGAGGTCGTGA               0.8991989                       0
## L4_TCACTATTCATCTCTA               0.6900312                       0
## L1_TCCTTCTTCTTTACAC               0.5932354                       0
## L5_AAAGTGAAGGCGCTCT               0.6090343                       0
## E2L3_CCTCAGTAGTGCAGGT             0.8506898                       0
## L5_CCCTCTCGTTCTAAGC               0.6557632                       0

Get some testing data

For this demo, we will download a single-cell dataset of lung cancer (Zilionis et al. (2019) Immunity) through the scRNA-seq package. This dataset contains >170,000 single cells; for simplicity, in this demo we will focus on immune cells, according to the authors' annotations, and downsample to 5000 cells.

library(scRNAseq)

lung <- ZilionisLungData()
immune <- lung$Used & lung$used_in_NSCLC_immune
lung <- lung[,immune]
lung <- lung[,1:5000]

exp.mat <- Matrix::Matrix(counts(lung),sparse = TRUE)
colnames(exp.mat) <- paste0(colnames(exp.mat), seq(1,ncol(exp.mat)))
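
The code above keeps the first 5000 immune cells. If a random subsample is preferred, a minimal sketch on a toy matrix could look like the following; the object names (toy, n.keep, toy.sub) are hypothetical and the same indexing pattern would apply to the columns of a larger object.

```r
set.seed(42)  # make the random subsample reproducible

# Toy count matrix: 10 genes (rows) x 20 cells (columns)
toy <- matrix(rpois(200, lambda = 1), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("cell", 1:20)))

# Draw cell indices without replacement, keeping the original column order
n.keep <- min(5, ncol(toy))
keep <- sort(sample(ncol(toy), n.keep))
toy.sub <- toy[, keep, drop = FALSE]
dim(toy.sub)
```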

Define gene signatures

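Here we define some simple gene sets based on the “Human Cell Landscape” signatures (Han et al. (2020) Nature). You may edit existing signatures, or add new ones as elements in a list. Genes followed by a "-" sign (e.g. "CD3D-") are treated as negative markers, i.e. genes that are not expected to be expressed in the cell type of interest.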

signatures <- list(
    Tcell = c("CD3D","CD3E","CD3G","CD2","TRAC"),
    Myeloid = c("CD14","LYZ","CSF1R","FCER1G","SPI1","LCK-"),
    NK = c("KLRD1","NCR1","NKG7","CD3D-","CD3E-"),
    Plasma_cell = c("MZB1","DERL3","CD19-")
)
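
As a small illustration of editing the list, the sketch below appends a new signature and separates positive from negative genes by their "-" suffix. The Bcell entry and the helper split_signature are hypothetical and only meant to show the list structure; UCell performs its own parsing internally.

```r
signatures <- list(
    Tcell = c("CD3D","CD3E","CD3G","CD2","TRAC"),
    NK = c("KLRD1","NCR1","NKG7","CD3D-","CD3E-")
)

# Add a new signature as another named element (illustrative B cell markers)
signatures$Bcell <- c("MS4A1","CD79A","CD3E-")

# Hypothetical helper: split a signature into positive and negative genes
split_signature <- function(genes) {
    is.neg <- grepl("-$", genes)
    list(positive = sub("\\+$", "", genes[!is.neg]),
         negative = sub("-$", "", genes[is.neg]))
}

nk <- split_signature(signatures$NK)
nk$positive
nk$negative
```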

Run UCell

Run ScoreSignatures_UCell to obtain signature scores for all cells directly:

u.scores <- ScoreSignatures_UCell(exp.mat,features=signatures)
head(u.scores)
##         Tcell_UCell Myeloid_UCell NK_UCell Plasma_cell_UCell
## bcHTNA1           0     0.5227121        0        0.00000000
## bcHNVA2           0     0.5112892        0        0.00000000
## bcALZN3           0     0.3584502        0        0.07540874
## bcFWBP4           0     0.1546426        0        0.00000000
## bcBJYE5           0     0.4629927        0        0.00000000
## bcGSBJ6           0     0.5452238        0        0.00000000

Show the distribution of predicted scores

library(reshape2)
library(ggplot2)
melted <- reshape2::melt(u.scores)
colnames(melted) <- c("Cell","Signature","UCell_score")
p <- ggplot(melted, aes(x=Signature, y=UCell_score)) + 
    geom_violin(aes(fill=Signature), scale = "width") +
    geom_boxplot(width=0.1, outlier.size=0) +
    theme_bw() + theme(axis.text.x=element_blank())
p
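
The plotting code relies on converting the cell-by-signature score matrix to long format. If reshape2 is not available, a base-R equivalent can be sketched as follows; the toy.scores matrix and its values are made up for illustration.

```r
# Made-up score matrix: cells in rows, signatures in columns
toy.scores <- matrix(c(0.9, 0.1, 0.0, 0.6), nrow = 2,
    dimnames = list(c("cellA","cellB"), c("Tcell_UCell","Myeloid_UCell")))

# One row per (cell, signature) pair, similar to reshape2::melt on a matrix
long <- as.data.frame(as.table(toy.scores), stringsAsFactors = FALSE)
colnames(long) <- c("Cell","Signature","UCell_score")
long

# Median score per signature
med <- tapply(long$UCell_score, long$Signature, median)
med
```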

Pre-calculating gene rankings

The time- and memory-demanding step in UCell is the calculation of gene rankings for each individual cell. If we plan to experiment with signatures, editing them or adding new cell subtypes, it is possible to pre-calculate the gene rankings once and for all and then apply new signatures over these pre-calculated ranks. Run the StoreRankings_UCell function to pre-calculate gene rankings over a dataset:

set.seed(123)
ranks <- StoreRankings_UCell(exp.mat)
ranks[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##           bcHTNA1 bcHNVA2 bcALZN3 bcFWBP4 bcBJYE5
## 5S_rRNA         .       .       .       .       .
## 5_8S_rRNA       .       .       .       .       .
## 7SK             .       .       .       .       .
## A1BG            .       .       .       .       .
## A1BG-AS1        .       .       .       .       .

Then, we can apply our signature set, or any other new signature, to the pre-calculated ranks. The calculations will be considerably faster.

set.seed(123)
u.scores.2 <- ScoreSignatures_UCell(features=signatures,
                                    precalc.ranks = ranks)

melted <- reshape2::melt(u.scores.2)
colnames(melted) <- c("Cell","Signature","UCell_score")
p <- ggplot(melted, aes(x=Signature, y=UCell_score)) + 
    geom_violin(aes(fill=Signature), scale = "width") +
    geom_boxplot(width=0.1, outlier.size = 0) + 
    theme_bw() + theme(axis.text.x=element_blank())
p

new.signatures <- list(Mast.cell = c("TPSAB1","TPSB2","CPA3","MS4A2"),
                       Lymphoid = c("LCK"))

u.scores.3 <- ScoreSignatures_UCell(features=new.signatures,
                                    precalc.ranks = ranks)
melted <- reshape2::melt(u.scores.3)
colnames(melted) <- c("Cell","Signature","UCell_score")
p <- ggplot(melted, aes(x=Signature, y=UCell_score)) + 
    geom_violin(aes(fill=Signature), scale = "width") +
    geom_boxplot(width=0.1, outlier.size=0) + 
    theme_bw() + theme(axis.text.x=element_blank())
p
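
The reason pre-calculated ranks save time can be sketched in base R on a toy matrix: the per-cell ranking is computed once, and each additional signature only requires looking up and summing a few rows. The objects mat, toy.ranks and the helper score_from_ranks are hypothetical, and the scoring formula is an assumed U-statistic form rather than the package's internal code.

```r
# Toy expression matrix: 10 genes (rows) x 6 cells (columns)
set.seed(1)
mat <- matrix(rpois(60, lambda = 2), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("cell", 1:6)))

# Step 1 (the expensive part on real data): rank genes within each cell, once
toy.ranks <- apply(mat, 2, function(x) rank(-x, ties.method = "average"))

# Step 2 (cheap): score any number of signatures from the stored ranks
score_from_ranks <- function(ranks, genes, max_rank = 5) {
    # Cap ranks at max_rank + 1, then apply the assumed U-statistic formula
    r <- pmin(ranks[genes, , drop = FALSE], max_rank + 1)
    n <- length(genes)
    1 - (colSums(r) - n * (n + 1) / 2) / (n * max_rank)
}

s1 <- score_from_ranks(toy.ranks, c("gene1", "gene2", "gene3"))
s2 <- score_from_ranks(toy.ranks, c("gene7", "gene8"))
round(rbind(s1, s2), 3)
```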

Multi-core processing

If your machine has multi-core capabilities and enough RAM, running UCell in parallel can considerably speed up your analysis. The example below runs on a single core; you may modify this behavior by setting e.g. workers=4 to parallelize over 4 cores:

BPPARAM <- BiocParallel::MulticoreParam(workers=1)
u.scores <- ScoreSignatures_UCell(exp.mat,features=signatures,
                                  BPPARAM=BPPARAM)
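
MulticoreParam relies on process forking, which is not available on Windows. A hedged sketch for choosing a backend and a worker count is shown below; leaving one core free is an assumption to adjust to your machine, and the variable n.workers is hypothetical.

```r
# Leave one core free; fall back to 1 if the core count cannot be detected
n.workers <- max(1, parallel::detectCores() - 1, na.rm = TRUE)

# Build a BiocParallel backend only if the package is installed
if (requireNamespace("BiocParallel", quietly = TRUE)) {
    BPPARAM <- if (.Platform$OS.type == "windows") {
        BiocParallel::SnowParam(workers = n.workers)       # socket-based workers
    } else {
        BiocParallel::MulticoreParam(workers = n.workers)  # forked workers
    }
}
n.workers
```

The resulting BPPARAM object can then be passed to ScoreSignatures_UCell through its BPPARAM argument, as in the example above.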

Interacting with SingleCellExperiment or Seurat

SingleCellExperiment and Seurat are popular environments for single-cell analysis. The UCell package implements functions to interact directly with these pipelines, as described in dedicated demos available on the Bioc landing page.

Resources

Please report any issues at the UCell GitHub repository.

More demos available on the Bioc landing page and at the UCell demo repository.

If you find UCell useful, you may also check out the scGate package, which relies on UCell scores to automatically purify populations of interest based on gene signatures.

See also SignatuR for easy storing and retrieval of gene signatures.

References

  • Andreatta, M., Carmona, S. J. (2021) UCell: Robust and scalable single-cell gene signature scoring. Computational and Structural Biotechnology Journal.
  • Zilionis, R., Engblom, C., …, Klein, A. M. (2019) Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species. Immunity.

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] reshape2_1.4.4              scater_1.35.0              
##  [3] scuttle_1.17.0              patchwork_1.3.0            
##  [5] ggplot2_3.5.1               Seurat_5.1.0               
##  [7] SeuratObject_5.0.2          sp_2.1-4                   
##  [9] UCell_2.11.1                scRNAseq_2.20.0            
## [11] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [13] Biobase_2.67.0              GenomicRanges_1.59.1       
## [15] GenomeInfoDb_1.43.2         IRanges_2.41.1             
## [17] S4Vectors_0.45.2            BiocGenerics_0.53.3        
## [19] generics_0.1.3              MatrixGenerics_1.19.0      
## [21] matrixStats_1.4.1           BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] ProtGenerics_1.39.0      spatstat.sparse_3.1-0    bitops_1.0-9            
##   [4] httr_1.4.7               RColorBrewer_1.1-3       tools_4.4.2             
##   [7] sctransform_0.4.1        alabaster.base_1.7.2     utf8_1.2.4              
##  [10] R6_2.5.1                 HDF5Array_1.35.2         lazyeval_0.2.2          
##  [13] uwot_0.2.2               rhdf5filters_1.19.0      withr_3.0.2             
##  [16] gridExtra_2.3            progressr_0.15.1         cli_3.6.3               
##  [19] spatstat.explore_3.3-3   fastDummies_1.7.4        alabaster.se_1.7.0      
##  [22] labeling_0.4.3           sass_0.4.9               spatstat.data_3.1-4     
##  [25] ggridges_0.5.6           pbapply_1.7-2            Rsamtools_2.23.1        
##  [28] parallelly_1.39.0        RSQLite_2.3.8            BiocIO_1.17.1           
##  [31] ica_1.0-3                spatstat.random_3.3-2    dplyr_1.1.4             
##  [34] Matrix_1.7-1             ggbeeswarm_0.7.2         fansi_1.0.6             
##  [37] abind_1.4-8              lifecycle_1.0.4          yaml_2.3.10             
##  [40] rhdf5_2.51.0             SparseArray_1.7.2        BiocFileCache_2.15.0    
##  [43] Rtsne_0.17               grid_4.4.2               blob_1.2.4              
##  [46] promises_1.3.2           ExperimentHub_2.15.0     crayon_1.5.3            
##  [49] miniUI_0.1.1.1           lattice_0.22-6           beachmat_2.23.2         
##  [52] cowplot_1.1.3            GenomicFeatures_1.59.1   KEGGREST_1.47.0         
##  [55] sys_3.4.3                maketools_1.3.1          pillar_1.9.0            
##  [58] knitr_1.49               rjson_0.2.23             future.apply_1.11.3     
##  [61] codetools_0.2-20         leiden_0.4.3.1           glue_1.8.0              
##  [64] spatstat.univar_3.1-1    data.table_1.16.2        vctrs_0.6.5             
##  [67] png_0.1-8                gypsum_1.3.0             spam_2.11-0             
##  [70] gtable_0.3.6             cachem_1.1.0             xfun_0.49               
##  [73] S4Arrays_1.7.1           mime_0.12                survival_3.7-0          
##  [76] fitdistrplus_1.2-1       ROCR_1.0-11              nlme_3.1-166            
##  [79] bit64_4.5.2              alabaster.ranges_1.7.0   filelock_1.0.3          
##  [82] RcppAnnoy_0.0.22         bslib_0.8.0              irlba_2.3.5.1           
##  [85] vipor_0.4.7              KernSmooth_2.23-24       colorspace_2.1-1        
##  [88] DBI_1.2.3                tidyselect_1.2.1         bit_4.5.0               
##  [91] compiler_4.4.2           curl_6.0.1               httr2_1.0.7             
##  [94] BiocNeighbors_2.1.1      DelayedArray_0.33.2      plotly_4.10.4           
##  [97] rtracklayer_1.67.0       scales_1.3.0             lmtest_0.9-40           
## [100] rappdirs_0.3.3           stringr_1.5.1            digest_0.6.37           
## [103] goftest_1.2-3            spatstat.utils_3.1-1     alabaster.matrix_1.7.3  
## [106] rmarkdown_2.29           XVector_0.47.0           htmltools_0.5.8.1       
## [109] pkgconfig_2.0.3          dbplyr_2.5.0             fastmap_1.2.0           
## [112] ensembldb_2.31.0         rlang_1.1.4              htmlwidgets_1.6.4       
## [115] UCSC.utils_1.3.0         shiny_1.9.1              farver_2.1.2            
## [118] jquerylib_0.1.4          zoo_1.8-12               jsonlite_1.8.9          
## [121] BiocParallel_1.41.0      BiocSingular_1.23.0      RCurl_1.98-1.16         
## [124] magrittr_2.0.3           GenomeInfoDbData_1.2.13  dotCall64_1.2           
## [127] Rhdf5lib_1.29.0          munsell_0.5.1            Rcpp_1.0.13-1           
## [130] viridis_0.6.5            reticulate_1.40.0        stringi_1.8.4           
## [133] alabaster.schemas_1.7.0  zlibbioc_1.52.0          MASS_7.3-61             
## [136] AnnotationHub_3.15.0     plyr_1.8.9               parallel_4.4.2          
## [139] listenv_0.9.1            ggrepel_0.9.6            deldir_2.0-4            
## [142] Biostrings_2.75.1        splines_4.4.2            tensor_1.5              
## [145] igraph_2.1.1             spatstat.geom_3.3-4      RcppHNSW_0.6.0          
## [148] buildtools_1.0.0         ScaledMatrix_1.15.0      BiocVersion_3.21.1      
## [151] XML_3.99-0.17            evaluate_1.0.1           BiocManager_1.30.25     
## [154] httpuv_1.6.15            RANN_2.6.2               tidyr_1.3.1             
## [157] purrr_1.0.2              polyclip_1.10-7          future_1.34.0           
## [160] scattermore_1.2          alabaster.sce_1.7.0      rsvd_1.0.5              
## [163] xtable_1.8-4             restfulr_0.0.15          AnnotationFilter_1.31.0 
## [166] RSpectra_0.16-2          later_1.4.1              viridisLite_0.4.2       
## [169] tibble_3.2.1             beeswarm_0.4.0           memoise_2.0.1           
## [172] AnnotationDbi_1.69.0     GenomicAlignments_1.43.0 cluster_2.1.6           
## [175] globals_0.16.3