Droplet-based single-cell RNA sequencing (scRNA-seq) is a powerful
and widely-used approach for profiling genome-wide gene expression in
individual cells. Current commercial droplet-based technologies such as
10X Genomics utilize gel beads, each containing oligonucleotide indexes
made up of bead-specific barcodes combined with unique molecular
identifiers (UMIs) and oligo-dT tags to prime polyadenylated RNA. Single
cells of interest are combined with reagents in one channel of a
microfluidic chip, and gel beads in another, to form gel-beads in
emulsion, or GEMs. Oligonucleotide indexes bind polyadenylated RNA
within each GEM reaction vesicle before gel beads are dissolved
releasing the bound oligos into solution for reverse transcription. By
design, each resulting cDNA molecule contains a UMI and a GEM-specific
barcode that, ideally, tags mRNA from an individual cell, but this is
often not the case in practice. To distinguish true cells from
background barcodes in droplet-based single cell RNA-seq experiments, we
introduce CB2 and scCB2
, its corresponding
R package.
CB2 extends the EmptyDrops approach by introducing a
clustering step that groups similar barcodes and then conducts a
statistical test to identify groups with expression distributions that
vary from the background. While advantages are expected in many
settings, users will benefit from noting that CB2 does
not test for doublets or multiplets and, consequently, some of the high
count identifications may consist of two or more cells. Methods for
identifying multiplets may prove useful after applying
CB2. It is also important to note that any method for
distinguishing cells from background barcodes is technically correct in
identifying low-quality cells given that damaged cells exhibit
expression profiles that differ from the background. Specifically,
mitochondrial gene expression is often high in damaged cells. Such cells
are typically not of interest in downstream analysis and should
therefore be removed. The GetCellMat function in scCB2
can
be used toward this end.
Install from Bioconductor:
QuickCB2
is an all-in-one function to apply CB2 on 10x
Cell Ranger raw data and get a matrix of real cells identified by CB2
under default settings. By specifying AsSeurat = TRUE
, a
Seurat object is returned so that users can directly apply the Seurat
pipeline for downstream analyses.
Usage:
library(scCB2)
# If raw data has three separate files within one directory
# and you want to control FDR at the default 1%:
RealCell <- QuickCB2(dir = "/path/to/raw/data/directory")
# If raw data is in HDF5 format and
# you'd like a Seurat object under default FDR threshold:
RealCell_S <- QuickCB2(h5file = "/path/to/raw/data/HDF5",
AsSeurat = TRUE)
An example illustrating how it works and what the final output looks like can be found at the end of Detailed Steps.
The computational speed is related to the size and structure of input datasets. As a reference, scCB2 running on a medium-size dataset (around 30,000 genes and 8,000 cells after cell calling) using 6 cores takes less than 10 minutes.
For users who would like to save the CB2 output cell matrix to 10x
format (e.g. “matrix.mtx”, “barcodes.tsv” and “genes.tsv”), there are
existing packages to help. For example in package
DropletUtils
:
Currently, the most widely-used droplet-based protocol is 10x Chromium. Our package provides functions to directly read 10x Cell Ranger output files and generate a feature-by-barcode count matrix that may be read into R. Public 10x datasets can be found here.
Our package contains a small subset of 10x data,
mbrainSub
, corresponding to the first 50,000 barcodes of 1k
Brain Cells from an E18 Mouse.
We first generate 10x output files of mbrainSub
, then
read it using our built-in functions.
library(scCB2)
library(SummarizedExperiment)
data(mbrainSub)
data.dir <- file.path(tempdir(),"CB2_example")
DropletUtils::write10xCounts(data.dir,
mbrainSub,
version = "3")
list.files(data.dir)
## [1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"
For Cell Ranger version <3, the raw data from 10x Cell Ranger output contains “barcodes.tsv”, “genes.tsv” and “matrix.mtx”. For Cell Ranger version >=3, the output files are “barcodes.tsv.gz”, “features.tsv.gz” and “matrix.mtx.gz”. We now read these files back into R and compare with original data matrix.
## [1] TRUE
If raw data is not from the 10x Chromium pipeline, a user may manually create the feature-by-barcode count matrix with rows representing genes and columns representing barcodes. Gene and barcode IDs should be unique. The format of the count matrix can be either a sparse matrix or standard matrix.
The key parameter of CB2 as well as other similar methods is the background cutoff, which divides barcodes into two groups: (1) small barcodes that are most likely to be background empty droplets; (2) the rest barcodes that can be either background or cell, and remain to be tested. Those small barcodes will be used to estimate a background distribution, which guides the identification of cells from background. It is crucial to have an unbiased estimation of the background distribution.
By default, the background cutoff is set to be 100, meaning all barcodes with total UMI counts less or equal to 100 are used to estimate the background distribution. Empirically, this setting has worked well in most real world datasets. However, for datasets with special structures, or with unexpectedly lower or higher number of detected cells, it is recommended to re-evaluate the choice of background cutoff.
An appropriate background cutoff should be reasonably large to contain enough background information, but shouldn’t be too large to mistakenly put real cells into the background group. Based on empirical knowledge, we recommend a background cutoff which (1) puts more than 90% barcodes into background, or (2) puts more than 10% UMI counts into background. This guarantees us to have enough information for an unbiased estimation of the background cutoff. Starting from 100, the smallest cutoff satisfying either condition is the recommended cutoff.
## cutoff n_bg_bcs n_bg_counts prop_bg_bcs prop_bg_counts
## 1 100 49714 97013 0.99428 0.09659609
## 2 150 49791 106163 0.99582 0.10570677
## 3 200 49822 111577 0.99644 0.11109750
## 4 250 49846 116966 0.99692 0.11646334
## 5 300 49861 121121 0.99722 0.12060049
## 6 350 49870 124047 0.99740 0.12351391
## 7 400 49873 125155 0.99746 0.12461715
## 8 450 49876 126456 0.99752 0.12591256
## 9 500 49879 127901 0.99758 0.12735135
## [1] 100
In this example, the default background cutoff 100 is appropriate as it puts more than 90% barcodes into background as well as more than 10% UMI counts into background. In general, we recommend always checking the background cutoff.
The main function CB2FindCell
takes a raw count matrix
as input and returns real cells, test statistics, and p-values. Now we
apply CB2FindCell
on mbrainSub
, controlling
FDR at 0.01 level (Default), assuming all barcodes with total counts
less than or equal to 100 are background empty droplets (Default), using
2 cores parallel computation (Default: 2). For detailed information, see
?CB2FindCell
.
## Time difference of 49.54036 secs
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:273509] 62 469 538 663 758 903 1251 1266 1396 1532 ...
## ..@ p : int [1:171] 0 219 2503 4608 6288 6512 6796 7194 7335 7789 ...
## ..@ Dim : int [1:2] 27998 170
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:27998] "Xkr4" "Gm1992" "Gm37381" "Rp1" ...
## .. ..$ : chr [1:170] "AAACCTGGTGCTCTTC-1" "AAACGGGAGCCACGTC-1" "AAACGGGAGCGAGAAA-1" "AAACGGGCAGTTTACG-1" ...
## ..@ x : num [1:273509] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ factors : list()
## List of 4
## $ ClusterStat:'data.frame': 5 obs. of 4 variables:
## ..$ corr : num [1:5] 0.6715 0.6773 0.0509 0.38 0.5547
## ..$ count: num [1:5] 260 315 271 113 4679
## ..$ pval : num [1:5] 0.292707 0.102897 0.000999 0.006993 0.000999
## ..$ padj : num [1:5] 0.2927 0.1286 0.0025 0.0117 0.0025
## $ Cluster :List of 5
## ..$ : chr [1:85] "AACACGTAGGCAAAGA-1" "ACAGCCGCAAACGCGA-1" "AACACGTGTATAGGTA-1" "AACTCAGTCTCGGACG-1" ...
## ..$ : chr [1:41] "AACTTTCAGGCCCGTT-1" "AACTCTTTCGCCGTGA-1" "AACTTTCGTACCGTTA-1" "AACCATGGTTGCCTCT-1" ...
## ..$ : chr [1:3] "AAGGCAGGTCTTGCGG-1" "ACATCAGCACATTAGC-1" "AAGACCTAGACTGGGT-1"
## ..$ : chr [1:6] "AATCCAGAGATCTGAA-1" "AACCATGTCTCGCATC-1" "AACACGTTCTCTAGGA-1" "AACTCCCGTGCTTCTC-1" ...
## ..$ : chr [1:23] "ACCAGTAAGTGCGATG-1" "AAGTCTGAGGACAGCT-1" "ACATACGCAAGCCGTC-1" "AAGGCAGGTCTAGAGG-1" ...
## $ BarcodeStat:'data.frame': 209 obs. of 4 variables:
## ..$ logLH: num [1:209] -762 -770 -886 -454 -9992 ...
## ..$ count: num [1:209] 337 304 312 207 5717 ...
## ..$ pval : num [1:209] 0.9439 0.4237 0.001 1 0.0001 ...
## ..$ padj : num [1:209] 1 0.651781 0.00187 1 0.000202 ...
## $ background : Named num [1:11121] 8 3 2 0 2 2 0 1 3 1 ...
## ..- attr(*, "names")= chr [1:11121] "Mrpl15" "Lypla1" "Tcea1" "Rgs20" ...
If readers are not interested in the output testing information,
GetCellMat
can extract the real cell matrix directly from
CB2FindCell
output. It also provides a filtering option to
remove broken cells based on the proportion of mitochondrial gene
expressions. Now we apply GetCellMat
on CBOut
,
filtering out cells whose mitochondrial proportions are greater than
0.25 (Default: 1, No filtering).
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:272491] 62 469 538 663 758 903 1251 1266 1396 1532 ...
## ..@ p : int [1:168] 0 219 2503 4608 6288 6512 6796 7194 7335 7789 ...
## ..@ Dim : int [1:2] 27998 167
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:27998] "Xkr4" "Gm1992" "Gm37381" "Rp1" ...
## .. ..$ : chr [1:167] "AAACCTGGTGCTCTTC-1" "AAACGGGAGCCACGTC-1" "AAACGGGAGCGAGAAA-1" "AAACGGGCAGTTTACG-1" ...
## ..@ x : num [1:272491] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ factors : list()
After CB2
pre-processing, the real cell matrix is still
in matrix format, so it can be directly used in downstream statistical
analyses. For example, if we want to use the Seurat pipeline,
we can easily create a Seurat object using
## An object of class Seurat
## 27998 features across 167 samples within 1 assay
## Active assay: RNA (27998 features, 0 variable features)
## 1 layer present: counts
Under default parameters, we can directly use the all-in-one function
QuickCB2
to get the real cell matrix from 10x raw data.
## Time difference of 49.85754 secs
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:274054] 62 469 538 663 758 903 1251 1266 1396 1532 ...
## ..@ p : int [1:175] 0 219 2503 4608 6288 6512 6796 7194 7335 7789 ...
## ..@ Dim : int [1:2] 27998 174
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:27998] "Xkr4" "Gm1992" "Gm37381" "Rp1" ...
## .. ..$ : chr [1:174] "AAACCTGGTGCTCTTC-1" "AAACGGGAGCCACGTC-1" "AAACGGGAGCGAGAAA-1" "AAACGGGCAGTTTACG-1" ...
## ..@ x : num [1:274054] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ factors : list()
Now it’s ready for downstream analysis such as normalization and clustering. Example Seurat tutorial: https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.36.0 Biobase_2.67.0
## [3] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [5] IRanges_2.41.0 S4Vectors_0.44.0
## [7] BiocGenerics_0.53.0 MatrixGenerics_1.19.0
## [9] matrixStats_1.4.1 scCB2_1.17.0
## [11] knitr_1.48 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.1
## [3] later_1.3.2 tibble_3.2.1
## [5] R.oo_1.26.0 polyclip_1.10-7
## [7] fastDummies_1.7.4 lifecycle_1.0.4
## [9] edgeR_4.4.0 doParallel_1.0.17
## [11] globals_0.16.3 lattice_0.22-6
## [13] MASS_7.3-61 magrittr_2.0.3
## [15] limma_3.63.0 plotly_4.10.4
## [17] sass_0.4.9 rmarkdown_2.28
## [19] jquerylib_0.1.4 yaml_2.3.10
## [21] httpuv_1.6.15 Seurat_5.1.0
## [23] sctransform_0.4.1 spam_2.11-0
## [25] sp_2.1-4 spatstat.sparse_3.1-0
## [27] reticulate_1.39.0 cowplot_1.1.3
## [29] pbapply_1.7-2 buildtools_1.0.0
## [31] RColorBrewer_1.1-3 abind_1.4-8
## [33] zlibbioc_1.52.0 Rtsne_0.17
## [35] purrr_1.0.2 R.utils_2.12.3
## [37] GenomeInfoDbData_1.2.13 ggrepel_0.9.6
## [39] irlba_2.3.5.1 listenv_0.9.1
## [41] spatstat.utils_3.1-0 maketools_1.3.1
## [43] goftest_1.2-3 RSpectra_0.16-2
## [45] spatstat.random_3.3-2 dqrng_0.4.1
## [47] fitdistrplus_1.2-1 parallelly_1.38.0
## [49] DelayedMatrixStats_1.29.0 leiden_0.4.3.1
## [51] codetools_0.2-20 DropletUtils_1.27.0
## [53] DelayedArray_0.33.1 scuttle_1.16.0
## [55] tidyselect_1.2.1 UCSC.utils_1.2.0
## [57] farver_2.1.2 spatstat.explore_3.3-3
## [59] jsonlite_1.8.9 progressr_0.15.0
## [61] ggridges_0.5.6 survival_3.7-0
## [63] iterators_1.0.14 foreach_1.5.2
## [65] tools_4.4.1 ica_1.0-3
## [67] Rcpp_1.0.13 glue_1.8.0
## [69] gridExtra_2.3 SparseArray_1.6.0
## [71] xfun_0.48 dplyr_1.1.4
## [73] HDF5Array_1.35.0 BiocManager_1.30.25
## [75] fastmap_1.2.0 rhdf5filters_1.18.0
## [77] fansi_1.0.6 digest_0.6.37
## [79] R6_2.5.1 mime_0.12
## [81] colorspace_2.1-1 scattermore_1.2
## [83] tensor_1.5 spatstat.data_3.1-2
## [85] R.methodsS3_1.8.2 utf8_1.2.4
## [87] tidyr_1.3.1 generics_0.1.3
## [89] data.table_1.16.2 httr_1.4.7
## [91] htmlwidgets_1.6.4 S4Arrays_1.6.0
## [93] uwot_0.2.2 pkgconfig_2.0.3
## [95] gtable_0.3.6 lmtest_0.9-40
## [97] SingleCellExperiment_1.28.0 XVector_0.46.0
## [99] sys_3.4.3 htmltools_0.5.8.1
## [101] dotCall64_1.2 SeuratObject_5.0.2
## [103] scales_1.3.0 png_0.1-8
## [105] spatstat.univar_3.0-1 reshape2_1.4.4
## [107] nlme_3.1-166 cachem_1.1.0
## [109] zoo_1.8-12 rhdf5_2.50.0
## [111] stringr_1.5.1 KernSmooth_2.23-24
## [113] parallel_4.4.1 miniUI_0.1.1.1
## [115] pillar_1.9.0 grid_4.4.1
## [117] vctrs_0.6.5 RANN_2.6.2
## [119] promises_1.3.0 beachmat_2.23.0
## [121] xtable_1.8-4 cluster_2.1.6
## [123] evaluate_1.0.1 cli_3.6.3
## [125] locfit_1.5-9.10 compiler_4.4.1
## [127] rlang_1.1.4 crayon_1.5.3
## [129] future.apply_1.11.3 plyr_1.8.9
## [131] stringi_1.8.4 deldir_2.0-4
## [133] viridisLite_0.4.2 BiocParallel_1.41.0
## [135] munsell_0.5.1 lazyeval_0.2.2
## [137] spatstat.geom_3.3-3 Matrix_1.7-1
## [139] RcppHNSW_0.6.0 patchwork_1.3.0
## [141] sparseMatrixStats_1.18.0 future_1.34.0
## [143] ggplot2_3.5.1 Rhdf5lib_1.28.0
## [145] statmod_1.5.0 shiny_1.9.1
## [147] ROCR_1.0-11 igraph_2.1.1
## [149] bslib_0.8.0
Ni, Z., Chen, S., Brown, J., & Kendziorski, C. (2020). CB2 improves power of cell detection in droplet-based single-cell RNA sequencing data. Genome Biology, 21(1), 137. https://doi.org/10.1186/s13059-020-02054-8