Normalizing single-cell RNA-seq data

Introduction

scuttle provides various low-level utilities for single-cell RNA-seq data analysis, typically used at the start of the analysis workflows or within high-level functions in other packages. This vignette will discuss the use of scaling normalization for removing cell-specific biases. To demonstrate, we will obtain the classic Zeisel dataset from the scRNAseq package and apply some quick quality control to remove damaged cells.

library(scRNAseq)
sce <- ZeiselBrainData()

library(scuttle)
sce <- quickPerCellQC(sce, subsets=list(Mito=grep("mt-", rownames(sce))),
    sub.fields=c("subsets_Mito_percent", "altexps_ERCC_percent")) 

sce
## class: SingleCellExperiment 
## dim: 20006 2816 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01
##   1772066098_A12
## colData names(26): tissue group # ... high_altexps_ERCC_percent discard
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC

Note: A more comprehensive description of the use of scuttle functions (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.

Computing size factors

Scaling normalization involves dividing the counts for each cell by a cell-specific “size factor” to adjust for uninteresting differences in sequencing depth and capture efficiency. The librarySizeFactors() function provides a simple definition of the size factor for each cell, computed as the library size of each cell after scaling them to have a mean of 1 across all cells. This is fast but inaccurate in the presence of differential expression between cells that introduce composition biases.

summary(librarySizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1757  0.5680  0.8680  1.0000  1.2783  4.0839

The geometricSizeFactors() function instead computes the geometric mean within each cell. This is more robust to composition biases but is only accurate when the counts are large and there are few zeroes.

summary(geometricSizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8261  0.9104  0.9758  1.0000  1.0640  1.5189

The medianSizeFactors() function uses a DESeq2-esque approach based on the median ratio from an average pseudo-cell. Briefly, we assume that most genes are non-DE, such that any systematic fold difference in coverage (as defined by the median ratio) represents technical biases that must be removed. This is highly robust to composition biases but relies on sufficient sequencing coverage to obtain well-defined ratios.

summary(medianSizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    2816

All of these size factors can be stored in the SingleCellExperiment via the sizeFactors<-() setter function. Most downstream functions will pick these up automatically for any calculations that rely on size factors.

sizeFactors(sce) <- librarySizeFactors(sce)

Alternatively, functions like computeLibraryFactors() can automatically compute and attach the size factors to our SingleCellExperiment object.

sce <- computeLibraryFactors(sce)
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1757  0.5680  0.8680  1.0000  1.2783  4.0839

Pooling normalization

The computePooledFactors method implements the pooling strategy for scaling normalization. This uses an approach similar to medianSizeFactors() to remove composition biases, but pools cells together to overcome problems with discreteness at low counts. Per-cell factors are then obtained from the pools using a deconvolution strategy.

library(scran)
clusters <- quickCluster(sce)

sce <- computePooledFactors(sce, clusters=clusters)
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1340  0.4910  0.8281  1.0000  1.3181  4.4579

For larger data sets, a rough clustering should be performed prior to normalization. computePooledFactors() will then automatically apply normalization within each cluster first, before adjusting the scaling factors to be comparable across clusters. This reduces the risk of violating our assumptions (of a non-DE majority of genes) when many genes are DE between clusters in a heterogeneous population. In this case, we use the quickCluster() function from the scran package, but any clustering function can be used, for example:

sce <- computePooledFactors(sce, clusters=sce$level1class)
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1322  0.4797  0.8265  1.0000  1.3355  4.4953

We assume that quality control on the cells has already been performed prior to running this function. Low-quality cells with few expressed genes can often have negative size factor estimates.

Spike-in normalization

An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.

sce2 <- computeSpikeFactors(sce, "ERCC")
summary(sizeFactors(sce2))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.7746  0.9971  1.0000  1.1851  3.4923

The main practical difference from the other strategies is that spike-in normalization preserves differences in total RNA content between cells, whereas computePooledFactors and other non-DE methods do not. This can be important in certain applications where changes in total RNA content are associated with a biological phenomenon of interest.

Computing normalized expression matrices

Regardless of which size factor calculation we pick, the calculation of normalized expression values simply involves dividing each count by the size factor for the cell. This eliminates the cell-specific scaling effect for valid comparisons between cells in downstream analyses. The simplest approach to computing these values is to use the logNormCounts() function:

sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts"    "logcounts"

This computes log2-transformed normalized expression values by adding a constant pseudo-count and log-transforming. The resulting values can be roughly interpreted on the same scale as log-transformed counts and are stored in "logcounts". This is the most common expression measure for downstream analyses as differences between values can be treated as log-fold changes. For example, Euclidean distances between cells are analogous to the average log-fold change across genes.

Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay. Here, we use the normalizeCounts() function to perform some custom normalization with random size factors.

assay(sce, "normed") <- normalizeCounts(sce, log=FALSE,
    size.factors=runif(ncol(sce)), pseudo.count=1.5)

scuttle can also calculate counts-per-million using the aptly-named calculateCPM() function. The output is most appropriately stored as an assay named "cpm" in the assays of the SingleCellExperiment object. Related functions include calculateTPM() and calculateFPKM(), which do pretty much as advertised.

assay(sce, "cpm") <- calculateCPM(sce)

Session information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scran_1.34.0                ensembldb_2.31.0           
##  [3] AnnotationFilter_1.31.0     GenomicFeatures_1.59.0     
##  [5] AnnotationDbi_1.69.0        scuttle_1.17.0             
##  [7] scRNAseq_2.19.1             SingleCellExperiment_1.28.0
##  [9] SummarizedExperiment_1.36.0 Biobase_2.67.0             
## [11] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
## [13] IRanges_2.41.0              S4Vectors_0.44.0           
## [15] BiocGenerics_0.53.0         MatrixGenerics_1.19.0      
## [17] matrixStats_1.4.1           BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] sys_3.4.3                jsonlite_1.8.9           magrittr_2.0.3          
##   [4] gypsum_1.3.0             rmarkdown_2.28           BiocIO_1.17.0           
##   [7] zlibbioc_1.52.0          vctrs_0.6.5              memoise_2.0.1           
##  [10] Rsamtools_2.22.0         RCurl_1.98-1.16          htmltools_0.5.8.1       
##  [13] S4Arrays_1.6.0           AnnotationHub_3.15.0     curl_5.2.3              
##  [16] BiocNeighbors_2.1.0      Rhdf5lib_1.28.0          SparseArray_1.6.0       
##  [19] rhdf5_2.50.0             sass_0.4.9               alabaster.base_1.7.0    
##  [22] bslib_0.8.0              alabaster.sce_1.7.0      httr2_1.0.5             
##  [25] cachem_1.1.0             buildtools_1.0.0         GenomicAlignments_1.43.0
##  [28] igraph_2.1.1             mime_0.12                lifecycle_1.0.4         
##  [31] pkgconfig_2.0.3          rsvd_1.0.5               Matrix_1.7-1            
##  [34] R6_2.5.1                 fastmap_1.2.0            GenomeInfoDbData_1.2.13 
##  [37] digest_0.6.37            dqrng_0.4.1              irlba_2.3.5.1           
##  [40] ExperimentHub_2.15.0     RSQLite_2.3.7            beachmat_2.23.0         
##  [43] filelock_1.0.3           fansi_1.0.6              httr_1.4.7              
##  [46] abind_1.4-8              compiler_4.4.1           bit64_4.5.2             
##  [49] withr_3.0.2              BiocParallel_1.41.0      DBI_1.2.3               
##  [52] HDF5Array_1.35.0         alabaster.ranges_1.7.0   alabaster.schemas_1.7.0 
##  [55] rappdirs_0.3.3           DelayedArray_0.33.1      rjson_0.2.23            
##  [58] bluster_1.17.0           tools_4.4.1              glue_1.8.0              
##  [61] restfulr_0.0.15          rhdf5filters_1.18.0      grid_4.4.1              
##  [64] cluster_2.1.6            generics_0.1.3           metapod_1.14.0          
##  [67] BiocSingular_1.23.0      ScaledMatrix_1.14.0      utf8_1.2.4              
##  [70] XVector_0.46.0           BiocVersion_3.21.1       pillar_1.9.0            
##  [73] limma_3.63.0             dplyr_1.1.4              BiocFileCache_2.15.0    
##  [76] lattice_0.22-6           rtracklayer_1.66.0       bit_4.5.0               
##  [79] tidyselect_1.2.1         locfit_1.5-9.10          maketools_1.3.1         
##  [82] Biostrings_2.75.0        knitr_1.48               ProtGenerics_1.38.0     
##  [85] edgeR_4.4.0              xfun_0.48                statmod_1.5.0           
##  [88] UCSC.utils_1.2.0         lazyeval_0.2.2           yaml_2.3.10             
##  [91] evaluate_1.0.1           codetools_0.2-20         tibble_3.2.1            
##  [94] alabaster.matrix_1.7.0   BiocManager_1.30.25      cli_3.6.3               
##  [97] jquerylib_0.1.4          Rcpp_1.0.13              dbplyr_2.5.0            
## [100] png_0.1-8                XML_3.99-0.17            parallel_4.4.1          
## [103] blob_1.2.4               sparseMatrixStats_1.18.0 bitops_1.0-9            
## [106] alabaster.se_1.7.0       purrr_1.0.2              crayon_1.5.3            
## [109] rlang_1.1.4              KEGGREST_1.47.0