scuttle provides various low-level utilities for single-cell RNA-seq data analysis, typically used at the start of the analysis workflows or within high-level functions in other packages. This vignette will discuss the use of scaling normalization for removing cell-specific biases. To demonstrate, we will obtain the classic Zeisel dataset from the scRNAseq package and apply some quick quality control to remove damaged cells.
library(scRNAseq)
sce <- ZeiselBrainData()
library(scuttle)
sce <- quickPerCellQC(sce, subsets=list(Mito=grep("mt-", rownames(sce))),
sub.fields=c("subsets_Mito_percent", "altexps_ERCC_percent"))
sce
## class: SingleCellExperiment
## dim: 20006 2816
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01
## 1772066098_A12
## colData names(26): tissue group # ... high_altexps_ERCC_percent discard
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
Note: A more comprehensive description of the use of scuttle functions (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
Scaling normalization involves dividing the counts for each cell by a
cell-specific “size factor” to adjust for uninteresting differences in
sequencing depth and capture efficiency. The
librarySizeFactors()
function provides a simple definition
of the size factor for each cell, computed as the library size of each
cell after scaling them to have a mean of 1 across all cells. This is
fast but inaccurate in the presence of differential expression between
cells that introduce composition biases.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1757 0.5680 0.8680 1.0000 1.2783 4.0839
The geometricSizeFactors()
function instead computes the
geometric mean within each cell. This is more robust to composition
biases but is only accurate when the counts are large and there are few
zeroes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8261 0.9104 0.9758 1.0000 1.0640 1.5189
The medianSizeFactors()
function uses a DESeq2-esque
approach based on the median ratio from an average pseudo-cell. Briefly,
we assume that most genes are non-DE, such that any systematic fold
difference in coverage (as defined by the median ratio) represents
technical biases that must be removed. This is highly robust to
composition biases but relies on sufficient sequencing coverage to
obtain well-defined ratios.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 2816
All of these size factors can be stored in the
SingleCellExperiment
via the
sizeFactors<-()
setter function. Most downstream
functions will pick these up automatically for any calculations that
rely on size factors.
Alternatively, functions like computeLibraryFactors()
can automatically compute and attach the size factors to our
SingleCellExperiment
object.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1757 0.5680 0.8680 1.0000 1.2783 4.0839
The computePooledFactors
method implements the pooling strategy for
scaling normalization. This uses an approach similar to
medianSizeFactors()
to remove composition biases, but pools
cells together to overcome problems with discreteness at low counts.
Per-cell factors are then obtained from the pools using a deconvolution
strategy.
library(scran)
clusters <- quickCluster(sce)
sce <- computePooledFactors(sce, clusters=clusters)
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1340 0.4910 0.8281 1.0000 1.3181 4.4579
For larger data sets, a rough clustering should be performed prior to
normalization. computePooledFactors()
will then
automatically apply normalization within each cluster first, before
adjusting the scaling factors to be comparable across clusters. This
reduces the risk of violating our assumptions (of a non-DE majority of
genes) when many genes are DE between clusters in a heterogeneous
population. In this case, we use the quickCluster()
function from the scran
package, but any clustering function can be used, for example:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1322 0.4797 0.8265 1.0000 1.3355 4.4953
We assume that quality control on the cells has already been performed prior to running this function. Low-quality cells with few expressed genes can often have negative size factor estimates.
An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.7746 0.9971 1.0000 1.1851 3.4923
The main practical difference from the other strategies is that
spike-in normalization preserves differences in total RNA content
between cells, whereas computePooledFactors
and other
non-DE methods do not. This can be important in certain applications
where changes in total RNA content are associated with a biological
phenomenon of interest.
Regardless of which size factor calculation we pick, the calculation
of normalized expression values simply involves dividing each count by
the size factor for the cell. This eliminates the cell-specific scaling
effect for valid comparisons between cells in downstream analyses. The
simplest approach to computing these values is to use the
logNormCounts()
function:
## [1] "counts" "logcounts"
This computes log2-transformed normalized expression
values by adding a constant pseudo-count and log-transforming. The
resulting values can be roughly interpreted on the same scale as
log-transformed counts and are stored in "logcounts"
. This
is the most common expression measure for downstream analyses as
differences between values can be treated as log-fold changes. For
example, Euclidean distances between cells are analogous to the average
log-fold change across genes.
Of course, users can construct any arbitrary matrix of the same
dimensions as the count matrix and store it as an assay. Here, we use
the normalizeCounts()
function to perform some custom
normalization with random size factors.
assay(sce, "normed") <- normalizeCounts(sce, log=FALSE,
size.factors=runif(ncol(sce)), pseudo.count=1.5)
scuttle
can also calculate counts-per-million using the aptly-named
calculateCPM()
function. The output is most appropriately
stored as an assay named "cpm"
in the assays of the
SingleCellExperiment
object. Related functions include
calculateTPM()
and calculateFPKM()
, which do
pretty much as advertised.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scran_1.34.0 ensembldb_2.31.0
## [3] AnnotationFilter_1.31.0 GenomicFeatures_1.59.0
## [5] AnnotationDbi_1.69.0 scuttle_1.17.0
## [7] scRNAseq_2.19.1 SingleCellExperiment_1.28.0
## [9] SummarizedExperiment_1.36.0 Biobase_2.67.0
## [11] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [13] IRanges_2.41.0 S4Vectors_0.44.0
## [15] BiocGenerics_0.53.0 MatrixGenerics_1.19.0
## [17] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9 magrittr_2.0.3
## [4] gypsum_1.3.0 rmarkdown_2.28 BiocIO_1.17.0
## [7] zlibbioc_1.52.0 vctrs_0.6.5 memoise_2.0.1
## [10] Rsamtools_2.22.0 RCurl_1.98-1.16 htmltools_0.5.8.1
## [13] S4Arrays_1.6.0 AnnotationHub_3.15.0 curl_5.2.3
## [16] BiocNeighbors_2.1.0 Rhdf5lib_1.28.0 SparseArray_1.6.0
## [19] rhdf5_2.50.0 sass_0.4.9 alabaster.base_1.7.0
## [22] bslib_0.8.0 alabaster.sce_1.7.0 httr2_1.0.5
## [25] cachem_1.1.0 buildtools_1.0.0 GenomicAlignments_1.43.0
## [28] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
## [31] pkgconfig_2.0.3 rsvd_1.0.5 Matrix_1.7-1
## [34] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [37] digest_0.6.37 dqrng_0.4.1 irlba_2.3.5.1
## [40] ExperimentHub_2.15.0 RSQLite_2.3.7 beachmat_2.23.0
## [43] filelock_1.0.3 fansi_1.0.6 httr_1.4.7
## [46] abind_1.4-8 compiler_4.4.1 bit64_4.5.2
## [49] withr_3.0.2 BiocParallel_1.41.0 DBI_1.2.3
## [52] HDF5Array_1.35.0 alabaster.ranges_1.7.0 alabaster.schemas_1.7.0
## [55] rappdirs_0.3.3 DelayedArray_0.33.1 rjson_0.2.23
## [58] bluster_1.17.0 tools_4.4.1 glue_1.8.0
## [61] restfulr_0.0.15 rhdf5filters_1.18.0 grid_4.4.1
## [64] cluster_2.1.6 generics_0.1.3 metapod_1.14.0
## [67] BiocSingular_1.23.0 ScaledMatrix_1.14.0 utf8_1.2.4
## [70] XVector_0.46.0 BiocVersion_3.21.1 pillar_1.9.0
## [73] limma_3.63.0 dplyr_1.1.4 BiocFileCache_2.15.0
## [76] lattice_0.22-6 rtracklayer_1.66.0 bit_4.5.0
## [79] tidyselect_1.2.1 locfit_1.5-9.10 maketools_1.3.1
## [82] Biostrings_2.75.0 knitr_1.48 ProtGenerics_1.38.0
## [85] edgeR_4.4.0 xfun_0.48 statmod_1.5.0
## [88] UCSC.utils_1.2.0 lazyeval_0.2.2 yaml_2.3.10
## [91] evaluate_1.0.1 codetools_0.2-20 tibble_3.2.1
## [94] alabaster.matrix_1.7.0 BiocManager_1.30.25 cli_3.6.3
## [97] jquerylib_0.1.4 Rcpp_1.0.13 dbplyr_2.5.0
## [100] png_0.1-8 XML_3.99-0.17 parallel_4.4.1
## [103] blob_1.2.4 sparseMatrixStats_1.18.0 bitops_1.0-9
## [106] alabaster.se_1.7.0 purrr_1.0.2 crayon_1.5.3
## [109] rlang_1.1.4 KEGGREST_1.47.0