scuttle provides various low-level utilities for single-cell RNA-seq data analysis, typically used at the start of the analysis workflows or within high-level functions in other packages. This vignette will discuss one of the earliest steps, namely, quality control (QC) to remove damaged cells.
To demonstrate, we will obtain the classic Zeisel dataset from the
scRNAseq
package. In this case, the dataset is provided as a
SingleCellExperiment
object. However, most scuttle
functions can also be used with raw expression matrices or instances of
the more general SummarizedExperiment
class.
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
Note: A more comprehensive description of the use of scuttle functions (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
The perCellQCMetrics()
function computes a variety of
basic cell-level metrics, including sum
, total number of
counts for the cell (i.e., the library size); and detected
,
the number of features for the cell that have counts above the detection
limit (default of zero). Low values for either metric are indicative of
poor cDNA capture.
library(scuttle)
is.mito <- grep("mt-", rownames(sce))
per.cell <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
summary(per.cell$sum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2574 8130 12913 14954 19284 63505
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 785 2484 3656 3777 4929 8167
In addition, we can compute subsets_X_percent
,
percentage of counts that come from the feature control set named
X
; and altexps_Y_percent
, the percentage of
all counts that come from an alternative feature set Y
.
Here, X
contains the mitochondrial genes and Y
contains the set of spike-ins. High proportions for either metric are
indicative of cell damage.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.992 6.653 7.956 10.290 56.955
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.912 14.186 20.699 23.199 29.950 76.881
It is often convenient to store this in the colData()
of
our SingleCellExperiment
object for future reference. One
can either do so manually:
… or just use the addPerCellQCMetrics()
function, which
does this automatically:
## [1] "tissue" "group #"
## [3] "total mRNA mol" "well"
## [5] "sex" "age"
## [7] "diameter" "level1class"
## [9] "level2class" "sum"
## [11] "detected" "subsets_Mito_sum"
## [13] "subsets_Mito_detected" "subsets_Mito_percent"
## [15] "altexps_repeat_sum" "altexps_repeat_detected"
## [17] "altexps_repeat_percent" "altexps_ERCC_sum"
## [19] "altexps_ERCC_detected" "altexps_ERCC_percent"
## [21] "total" "sum"
## [23] "detected" "subsets_Mito_sum"
## [25] "subsets_Mito_detected" "subsets_Mito_percent"
## [27] "altexps_repeat_sum" "altexps_repeat_detected"
## [29] "altexps_repeat_percent" "altexps_ERCC_sum"
## [31] "altexps_ERCC_detected" "altexps_ERCC_percent"
## [33] "total"
We identify low-quality cells by setting a threshold on each of these
metrics using the isOutlier()
function. This defines the
threshold at a certain number of median absolute deviations (MADs) away
from the median; values beyond this threshold are considered outliers
and can be filtered out, assuming that they correspond to low-quality
cells. Here, we define small outliers (using type="lower"
)
for the log-total counts at 3 MADs from the median.
## Mode FALSE
## logical 3005
Note that the attributes of the isOutlier()
output
contain the thresholds, if this is of interest.
## lower higher
## 1928.56 Inf
Advanced users can set batch=
to compute outliers within
each batch, avoiding inflated MADs due to batch effects.
low.total.batched <- isOutlier(per.cell$sum, type="lower", log=TRUE, batch=sce$tissue)
summary(low.total.batched)
## Mode FALSE TRUE
## logical 2996 9
In cases where entire batches contain a majority of low-quality
cells, we can set subset=
to only use high-quality batches
for computing the thresholds. Those thresholds are then extrapolated
back to the low-quality batches for some more stringent QC.
We could manually apply isOutlier()
to all of our
metrics, but it is easier to use perCellQCFilters()
to do
this for us. This identifies low-quality cells as those that are low
outliers for the log-total count, low outliers for the log-number of
detected features, or high outliers for the percentage of counts in
specified gene sets (e.g., mitochondrial genes, spike-in transcripts).
The discard
column contains the final call for whether a
particular cell should be discarded.
# An example with just mitochondrial filters.
qc.stats <- perCellQCFilters(per.cell, sub.fields="subsets_Mito_percent")
colSums(as.matrix(qc.stats))
## low_lib_size low_n_features high_subsets_Mito_percent
## 0 3 128
## discard
## 131
# Another example with mitochondrial + spike-in filters.
qc.stats2 <- perCellQCFilters(per.cell,
sub.fields=c("subsets_Mito_percent", "altexps_ERCC_percent"))
colSums(as.matrix(qc.stats2))
## low_lib_size low_n_features high_subsets_Mito_percent
## 0 3 128
## high_altexps_ERCC_percent discard
## 65 189
For typical scRNA-seq applications, quickPerCellQC()
will wrap the perCellQCMetrics()
and
perCellQCFilters()
calls. This returns a filtered
SingleCellExperiment
that can be immediately used in
downstream analyses.
filtered <- quickPerCellQC(sce, subsets=list(Mito=is.mito), sub.fields="subsets_Mito_percent")
filtered
## class: SingleCellExperiment
## dim: 20006 2874
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(2874): 1772071015_C02 1772071017_G12 ... 1772063068_D01
## 1772066098_A12
## colData names(37): tissue group # ... high_subsets_Mito_percent discard
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
The outlier-based approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, differences in total RNA content between cell types. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the book for more details.
Some basic feature-level statistics are computed by the
perFeatureQCMetrics()
function. This includes
mean
, the mean count of the gene/feature across all cells;
detected
, the percentage of cells with non-zero counts for
each gene; subsets_Y_ratio
, ratio of mean counts between
the cell control set named Y and all cells.
# Pretending that the first 10 cells are empty wells, for demonstration.
per.feat <- perFeatureQCMetrics(sce, subsets=list(Empty=1:10))
summary(per.feat$mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007 0.0097 0.1338 0.7475 0.5763 732.1524
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03328 0.76539 9.01830 18.87800 31.24792 99.96672
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.601 1.872 2.016 300.500
A more refined calculation of the average is provided by the
calculateAverage()
function, which adjusts the counts by
the relative library size (or size factor) prior to taking the mean.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002 0.0109 0.1443 0.7475 0.5674 850.6880
These metrics tend to be more useful for informing the analyst about the overall behavior of the experiment, rather than being explicitly used to filter out genes. For example, one would hope that the most abundant genes are the “usual suspects”, e.g., ribosomal proteins, actin, histones.
Users may wish to filter out very lowly or non-expressed genes to reduce the size of the dataset prior to downstream processing. However, be sure to publish the unfiltered count matrix! It is very difficult to re-use a count matrix where the features are filtered; information from the filtered features cannot be recovered, complicating integration with other datasets.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scran_1.35.0 ensembldb_2.31.0
## [3] AnnotationFilter_1.31.0 GenomicFeatures_1.59.1
## [5] AnnotationDbi_1.69.0 scuttle_1.17.0
## [7] scRNAseq_2.20.0 SingleCellExperiment_1.29.1
## [9] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [11] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [13] IRanges_2.41.1 S4Vectors_0.45.2
## [15] BiocGenerics_0.53.3 generics_0.1.3
## [17] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [19] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9 magrittr_2.0.3
## [4] gypsum_1.3.0 rmarkdown_2.29 BiocIO_1.17.1
## [7] zlibbioc_1.52.0 vctrs_0.6.5 memoise_2.0.1
## [10] Rsamtools_2.23.1 RCurl_1.98-1.16 htmltools_0.5.8.1
## [13] S4Arrays_1.7.1 AnnotationHub_3.15.0 curl_6.0.1
## [16] BiocNeighbors_2.1.1 Rhdf5lib_1.29.0 SparseArray_1.7.2
## [19] rhdf5_2.51.0 sass_0.4.9 alabaster.base_1.7.2
## [22] bslib_0.8.0 alabaster.sce_1.7.0 httr2_1.0.7
## [25] cachem_1.1.0 buildtools_1.0.0 GenomicAlignments_1.43.0
## [28] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
## [31] pkgconfig_2.0.3 rsvd_1.0.5 Matrix_1.7-1
## [34] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [37] digest_0.6.37 dqrng_0.4.1 irlba_2.3.5.1
## [40] ExperimentHub_2.15.0 RSQLite_2.3.8 beachmat_2.23.2
## [43] filelock_1.0.3 fansi_1.0.6 httr_1.4.7
## [46] abind_1.4-8 compiler_4.4.2 bit64_4.5.2
## [49] withr_3.0.2 BiocParallel_1.41.0 DBI_1.2.3
## [52] HDF5Array_1.35.2 alabaster.ranges_1.7.0 alabaster.schemas_1.7.0
## [55] rappdirs_0.3.3 DelayedArray_0.33.2 rjson_0.2.23
## [58] bluster_1.17.0 tools_4.4.2 glue_1.8.0
## [61] restfulr_0.0.15 rhdf5filters_1.19.0 grid_4.4.2
## [64] cluster_2.1.6 metapod_1.15.0 BiocSingular_1.23.0
## [67] ScaledMatrix_1.15.0 utf8_1.2.4 XVector_0.47.0
## [70] BiocVersion_3.21.1 pillar_1.9.0 limma_3.63.2
## [73] dplyr_1.1.4 BiocFileCache_2.15.0 lattice_0.22-6
## [76] rtracklayer_1.67.0 bit_4.5.0 tidyselect_1.2.1
## [79] locfit_1.5-9.10 maketools_1.3.1 Biostrings_2.75.1
## [82] knitr_1.49 ProtGenerics_1.39.0 edgeR_4.5.0
## [85] xfun_0.49 statmod_1.5.0 UCSC.utils_1.3.0
## [88] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.1
## [91] codetools_0.2-20 tibble_3.2.1 alabaster.matrix_1.7.2
## [94] BiocManager_1.30.25 cli_3.6.3 jquerylib_0.1.4
## [97] Rcpp_1.0.13-1 dbplyr_2.5.0 png_0.1-8
## [100] XML_3.99-0.17 parallel_4.4.2 blob_1.2.4
## [103] sparseMatrixStats_1.19.0 bitops_1.0-9 alabaster.se_1.7.0
## [106] purrr_1.0.2 crayon_1.5.3 rlang_1.1.4
## [109] KEGGREST_1.47.0