scater
provides tools for visualization of single-cell transcriptomic data. It
is based on the SingleCellExperiment
class (from the SingleCellExperiment
package), and thus is interoperable with many other Bioconductor
packages such as scran,
scuttle
and iSEE. To
demonstrate the use of the various scater
functions, we will load in the classic Zeisel dataset:
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
Note: A more comprehensive description of the use of scater (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
Quality control to remove damaged cells and poorly sequenced libraries is a common step in single-cell analysis pipelines. We will use some utilities from the scuttle package (conveniently loaded for us when we load scater) to compute the usual quality control metrics for this dataset.
library(scater)
example_sce <- addPerCellQC(example_sce,
subsets=list(Mito=grep("mt-", rownames(example_sce))))
Metadata variables can be plotted against each other using the
plotColData()
function, as shown below. We expect to see an
increasing number of detected genes with increasing total count. Each
point represents a cell that is coloured according to its tissue of
origin.
Here, we have plotted the total count for each cell against the mitochondrial content. Well-behaved cells should have a large number of expressed features and and a low percentage of expression from feature controls. High percentage expression from feature controls and few expressed features are indicative of blank and failed cells. For some variety, we have faceted by the tissue of origin.
plotColData(example_sce, x = "sum", y="subsets_Mito_percent",
other_fields="tissue") + facet_wrap(~tissue)
On the gene level, we can look at a plot that shows the top 50 (by default) most-expressed features. Each row in the plot below corresponds to a gene; each bar corresponds to the expression of a gene in a single cell; and the circle indicates the median expression of each gene, with which genes are sorted. We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A few spike-in transcripts may also be present here, though if all of the spike-ins are in the top 50, it suggests that too much spike-in RNA was added. A large number of pseudo-genes or predicted genes may indicate problems with alignment.
Variable-level metrics are computed by the
getVarianceExplained()
function (after normalization, see
below). This calculates the percentage of variance of each gene’s
expression that is explained by each variable in the
colData
of the SingleCellExperiment
object. We
can then use this to determine which experimental factors are
contributing most to the variance in expression. This is useful for
diagnosing batch effects or to quickly verify that a treatment has an
effect.
# Computing variance explained on the log-counts,
# so that the statistics reflect changes in relative expression.
example_sce <- logNormCounts(example_sce)
vars <- getVarianceExplained(example_sce,
variables=c("tissue", "total mRNA mol", "sex", "age"))
head(vars)
## tissue total mRNA mol sex age
## Tspan12 0.02207262 0.003532711 0.146344996 0.11264880
## Tshz1 3.36083014 0.003641463 0.001079356 0.32263515
## Fnbp1l 0.43597185 2.617958068 0.003071630 0.46387469
## Adamts15 0.54233888 0.004958976 0.030821621 0.02045916
## Cldn12 0.03506751 1.389789185 0.008341408 0.02098451
## Rxfp1 0.18559637 0.839362667 0.055646799 0.02012965
The plotExpression()
function makes it easy to plot
expression values for a subset of genes or features. This can be
particularly useful for further examination of features identified from
differential expression testing, pseudotime analysis or other analyses.
By default, it uses expression values in the "logcounts"
assay, but this can be changed through the exprs_values
argument.
Setting x
will determine the covariate to be shown on
the x-axis. This can be a field in the column metadata or the name of a
feature (to obtain the expression profile across cells). Categorical
covariates will yield grouped violins as shown above, with one panel per
feature. By comparison, continuous covariates will generate a scatter
plot in each panel, as shown below.
The points can also be coloured, shaped or resized by the column metadata or expression values.
Directly plotting the gene expression without any x
or
other visual parameters will generate a set of grouped violin plots,
coloured in an aesthetically pleasing manner.
Principal components analysis (PCA) is often performed to denoise and
compact the data prior to downstream analyses. The runPCA()
function provides a simple wrapper around the base machinery in BiocSingular
for computing PCs from log-transformed expression values. This stores
the output in the reducedDims
slot of the
SingleCellExperiment
, which can be easily retrieved (along
with the percentage of variance explained by each PC) as shown
below:
## num [1:3005, 1:50] -15.4 -15 -17.2 -16.9 -18.4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
## ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "varExplained")= num [1:50] 478 112.8 51.1 47 33.2 ...
## - attr(*, "percentVar")= num [1:50] 39.72 9.38 4.25 3.9 2.76 ...
## - attr(*, "rotation")= num [1:500, 1:50] 0.1471 0.1146 0.1084 0.0958 0.0953 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:500] "Plp1" "Trf" "Mal" "Apod" ...
## .. ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...
By default, runPCA()
uses the top 500 genes with the
highest variances to compute the first PCs. This can be tuned by
specifying subset_row
to pass in an explicit set of genes
of interest, and by using ncomponents
to determine the
number of components to compute. The name
argument can also
be used to change the name of the result in the reducedDims
slot.
example_sce <- runPCA(example_sce, name="PCA2",
subset_row=rownames(example_sce)[1:1000],
ncomponents=25)
str(reducedDim(example_sce, "PCA2"))
## num [1:3005, 1:25] -20 -21 -23 -23.7 -21.5 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
## ..$ : chr [1:25] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "varExplained")= num [1:25] 153 35 23.5 11.6 10.8 ...
## - attr(*, "percentVar")= num [1:25] 22.3 5.11 3.42 1.69 1.58 ...
## - attr(*, "rotation")= num [1:1000, 1:25] -2.24e-04 -8.52e-05 -2.43e-02 -5.92e-04 -6.35e-03 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:1000] "Tspan12" "Tshz1" "Fnbp1l" "Adamts15" ...
## .. ..$ : chr [1:25] "PC1" "PC2" "PC3" "PC4" ...
t-distributed stochastic
neighbour embedding (t-SNE) is
widely used for visualizing complex single-cell data sets. The same
procedure described for PCA plots can be applied to generate t-SNE plots using
plotTSNE
, with coordinates obtained using
runTSNE
via the Rtsne package.
We strongly recommend generating plots with different random seeds and
perplexity values, to ensure that any conclusions are robus t to
different visualizations.
# Perplexity of 10 just chosen here arbitrarily.
set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=10)
head(reducedDim(example_sce, "TSNE"))
## TSNE1 TSNE2
## 1772071015_C02 -50.37385 -11.891827
## 1772071017_G12 -53.30176 -10.136091
## 1772071017_A05 -49.98892 -11.869588
## 1772071014_B06 -52.88776 -9.943132
## 1772067065_H06 -52.44460 -9.566839
## 1772071017_E02 -53.42319 -9.636750
A more common pattern involves using the pre-existing PCA results as input into the t-SNE algorithm. This is useful as it improves speed by using a low-rank approximation of the expression matrix; and reduces random noise, by focusing on the major factors of variation. The code below uses the first 10 dimensions of the previously computed PCA result to perform the t-SNE.
set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=50,
dimred="PCA", n_dimred=10)
head(reducedDim(example_sce, "TSNE"))
## TSNE1 TSNE2
## 1772071015_C02 -37.68430 0.7891615
## 1772071017_G12 -35.54978 0.1157152
## 1772071017_A05 -38.06437 1.2511114
## 1772071014_B06 -38.09322 -0.3843783
## 1772067065_H06 -40.40679 -1.1441570
## 1772071017_E02 -38.31896 2.4123524
The same can be done for uniform manifold with approximate projection
(UMAP) via the runUMAP()
function, itself based on the
uwot
package.
## UMAP1 UMAP2
## 1772071015_C02 13.39035 -3.603727
## 1772071017_G12 13.43977 -3.512446
## 1772071017_A05 13.36352 -3.616711
## 1772071014_B06 13.39616 -3.558855
## 1772067065_H06 13.44166 -3.505640
## 1772071017_E02 13.43508 -3.514699
Any dimensionality reduction result can be plotted using the
plotReducedDim
function. Here, each point represents a cell
and is coloured according to its cell type label.
Some result types have dedicated wrappers for convenience, e.g.,
plotTSNE()
for t-SNE results:
The dedicated plotPCA()
function also adds the
percentage of variance explained to the axes:
Multiple components can be plotted in a series of pairwise plots. When more than two components are plotted, the diagonal boxes in the scatter plot matrix show the density for each component.
example_sce <- runPCA(example_sce, ncomponents=20)
plotPCA(example_sce, ncomponents = 4, colour_by = "level1class")
We separate the execution of these functions from the plotting to enable the same coordinates to be re-used across multiple plots. This avoids repeatedly recomputing those coordinates just to change an aesthetic across plots.
We provide some helper functions to easily convert from a
SingleCellExperiment
object to a data.frame
for use in, say, ggplot2
functions. This allows users to create highly customized plots that are
not covered by the existing scater
functions. The ggcells()
function will intelligently
retrieve fields from the colData()
, assays()
,
altExps()
or reducedDims()
to create a single
data.frame
for immediate use. In the example below, we
create boxplots of Snap25 expression stratified by cell type
and tissue of origin:
Reduced dimension results are easily pulled in to create customized
equivalents of the plotReducedDim()
output. In this
example, we create a t-SNE
plot faceted by tissue and coloured by Snap25 expression:
ggcells(example_sce, mapping=aes(x=TSNE.1, y=TSNE.2, colour=Snap25)) +
geom_point() +
stat_density_2d() +
facet_wrap(~tissue) +
scale_colour_distiller(direction=1)
It is also straightforward to examine the relationship between the size factors on the normalized gene expression:
Similar operations can be performed on the features using the
ggfeatures()
function, which will retrieve values either
from rowData
or from the columns of the
assays
. Here, we examine the relationship between the
feature type and the expression within a given cell; note the renaming
of the otherwise syntactically invalid cell name.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.35.0 ggplot2_3.5.1
## [3] scuttle_1.17.0 scRNAseq_2.20.0
## [5] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [7] Biobase_2.67.0 GenomicRanges_1.59.1
## [9] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [13] generics_0.1.3 MatrixGenerics_1.19.0
## [15] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_1.8.9
## [4] magrittr_2.0.3 ggbeeswarm_0.7.2 GenomicFeatures_1.59.1
## [7] gypsum_1.3.0 farver_2.1.2 rmarkdown_2.29
## [10] BiocIO_1.17.1 zlibbioc_1.52.0 vctrs_0.6.5
## [13] memoise_2.0.1 Rsamtools_2.23.1 RCurl_1.98-1.16
## [16] htmltools_0.5.8.1 S4Arrays_1.7.1 AnnotationHub_3.15.0
## [19] curl_6.0.1 BiocNeighbors_2.1.2 Rhdf5lib_1.29.0
## [22] SparseArray_1.7.2 rhdf5_2.51.1 sass_0.4.9
## [25] alabaster.base_1.7.2 bslib_0.8.0 alabaster.sce_1.7.0
## [28] httr2_1.0.7 cachem_1.1.0 buildtools_1.0.0
## [31] GenomicAlignments_1.43.0 lifecycle_1.0.4 pkgconfig_2.0.3
## [34] rsvd_1.0.5 Matrix_1.7-1 R6_2.5.1
## [37] fastmap_1.2.0 GenomeInfoDbData_1.2.13 digest_0.6.37
## [40] colorspace_2.1-1 AnnotationDbi_1.69.0 irlba_2.3.5.1
## [43] ExperimentHub_2.15.0 RSQLite_2.3.9 beachmat_2.23.5
## [46] labeling_0.4.3 filelock_1.0.3 mgcv_1.9-1
## [49] httr_1.4.7 abind_1.4-8 compiler_4.4.2
## [52] bit64_4.5.2 withr_3.0.2 BiocParallel_1.41.0
## [55] viridis_0.6.5 DBI_1.2.3 HDF5Array_1.35.2
## [58] alabaster.ranges_1.7.0 alabaster.schemas_1.7.0 MASS_7.3-61
## [61] rappdirs_0.3.3 DelayedArray_0.33.3 rjson_0.2.23
## [64] tools_4.4.2 vipor_0.4.7 beeswarm_0.4.0
## [67] glue_1.8.0 restfulr_0.0.15 nlme_3.1-166
## [70] rhdf5filters_1.19.0 grid_4.4.2 Rtsne_0.17
## [73] isoband_0.2.7 gtable_0.3.6 ensembldb_2.31.0
## [76] BiocSingular_1.23.0 ScaledMatrix_1.15.0 XVector_0.47.1
## [79] ggrepel_0.9.6 BiocVersion_3.21.1 pillar_1.10.0
## [82] splines_4.4.2 dplyr_1.1.4 BiocFileCache_2.15.0
## [85] lattice_0.22-6 FNN_1.1.4.1 rtracklayer_1.67.0
## [88] bit_4.5.0.1 tidyselect_1.2.1 maketools_1.3.1
## [91] Biostrings_2.75.3 knitr_1.49 gridExtra_2.3
## [94] ProtGenerics_1.39.1 xfun_0.49 UCSC.utils_1.3.0
## [97] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.1
## [100] codetools_0.2-20 tibble_3.2.1 alabaster.matrix_1.7.4
## [103] BiocManager_1.30.25 cli_3.6.3 uwot_0.2.2
## [106] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.13-1
## [109] dbplyr_2.5.0 png_0.1-8 XML_3.99-0.17
## [112] parallel_4.4.2 blob_1.2.4 AnnotationFilter_1.31.0
## [115] sparseMatrixStats_1.19.0 bitops_1.0-9 alabaster.se_1.7.0
## [118] viridisLite_0.4.2 scales_1.3.0 crayon_1.5.3
## [121] rlang_1.1.4 cowplot_1.1.3 KEGGREST_1.47.0