The SingleCellExperiment
class is a lightweight
Bioconductor container for storing and manipulating single-cell genomics
data. It extends the RangedSummarizedExperiment
class and
follows similar conventions, i.e., rows should represent features
(genes, transcripts, genomic regions) and columns should represent
cells. It provides methods for storing dimensionality reduction results
and data for alternative feature sets (e.g., synthetic spike-in
transcripts, antibody-derived tags). It is the central data structure
for Bioconductor single-cell packages like scater and
scran.
SingleCellExperiment
objects can be created via the
constructor of the same name. For example, if we have a count matrix in
counts
, we can simply call:
library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
sce <- SingleCellExperiment(counts)
sce
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): ''
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
In practice, it is often more useful to name the assay by passing in a named list:
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
It is similarly easy to set the column and row metadata by passing values to the appropriate arguments. We will not go into much detail here as most of this is covered by the SummarizedExperiment documentation, but to give an example:
pretend.cell.labels <- sample(letters, ncol(counts), replace=TRUE)
pretend.gene.lengths <- sample(10000, nrow(counts))
sce <- SingleCellExperiment(list(counts=counts),
colData=DataFrame(label=pretend.cell.labels),
rowData=DataFrame(length=pretend.gene.lengths),
metadata=list(study="GSE111111")
)
sce
## class: SingleCellExperiment
## dim: 10 10
## metadata(1): study
## assays(1): counts
## rownames: NULL
## rowData names(1): length
## colnames: NULL
## colData names(1): label
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Alternatively, we can construct a SingleCellExperiment
by coercing an existing (Ranged)SummarizedExperiment
object:
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Any operation that can be applied to a
RangedSummarizedExperiment
is also applicable to any
instance of a SingleCellExperiment
. This includes access to
assay data via assay()
, column metadata with
colData()
, and so on. Again, without going into too much
detail:
## [1] 10 10
## [1] "label"
## [1] "length"
To demonstrate the use of the class in the rest of the vignette, we will use the Allen data set from the scRNAseq package.
## class: SingleCellExperiment
## dim: 20816 379
## metadata(2): SuppInfo which_qc
## assays(1): tophat_counts
## rownames(20816): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(1): ERCC
We compute log-transformed normalized expression values from the
count matrix. (We note that many of these steps can be performed as
one-liners from the scater
package, but we will show them here in full to demonstrate the
capabilities of the SingleCellExperiment
class.)
counts <- assay(sce, "tophat_counts")
libsizes <- colSums(counts)
size.factors <- libsizes/mean(libsizes)
logcounts(sce) <- log2(t(t(counts)/size.factors) + 1)
assayNames(sce)
## [1] "tophat_counts" "logcounts"
We obtain the PCA and t-SNE representations of the data and add them
to the object with the reducedDims()<-
method.
Alternatively, we can representations one at a time with the
reducedDim()<-
method (note the missing
s
).
pca_data <- prcomp(t(logcounts(sce)), rank=50)
library(Rtsne)
set.seed(5252)
tsne_data <- Rtsne(pca_data$x[,1:50], pca = FALSE)
reducedDims(sce) <- list(PCA=pca_data$x, TSNE=tsne_data$Y)
sce
## class: SingleCellExperiment
## dim: 20816 379
## metadata(2): SuppInfo which_qc
## assays(2): tophat_counts logcounts
## rownames(20816): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
## reducedDimNames(2): PCA TSNE
## mainExpName: endogenous
## altExpNames(1): ERCC
The coordinates for all representations can be retrieved from a
SingleCellExperiment
en masse with
reducedDims()
or one at a time by name/index with
reducedDim()
. Each row of the coordinate matrix is assumed
to correspond to a cell while each column represents a dimension.
## List of length 2
## names(2): PCA TSNE
## [1] "PCA" "TSNE"
## PC1 PC2
## SRR2140028 -70.23670 33.78880
## SRR2140022 -41.00366 49.76633
## SRR2140055 -133.70763 7.68870
## SRR2140083 -36.26099 113.18806
## SRR2139991 -100.86439 15.94740
## SRR2140067 -97.71210 35.77881
## [,1] [,2]
## SRR2140028 -7.518029 -0.7060692
## SRR2140022 -7.437483 1.0151941
## SRR2140055 -1.137628 1.1183457
## SRR2140083 -4.981877 7.5763461
## SRR2139991 -5.900807 1.1559410
## SRR2140067 -7.534082 -3.8238655
Any subsetting by column of sce_sub
will also lead to
subsetting of the dimensionality reduction results by cell. This is
convenient as it ensures our low-dimensional results are always
synchronized with the gene expression data.
## [1] 379 50
## [1] 10 50
In the SingleCellExperiment
, users can assign arbitrary
names to entries of assays
. To assist interoperability
between packages, we provide some suggestions for what the names should
be for particular types of data:
counts
: Raw count data, e.g., number of reads or
transcripts for a particular gene.normcounts
: Normalized values on the same scale as the
original counts. For example, counts divided by cell-specific size
factors that are centred at unity.logcounts
: Log-transformed counts or count-like values.
In most cases, this will be defined as log-transformed
normcounts
, e.g., using log base 2 and a pseudo-count of
1.cpm
: Counts-per-million. This is the read count for
each gene in each cell, divided by the library size of each cell in
millions.tpm
: Transcripts-per-million. This is the number of
transcripts for each gene in each cell, divided by the total number of
transcripts in that cell (in millions).Each of these suggested names has an appropriate getter/setter method
for convenient manipulation of the SingleCellExperiment
.
For example, we can take the (very specifically named)
tophat_counts
name and assign it to counts
instead:
## class: SingleCellExperiment
## dim: 20816 379
## metadata(2): SuppInfo which_qc
## assays(3): tophat_counts logcounts counts
## rownames(20816): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
## reducedDimNames(2): PCA TSNE
## mainExpName: endogenous
## altExpNames(1): ERCC
## [1] 20816 379
This means that functions expecting count data can simply call
counts()
without worrying about package-specific naming
conventions.
Many scRNA-seq experiments contain sequencing data for multiple feature types beyond the endogenous genes:
Such features can be stored inside the
SingleCellExperiment
via the concept of “alternative
Experiments”. These are nested SummarizedExperiment
instances that are guaranteed to have the same number and ordering of
columns as the main SingleCellExperiment
itself. Data for
endogenous genes and other features can thus be kept separate - which is
often desirable as they need to be processed differently - while still
retaining the synchronization of operations on a single object.
To illustrate, consider the case of the spike-in transcripts in the
Allen data. The altExp()
method returns a self-contained
SingleCellExperiment
instance containing only the spike-in
transcripts.
## class: SingleCellExperiment
## dim: 92 379
## metadata(0):
## assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
## rownames(92): ERCC-00002 ERCC-00003 ... ERCC-00170 ERCC-00171
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Each alternative Experiment can have a different set of assays from
the main SingleCellExperiment
. This is useful in cases
where the other feature types must be normalized or transformed
differently. Similarly, the alternative Experiments can have different
rowData()
from the main object.
## DataFrame with 92 rows and 1 column
## concentration
## <numeric>
## ERCC-00002 0.034970
## ERCC-00003 0.979286
## ERCC-00004 0.171510
## ERCC-00009 0.577606
## ERCC-00012 0.965111
## ... ...
## ERCC-00164 0.2945227
## ERCC-00165 0.4449455
## ERCC-00168 0.2821027
## ERCC-00170 0.0817114
## ERCC-00171 0.4178510
## DataFrame with 20816 rows and 0 columns
We provide the splitAltExps()
utility to easily split a
SingleCellExperiment
into new alternative Experiments. For
example, if we wanted to split the RIKEN transcripts into a separate
Experiment - say, to ensure that they are not used in downstream
analyses without explicitly throwing out the data - we would do:
is.riken <- grepl("^[0-9]", rownames(sce))
sce <- splitAltExps(sce, ifelse(is.riken, "RIKEN", "gene"))
altExpNames(sce)
## [1] "ERCC" "RIKEN"
Alternatively, if we want to swap the main and alternative
Experiments - perhaps because the RIKEN transcripts were more
interesting than expected, and we want to perform our analyses on them -
we can simply use swapAltExp()
to switch the RIKEN
alternative Experiment with the gene expression data:
## class: SingleCellExperiment
## dim: 611 379
## metadata(2): SuppInfo which_qc
## assays(3): tophat_counts logcounts counts
## rownames(611): 0610007P14Rik 0610009B22Rik ... 9930111J21Rik1
## 9930111J21Rik2
## rowData names(0):
## colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
## colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
## reducedDimNames(2): PCA TSNE
## mainExpName: RIKEN
## altExpNames(2): ERCC original
A common procedure in single-cell analyses is to identify
relationships between pairs of cells, e.g., to construct a
nearest-neighbor graph or to mark putative physical interactions between
cells. We can capture this information in the
SingleCellExperiment
class with the colPairs()
functionality. To demonstrate, say we have 100 relationships between the
cells in sce
, characterized by some distance measure:
cell1 <- sample(ncol(sce), 100, replace=TRUE)
cell2 <- sample(ncol(sce), 100, replace=TRUE)
distance <- runif(100)
We store this in the SingleCellExperiment
as a
SelfHits
object using the value
metadata field
to hold our data. This is easily extracted as a SelfHits
or, for logical or numeric data, as a sparse matrix from Matrix.
colPair(sce, "relationships") <- SelfHits(
cell1, cell2, nnode=ncol(sce), value=distance)
colPair(sce, "relationships")
## SelfHits object with 100 hits and 1 metadata column:
## from to | value
## <integer> <integer> | <numeric>
## [1] 2 224 | 0.3155391
## [2] 2 311 | 0.0554668
## [3] 12 200 | 0.6148850
## [4] 16 304 | 0.0174597
## [5] 21 244 | 0.9801941
## ... ... ... . ...
## [96] 349 369 | 0.662352
## [97] 349 372 | 0.839914
## [98] 352 10 | 0.540362
## [99] 360 172 | 0.392147
## [100] 360 237 | 0.626605
## -------
## nnode: 379
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
A particularly useful feature is that the indices of the interacting
cells are automatically remapped when sce
is subsetted.
This ensures that the pairings are always synchronized with the
identities of the cells in sce
.
## SelfHits object with 46 hits and 1 metadata column:
## from to | value
## <integer> <integer> | <numeric>
## [1] 7 157 | 0.748443
## [2] 11 2 | 0.260945
## [3] 18 203 | 0.857025
## [4] 19 217 | 0.967591
## [5] 32 249 | 0.608411
## ... ... ... . ...
## [42] 236 3 | 0.17149734
## [43] 240 245 | 0.98800819
## [44] 245 52 | 0.00164622
## [45] 248 208 | 0.53441707
## [46] 250 65 | 0.31503674
## -------
## nnode: 251
Similar functionality is available for pairings between rows via the
rowPairs()
family of functions, which is potentially useful
for representing coexpression or regulatory networks.
The SingleCellExperiment
class provides the
sizeFactors()
getter and setter methods, to set and
retrieve size factors from the colData
of the object. Each
size factor represents the scaling factor applied to a cell to normalize
expression values prior to downstream comparisons, e.g., to remove the
effects of differences in library size and other cell-specific biases.
These methods are primarily intended for programmatic use in functions
implementing normalization methods, but users can also directly call
this to inspect or define the size factors for their analysis.
# Making up some size factors and storing them:
sizeFactors(sce) <- 2^rnorm(ncol(sce))
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1336 0.6761 1.0082 1.2365 1.5034 9.8244
## NULL
The colLabels()
getter and setters methods allow
applications to set and retrieve cell labels from the
colData
. These labels can be derived from cluster
annotations, assigned by classification algorithms, etc. and are often
used in downstream visualization and analyses. While labels can be
stored in any colData
field, the colLabels()
methods aim to provide some informal standardization to a default
location that downstream functions can search first when attempting to
retrieve annotations.
# Making up some labels and storing them:
colLabels(sce) <- sample(letters, ncol(sce), replace=TRUE)
table(colLabels(sce))
##
## a b c d e f g h i j k l m n o p q r s t u v w x y z
## 18 14 15 13 16 12 19 14 10 12 14 13 16 10 25 22 15 8 12 19 14 19 12 17 11 9
## NULL
In a similar vein, we provide the rowSubset()
function
for users to set and get row subsets from the rowData
. This
will store any vector of gene identities (e.g., row names, integer
indices, logical vector) in the SingleCellExperiment
object
for retrieval and use by downstream functions. Users can then easily
pack multiple feature sets into the same object for synchronized
manipulation.
# Packs integer or character vectors into the rowData:
rowSubset(sce, "my gene set 1") <- 1:10
which(rowSubset(sce, "my gene set 1"))
## [1] 1 2 3 4 5 6 7 8 9 10
## NULL
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Rtsne_0.17 scRNAseq_2.20.0
## [3] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [5] Biobase_2.67.0 GenomicRanges_1.59.0
## [7] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [9] S4Vectors_0.45.0 BiocGenerics_0.53.1
## [11] generics_0.1.3 MatrixGenerics_1.19.0
## [13] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 alabaster.se_1.7.0 dplyr_1.1.4
## [4] blob_1.2.4 filelock_1.0.3 Biostrings_2.75.1
## [7] bitops_1.0-9 lazyeval_0.2.2 fastmap_1.2.0
## [10] RCurl_1.98-1.16 BiocFileCache_2.15.0 GenomicAlignments_1.43.0
## [13] XML_3.99-0.17 digest_0.6.37 lifecycle_1.0.4
## [16] ProtGenerics_1.39.0 alabaster.matrix_1.7.0 KEGGREST_1.47.0
## [19] alabaster.base_1.7.0 RSQLite_2.3.7 magrittr_2.0.3
## [22] compiler_4.4.2 rlang_1.1.4 sass_0.4.9
## [25] tools_4.4.2 utf8_1.2.4 yaml_2.3.10
## [28] rtracklayer_1.67.0 knitr_1.49 S4Arrays_1.7.1
## [31] bit_4.5.0 curl_6.0.0 DelayedArray_0.33.1
## [34] abind_1.4-8 BiocParallel_1.41.0 HDF5Array_1.35.1
## [37] gypsum_1.3.0 sys_3.4.3 grid_4.4.2
## [40] fansi_1.0.6 ExperimentHub_2.15.0 Rhdf5lib_1.29.0
## [43] cli_3.6.3 rmarkdown_2.29 crayon_1.5.3
## [46] httr_1.4.7 rjson_0.2.23 rhdf5_2.51.0
## [49] DBI_1.2.3 cachem_1.1.0 zlibbioc_1.52.0
## [52] parallel_4.4.2 AnnotationDbi_1.69.0 AnnotationFilter_1.31.0
## [55] BiocManager_1.30.25 XVector_0.47.0 restfulr_0.0.15
## [58] alabaster.schemas_1.7.0 vctrs_0.6.5 Matrix_1.7-1
## [61] jsonlite_1.8.9 bit64_4.5.2 ensembldb_2.31.0
## [64] alabaster.ranges_1.7.0 GenomicFeatures_1.59.1 maketools_1.3.1
## [67] jquerylib_0.1.4 glue_1.8.0 codetools_0.2-20
## [70] BiocVersion_3.21.1 BiocIO_1.17.0 UCSC.utils_1.3.0
## [73] tibble_3.2.1 pillar_1.9.0 rhdf5filters_1.19.0
## [76] rappdirs_0.3.3 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [79] httr2_1.0.6 R6_2.5.1 dbplyr_2.5.0
## [82] alabaster.sce_1.7.0 evaluate_1.0.1 lattice_0.22-6
## [85] AnnotationHub_3.15.0 png_0.1-8 Rsamtools_2.23.0
## [88] memoise_2.0.1 bslib_0.8.0 Rcpp_1.0.13-1
## [91] SparseArray_1.7.1 xfun_0.49 buildtools_1.0.0
## [94] pkgconfig_2.0.3