By design, the scope of this package is limited to defining the SingleCellExperiment class and some minimal getter and setter methods. For this reason, we leave it to developers of specialized packages to provide more advanced methods for the SingleCellExperiment class. If packages define their own data structure, it is their responsibility to provide coercion methods between their classes and SingleCellExperiment.
For developers, the use of SingleCellExperiment objects within package functions is mostly the same as the use of instances of the base SummarizedExperiment class. The only exceptions involve direct access to the internal fields of the SingleCellExperiment definition. Manipulation of these internal fields in other packages is possible but requires some caution, as we shall discuss below.
We use an internal storage mechanism to protect certain fields from direct manipulation by the user. This ensures that only a call to the provided setter methods can change the size factors. The same effect could be achieved by reserving a subset of columns (or column names) as “private” in colData() and rowData(), though this is not easily implemented.
The internal storage avoids situations where users or functions can silently overwrite these important metadata fields during manipulations of rowData or colData. Such overwrites can result in bugs that are difficult to track down, particularly in long workflows involving many functions. The internal storage also allows us to add new methods and metadata types to SingleCellExperiment without worrying about overwriting user-supplied metadata in existing objects.
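As a minimal sketch of this separation, the following replaces the public column metadata wholesale and checks that an internally stored value survives. The object name demo, the field name X and the batch column are made up for illustration.

```r
library(SingleCellExperiment)

# Small example object; the counts are arbitrary.
demo_counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
demo <- SingleCellExperiment(assays = list(counts = demo_counts))

# Store a hypothetical internal per-cell field named X.
x <- seq_len(ncol(demo))
int_colData(demo)$X <- x

# A user (or another function) replaces the public column metadata.
colData(demo) <- DataFrame(batch = rep("a", ncol(demo)))

# The internal field lives in a separate slot and is not touched.
identical(int_colData(demo)$X, x)
```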
Methods to get or set the internal fields are exported for use by developers of packages that depend on SingleCellExperiment. This allows dependent packages to store their own custom fields that are not meant to be directly accessible by the user. However, this requires some care to avoid conflicts between packages.
The concern is that packages A and B both define methods that get/set an internal field X in a SingleCellExperiment instance. Consider the following example object:
library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
sce <- SingleCellExperiment(assays = list(counts = counts))
sce
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Assume that we have functions that set an internal field X in packages A and B.
# Function in package A:
AsetX <- function(sce) {
    int_colData(sce)$X <- runif(ncol(sce))
    sce
}
# Function in package B:
BsetX <- function(sce) {
    int_colData(sce)$X <- sample(LETTERS, ncol(sce), replace=TRUE)
    sce
}
If both of these functions are called, one will clobber the output of the other. This may lead to nonsensical results in downstream procedures.
sce2 <- AsetX(sce)
int_colData(sce2)$X
## [1] 0.9710205 0.3077690 0.1867486 0.9234136 0.3155550 0.5379592 0.4742281
## [8] 0.1255479 0.1789071 0.1749591
sce2 <- BsetX(sce2)
int_colData(sce2)$X
## [1] "N" "H" "F" "M" "F" "Y" "U" "D" "H" "Q"
We recommend using nested DataFrames to store internal fields in the column-level metadata. The name of the nested element should be set to the package name, thus avoiding clashes between fields with the same name from different packages.
AsetX_better <- function(sce) {
    int_colData(sce)$A <- DataFrame(X=runif(ncol(sce)))
    sce
}
BsetX_better <- function(sce) {
    choice <- sample(LETTERS, ncol(sce), replace=TRUE)
    int_colData(sce)$B <- DataFrame(X=choice)
    sce
}
sce2 <- AsetX_better(sce)
sce2 <- BsetX_better(sce2)
int_colData(sce2)$A$X
## [1] 0.1859085 0.2144826 0.6465126 0.0306806 0.6046127 0.8663672 0.4212556
## [8] 0.1642679 0.3241464 0.4327663
int_colData(sce2)$B$X
## [1] "N" "Q" "U" "R" "Q" "Z" "B" "L" "N" "K"
The same approach can be applied to the row-level metadata, e.g., for some per-row field Y.
AsetY_better <- function(sce) {
    int_elementMetadata(sce)$A <- DataFrame(Y=runif(nrow(sce)))
    sce
}
BsetY_better <- function(sce) {
    choice <- sample(LETTERS, nrow(sce), replace=TRUE)
    int_elementMetadata(sce)$B <- DataFrame(Y=choice)
    sce
}
sce2 <- AsetY_better(sce)
sce2 <- BsetY_better(sce2)
int_elementMetadata(sce2)$A$Y
## [1] 0.65886425 0.11181128 0.16207320 0.15148012 0.02037788 0.32894890
## [7] 0.74549851 0.31833148 0.22943161 0.14364916
int_elementMetadata(sce2)$B$Y
## [1] "S" "U" "B" "I" "S" "Z" "Z" "A" "D" "E"
For the object-wide metadata, a nested list is usually sufficient.
AsetZ_better <- function(sce) {
    int_metadata(sce)$A <- list(Z = "Aaron")
    sce
}
BsetZ_better <- function(sce) {
    int_metadata(sce)$B <- list(Z = "Davide")
    sce
}
sce2 <- AsetZ_better(sce)
sce2 <- BsetZ_better(sce2)
int_metadata(sce2)$A$Z
## [1] "Aaron"
int_metadata(sce2)$B$Z
## [1] "Davide"
In this manner, both A and B can set their internal X, Y and Z without interfering with each other. Of course, this strategy assumes that packages do not have the same names as some of the built-in internal fields (which would be very unfortunate).
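One way a package could guard against such a clash is sketched below. The helper check_pkg_name() is hypothetical and not part of SingleCellExperiment; the set of built-in names is read from a freshly constructed object because it may differ between package versions.

```r
library(SingleCellExperiment)

# Names already used for internal column-level fields in a fresh object.
fresh <- SingleCellExperiment(
    assays = list(counts = matrix(0, nrow = 1, ncol = 1))
)
reserved <- colnames(int_colData(fresh))

# Hypothetical guard that a package could run in its unit tests.
check_pkg_name <- function(pkg, reserved) {
    if (pkg %in% reserved) {
        stop("package name '", pkg, "' clashes with a built-in internal field")
    }
    invisible(TRUE)
}

check_pkg_name("A", reserved)
```

The same check could be repeated with int_elementMetadata() and names(int_metadata()) for the row-level and object-wide fields.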
If your package accesses the internal fields of the SingleCellExperiment class, we suggest you contact us on GitHub. This will help us in planning changes to the internal organization of the class. It will also allow us to contact you about upcoming changes or to ask for feedback.
We are particularly interested in scenarios where multiple packages are defining internal fields with the same scientific meaning. In such cases, it may be valuable to provide getters and setters for this field in SingleCellExperiment directly. This reduces redundancy in the definitions across packages and promotes interoperability. For example, methods from one package can set the field, which can then be used by methods of another package.
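Until such a shared accessor exists, a package can wrap its nested field in its own getter and setter so that callers never touch int_colData() directly. The following is only a sketch: the names A_X and `A_X<-` are hypothetical, and the setter replaces the whole nested DataFrame for package A rather than updating a single column.

```r
library(SingleCellExperiment)

# Hypothetical getter for package A's per-cell field X.
A_X <- function(sce) {
    int_colData(sce)$A$X
}

# Hypothetical setter; note that it overwrites any other fields that
# package A may have stored in its nested DataFrame.
`A_X<-` <- function(sce, value) {
    int_colData(sce)$A <- DataFrame(X = value)
    sce
}

# Usage on a small example object.
acc_counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
acc_sce <- SingleCellExperiment(assays = list(counts = acc_counts))
v <- runif(ncol(acc_sce))
A_X(acc_sce) <- v
identical(A_X(acc_sce), v)
```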
Why use a SimpleList for reducedDims?
We use a SimpleList as the reducedDims slot to allow for multiple dimensionality reduction results. One can imagine that different dimensionality reduction techniques will be useful for different aspects of the analysis, e.g., t-SNE for visualization, PCA for pseudo-time inference. We see reducedDims as a similar slot to assays() in that multiple matrices can be stored, though the dimensionality reduction results need not have the same number of dimensions.
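A small sketch of this point, storing two results with different numbers of dimensions in the same object. The second matrix is random noise standing in for a t-SNE embedding, and the object and variable names are made up for illustration.

```r
library(SingleCellExperiment)

rd_counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
rd_sce <- SingleCellExperiment(assays = list(counts = rd_counts))

# PCA on the cells (rows of the transposed matrix), keeping 5 components.
pcs <- prcomp(t(rd_counts), rank. = 5)$x

# Placeholder 2-column matrix standing in for a t-SNE embedding.
fake_tsne <- matrix(rnorm(2 * ncol(rd_sce)), ncol = 2)

# Both results have one row per cell but different numbers of columns.
reducedDims(rd_sce) <- list(PCA = pcs, TSNE = fake_tsne)
reducedDimNames(rd_sce)
ncol(reducedDim(rd_sce, "PCA"))
ncol(reducedDim(rd_sce, "TSNE"))
```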
Why extend RangedSummarizedExperiment?
We decided to extend RangedSummarizedExperiment rather than SummarizedExperiment because for certain assays it will be essential to have rowRanges(). Even for RNA-seq, it is sometimes useful to have rowRanges() and other classes to define the genomic coordinates, e.g., DESeqDataSet in the DESeq2 package. An alternative would have been to have two classes, SingleCellExperiment and RangedSingleCellExperiment. However, this seems like an unnecessary duplication as having a class with default empty rowRanges seems good enough when one does not need rowRanges.
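The following sketch illustrates the default behaviour: an object built without coordinates is already a RangedSummarizedExperiment, and ranges can be attached later. The chromosome name and coordinates are invented for the example.

```r
library(SingleCellExperiment)

rr_counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
rr_sce <- SingleCellExperiment(assays = list(counts = rr_counts))

# No coordinates were supplied, yet the object is already "ranged",
# with one (empty) entry per row.
is(rr_sce, "RangedSummarizedExperiment")
length(rowRanges(rr_sce)) == nrow(rr_sce)

# Attach made-up coordinates, one range per row.
starts <- 100 * seq_len(nrow(rr_sce))
rowRanges(rr_sce) <- GRanges("chrA", IRanges(start = starts, width = 50))
width(rowRanges(rr_sce))
```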
Why not use MultiAssayExperiment?
Another approach to storing alternative Experiments would be to use a MultiAssayExperiment. We do not do so as the vast majority of scRNA-seq data analyses operate on the endogenous genes. Switching to a MultiAssayExperiment introduces an additional layer of indirection with no benefit in most cases. Indeed, the methods of this class are largely unnecessary when the alternative Experiments contain data for the same samples. By storing nested Experiments, we maintain the familiar SummarizedExperiment interface for better compatibility and ease of use.
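As a brief sketch of the nested approach, the example below attaches a second feature set for the same cells as an alternative Experiment. The name "Spikes" and the simulated counts are placeholders.

```r
library(SingleCellExperiment)

main_counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
ae_sce <- SingleCellExperiment(assays = list(counts = main_counts))

# Hypothetical counts for two extra features measured in the same 10 cells.
spike_counts <- matrix(rpois(20, lambda = 5), ncol = 10, nrow = 2)
altExp(ae_sce, "Spikes") <- SummarizedExperiment(
    assays = list(counts = spike_counts)
)

# The main experiment keeps its dimensions and its usual interface;
# the nested one is retrieved by name.
dim(ae_sce)
altExpNames(ae_sce)
dim(altExp(ae_sce, "Spikes"))
```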
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 GenomicRanges_1.59.0
## [5] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [7] S4Vectors_0.45.0 BiocGenerics_0.53.1
## [9] generics_0.1.3 MatrixGenerics_1.19.0
## [11] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 jsonlite_1.8.9 compiler_4.4.2
## [4] BiocManager_1.30.25 crayon_1.5.3 jquerylib_0.1.4
## [7] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [10] R6_2.5.1 XVector_0.47.0 S4Arrays_1.7.1
## [13] knitr_1.49 DelayedArray_0.33.1 maketools_1.3.1
## [16] GenomeInfoDbData_1.2.13 bslib_0.8.0 rlang_1.1.4
## [19] cachem_1.1.0 xfun_0.49 sass_0.4.9
## [22] sys_3.4.3 SparseArray_1.7.1 cli_3.6.3
## [25] zlibbioc_1.52.0 grid_4.4.2 digest_0.6.37
## [28] lifecycle_1.0.4 evaluate_1.0.1 buildtools_1.0.0
## [31] abind_1.4-8 rmarkdown_2.29 httr_1.4.7
## [34] tools_4.4.2 htmltools_0.5.8.1 UCSC.utils_1.3.0