Developing around the SingleCellExperiment class

Introduction

By design, the scope of this package is limited to defining the SingleCellExperiment class and some minimal getter and setter methods. We leave it to developers of specialized packages to provide more advanced methods for the SingleCellExperiment class. If a package defines its own data structure, it is that package's responsibility to provide coercion methods between its classes and SingleCellExperiment.
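As a minimal sketch of what such coercion methods might look like, the code below defines a toy class and registers conversions in both directions with `setAs()`. The `ToyCounts` class and its `counts` slot are hypothetical and exist only for illustration; a real package would need to decide how its own fields map onto assays and metadata.

```r
library(SingleCellExperiment)

# Hypothetical class standing in for a package-specific data structure.
setClass("ToyCounts", representation(counts = "matrix"))

# Coercion from the toy class to SingleCellExperiment.
setAs("ToyCounts", "SingleCellExperiment", function(from) {
    SingleCellExperiment(assays = list(counts = from@counts))
})

# Coercion back; only the "counts" assay is kept in this sketch.
setAs("SingleCellExperiment", "ToyCounts", function(from) {
    new("ToyCounts", counts = as.matrix(assay(from, "counts")))
})

toy <- new("ToyCounts", counts = matrix(rpois(20, lambda = 5), nrow = 4))
sce_from_toy <- as(toy, "SingleCellExperiment")
toy_again <- as(sce_from_toy, "ToyCounts")
```

Anything that has no counterpart in the other class (here, everything except the count matrix) is lost in the round trip, so it is worth documenting which fields a coercion preserves.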

For developers, the use of SingleCellExperiment objects within package functions is mostly the same as the use of instances of the base SummarizedExperiment class. The only exceptions involve direct access to the internal fields of the SingleCellExperiment definition. Manipulation of these internal fields in other packages is possible but requires some caution, as we shall discuss below.
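To illustrate, here is a small sketch of a function written purely against the SummarizedExperiment interface. The function name `library_sizes` is made up for this example; the point is only that the same code should accept either class.

```r
library(SingleCellExperiment)

# Hypothetical package function that relies only on the SummarizedExperiment API.
library_sizes <- function(se, assay_name = "counts") {
    stopifnot(is(se, "SummarizedExperiment"))
    colSums(assay(se, assay_name))
}

counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
se <- SummarizedExperiment(assays = list(counts = counts))
sce <- SingleCellExperiment(assays = list(counts = counts))

# Both calls go through the same code path.
sizes_se <- library_sizes(se)
sizes_sce <- library_sizes(sce)
```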

Using the internal fields

Rationale

We use an internal storage mechanism to protect certain fields from direct manipulation by the user. This ensures that only a call to the provided setter methods can change the size factors. The same effect could be achieved by reserving a subset of columns (or column names) as “private” in colData() and rowData(), though this is not easily implemented.

The internal storage avoids situations where users or functions can silently overwrite these important metadata fields during manipulations of rowData or colData. Such overwrites can result in bugs that are difficult to track down, particularly in long workflows involving many functions. The internal storage also allows us to add new methods and metadata types to SingleCellExperiment without worrying about overwriting user-supplied metadata in existing objects.
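The following sketch shows the kind of protection this provides. The field name `hidden` is arbitrary and chosen only for the example; the code stores a value internally, then replaces the user-visible column metadata and checks that the internal copy is untouched.

```r
library(SingleCellExperiment)

counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Store a value in the internal column metadata ("hidden" is a made-up name).
hidden <- runif(ncol(sce))
int_colData(sce)$hidden <- hidden

# A user then replaces the visible column metadata wholesale...
colData(sce) <- DataFrame(batch = rep(c("a", "b"), each = 5))

# ...and even reuses the same field name there.
colData(sce)$hidden <- rep("set by user", ncol(sce))

# The internal copy lives in a separate structure, so it is not affected.
still_intact <- identical(int_colData(sce)$hidden, hidden)
```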

Methods to get or set the internal fields are exported for use by developers of packages that depend on SingleCellExperiment. This allows dependent packages to store their own custom fields that are not meant to be directly accessible by the user. However, this requires some care to avoid conflicts between packages.

Conflicts between packages

The concern is that packages A and B both define methods that get/set an internal field X in a SingleCellExperiment instance. Consider the following example object:

library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
sce <- SingleCellExperiment(assays = list(counts = counts))
sce
## class: SingleCellExperiment 
## dim: 10 10 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Assume that we have functions that set an internal field X in packages A and B.

# Function in package A:
AsetX <- function(sce) {
    int_colData(sce)$X <- runif(ncol(sce))
    sce
}

# Function in package B:
BsetX <- function(sce) {
    int_colData(sce)$X <- sample(LETTERS, ncol(sce), replace=TRUE)
    sce
}

If both of these functions are called, one will clobber the output of the other. This may lead to nonsensical results in downstream procedures.

sce2 <- AsetX(sce)
int_colData(sce2)$X
##  [1] 0.9710205 0.3077690 0.1867486 0.9234136 0.3155550 0.5379592 0.4742281
##  [8] 0.1255479 0.1789071 0.1749591
sce2 <- BsetX(sce2)
int_colData(sce2)$X
##  [1] "N" "H" "F" "M" "F" "Y" "U" "D" "H" "Q"

Using “Inception-style” nesting

We recommend using nested DataFrames to store internal fields in the column-level metadata. The name of the nested element should be set to the package name, thus avoiding clashes between fields with the same name from different packages.

AsetX_better <- function(sce) {
    int_colData(sce)$A <- DataFrame(X=runif(ncol(sce)))
    sce
}

BsetX_better <- function(sce) {
    choice <- sample(LETTERS, ncol(sce), replace=TRUE)
    int_colData(sce)$B <- DataFrame(X=choice)
    sce
}

sce2 <- AsetX_better(sce)
sce2 <- BsetX_better(sce2)
int_colData(sce2)$A$X 
##  [1] 0.1859085 0.2144826 0.6465126 0.0306806 0.6046127 0.8663672 0.4212556
##  [8] 0.1642679 0.3241464 0.4327663
int_colData(sce2)$B$X 
##  [1] "N" "Q" "U" "R" "Q" "Z" "B" "L" "N" "K"

The same approach can be applied to the row-level metadata, e.g., for some per-row field Y.

AsetY_better <- function(sce) {
    int_elementMetadata(sce)$A <- DataFrame(Y=runif(nrow(sce)))
    sce
}

BsetY_better <- function(sce) {
    choice <- sample(LETTERS, nrow(sce), replace=TRUE)
    int_elementMetadata(sce)$B <- DataFrame(Y=choice)
    sce
}

sce2 <- AsetY_better(sce)
sce2 <- BsetY_better(sce2)
int_elementMetadata(sce2)$A$Y 
##  [1] 0.65886425 0.11181128 0.16207320 0.15148012 0.02037788 0.32894890
##  [7] 0.74549851 0.31833148 0.22943161 0.14364916
int_elementMetadata(sce2)$B$Y
##  [1] "S" "U" "B" "I" "S" "Z" "Z" "A" "D" "E"

For the object-wide metadata, a nested list is usually sufficient.

AsetZ_better <- function(sce) {
    int_metadata(sce)$A <- list(Z = "Aaron")
    sce
}

BsetZ_better <- function(sce) {
    int_metadata(sce)$B <- list(Z = "Davide")
    sce
}

sce2 <- AsetZ_better(sce)
sce2 <- BsetZ_better(sce2)
int_metadata(sce2)$A$Z
## [1] "Aaron"
int_metadata(sce2)$B$Z
## [1] "Davide"

In this manner, both A and B can set their internal X, Y and Z without interfering with each other. Of course, this strategy assumes that packages do not have the same names as some of the built-in internal fields (which would be very unfortunate).
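One way a package might guard against such a clash is sketched below. The helper `set_pkg_col_field` is hypothetical and not part of SingleCellExperiment; it treats the column names of the internal column metadata in a freshly constructed object as the reserved set, which is an assumption about how the built-in fields are stored.

```r
library(SingleCellExperiment)

counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Hypothetical helper: store a per-column field inside a nested DataFrame
# named after the package, refusing names used by built-in internal fields.
set_pkg_col_field <- function(sce, pkg, field, value) {
    # Names present in a fresh object are assumed to be built-in fields.
    template <- SingleCellExperiment(assays = list(counts = matrix(0, 1, 1)))
    reserved <- colnames(int_colData(template))
    if (pkg %in% reserved) {
        stop("'", pkg, "' clashes with a built-in internal field")
    }

    internal <- int_colData(sce)
    if (pkg %in% colnames(internal)) {
        # Extend the existing nested DataFrame for this package.
        nested <- internal[[pkg]]
        stopifnot(is(nested, "DataFrame"))
        nested[[field]] <- value
    } else {
        # Create the nested DataFrame on first use.
        nested <- DataFrame(tmp = value)
        colnames(nested) <- field
    }
    internal[[pkg]] <- nested
    int_colData(sce) <- internal
    sce
}

sce2 <- set_pkg_col_field(sce, "A", "X", runif(ncol(sce)))
sce2 <- set_pkg_col_field(sce2, "A", "W", seq_len(ncol(sce)))
sce2 <- set_pkg_col_field(sce2, "B", "X", sample(LETTERS, ncol(sce), replace = TRUE))
colnames(int_colData(sce2)$A)
```

A check like this only catches clashes with fields that exist at the time it runs, so it complements rather than replaces coordination with the maintainers.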

Contacting us

If your package accesses the internal fields of the SingleCellExperiment class, we suggest you get in contact with us on GitHub. This will help us in planning changes to the internal organization of the class. It will also allow us to contact you about upcoming changes or to ask for feedback.

We are particularly interested in scenarios where multiple packages are defining internal fields with the same scientific meaning. In such cases, it may be valuable to provide getters and setters for this field in SingleCellExperiment directly. This reduces redundancy in the definitions across packages and promotes interoperability. For example, methods from one package can set the field, which can then be used by methods of another package.

Other design decisions

What’s up with reducedDims?

We use a SimpleList as the reducedDims slot to allow for multiple dimensionality reduction results. One can imagine that different dimensionality reduction techniques will be useful for different aspects of the analysis, e.g., t-SNE for visualization, PCA for pseudo-time inference. We see reducedDims as a similar slot to assays() in that multiple matrices can be stored, though the dimensionality reduction results need not have the same number of dimensions.
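As a brief sketch, the code below stores two results with different numbers of dimensions in the same object. The matrices are random placeholders rather than real PCA or t-SNE output, and the names `PCA` and `TSNE` are just labels chosen for the example.

```r
library(SingleCellExperiment)

counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Random placeholders standing in for real low-dimensional coordinates;
# each has one row per cell but a different number of columns.
pca_like <- matrix(rnorm(ncol(sce) * 5), nrow = ncol(sce), ncol = 5)
tsne_like <- matrix(rnorm(ncol(sce) * 2), nrow = ncol(sce), ncol = 2)

reducedDims(sce) <- list(PCA = pca_like, TSNE = tsne_like)

reducedDimNames(sce)
dim(reducedDim(sce, "PCA"))
dim(reducedDim(sce, "TSNE"))
```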

Why derive from a RangedSummarizedExperiment?

We decided to extend RangedSummarizedExperiment rather than SummarizedExperiment because for certain assays it will be essential to have rowRanges(). Even for RNA-seq, it is sometimes useful to have rowRanges() to define the genomic coordinates, as is done by other classes such as DESeqDataSet in the DESeq2 package. An alternative would have been to have two classes, SingleCellExperiment and RangedSingleCellExperiment. However, this seems like unnecessary duplication: a single class whose rowRanges are empty by default is good enough when genomic coordinates are not needed.
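The sketch below shows both situations on the same object: first with the default empty ranges, then with coordinates attached. The chromosome name and positions are invented purely for illustration.

```r
library(SingleCellExperiment)

counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Without genomic coordinates, rowRanges() still has one (empty) entry per row.
is(sce, "RangedSummarizedExperiment")
length(rowRanges(sce))

# Attach made-up coordinates, one range per row.
sce_ranged <- sce
rowRanges(sce_ranged) <- GRanges(
    "chr1",
    IRanges(start = seq(1, by = 1000, length.out = nrow(sce)), width = 500)
)
rowRanges(sce_ranged)
```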

Why not use a MultiAssayExperiment?

Another approach to storing alternative Experiments would be to use a MultiAssayExperiment. We do not do so as the vast majority of scRNA-seq data analyses operate on the endogenous genes. Switching to a MultiAssayExperiment introduces an additional layer of indirection with no benefit in most cases. Indeed, the methods of this class are largely unnecessary when the alternative Experiments contain data for the same samples. By storing nested Experiments, we maintain the familiar SummarizedExperiment interface for better compatibility and ease of use.
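A short sketch of the nested approach follows. The name `Spikes` and the simulated counts are placeholders; the alternative Experiment has a different set of rows but the same 10 columns as the main object.

```r
library(SingleCellExperiment)

counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
sce <- SingleCellExperiment(assays = list(counts = counts))

# A second feature set measured on the same 10 cells ("Spikes" is an arbitrary name).
spike_counts <- matrix(rpois(30, lambda = 5), ncol = 10, nrow = 3)
altExp(sce, "Spikes") <- SummarizedExperiment(assays = list(counts = spike_counts))

altExpNames(sce)
dim(sce)                   # main Experiment is unchanged
dim(altExp(sce, "Spikes")) # nested Experiment has its own rows
```

Functions that only use the SummarizedExperiment interface continue to operate on the main Experiment and do not need to know that the nested one exists.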

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
##  [3] Biobase_2.67.0              GenomicRanges_1.59.0       
##  [5] GenomeInfoDb_1.43.0         IRanges_2.41.0             
##  [7] S4Vectors_0.45.0            BiocGenerics_0.53.1        
##  [9] generics_0.1.3              MatrixGenerics_1.19.0      
## [11] matrixStats_1.4.1           BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-1            jsonlite_1.8.9          compiler_4.4.2         
##  [4] BiocManager_1.30.25     crayon_1.5.3            jquerylib_0.1.4        
##  [7] yaml_2.3.10             fastmap_1.2.0           lattice_0.22-6         
## [10] R6_2.5.1                XVector_0.47.0          S4Arrays_1.7.1         
## [13] knitr_1.49              DelayedArray_0.33.1     maketools_1.3.1        
## [16] GenomeInfoDbData_1.2.13 bslib_0.8.0             rlang_1.1.4            
## [19] cachem_1.1.0            xfun_0.49               sass_0.4.9             
## [22] sys_3.4.3               SparseArray_1.7.1       cli_3.6.3              
## [25] zlibbioc_1.52.0         grid_4.4.2              digest_0.6.37          
## [28] lifecycle_1.0.4         evaluate_1.0.1          buildtools_1.0.0       
## [31] abind_1.4-8             rmarkdown_2.29          httr_1.4.7             
## [34] tools_4.4.2             htmltools_0.5.8.1       UCSC.utils_1.3.0