As the scale of single cell RNA-seq datasets continues to increase,
loading the entire dataset into memory is often impractical. Dreamlet
can take advantage of efficient data storage through the Bioconductor’s SingleCellExperiment
interface to dramatically reduce memory usage. There are two general
methods for storing large single-cell datasets. SingleCellExperiment,
and therefore dreamlet
, are compatible with both:
On-disk storage: The data remains on the
physical disk and is only read into memory when requested. In R there is
little data duplication in memory since operations are delayed until the
result is requested using DelayedMatrix
. Memory usage is
minimized at the cost of performance since disk access can be
slow.
In-memory storage: Single cell read counts are
generally sprase with, say, 20% of read counts being non-zero. The
counts can be stored as a sparseMatrix
that stores only
non-zero entries. Performance is higher since data is in memory, but
excessive memory usage can be a problem.
zellkonverter takes advantage of the H5AD file format built on the HDF5 format in order to dramatically reduce memory usage while still retaining performance. The zellkonverter package uses a DelayedArray backend to provide a seamless interface to an on-disk H5AD dataset through the interface of the SingleCellExperiment class.
zellkonverter
is an interface for python code. Note that
the python dependencies are installed the first time
readH5AD()
is run, rather than when the package is
installed. So the first run of readH5AD()
can take a few
minutes, but subsequent runs will be fast.
Some software, including pegasus, store
normalized data in the default portion of the file and save raw counts
in the raw/
field. Here, load a H5AD file generated by
pegasus.
# read raw/ from H5AD file
# raw = TRUE tells readH5AD() to read alternative data
# Must use zellkonverter >=1.3.3
sce_in <- readH5AD(h5ad_file, use_hdf5 = TRUE, raw = TRUE)
# use `raw` as counts
sce <- swapAltExp(sce_in, "raw") # use raw as main
counts(sce) <- assay(sce, "X") # set counts assay to data in X
assay(sce, "X") <- NULL # free X
Seurat can also convert
and import H5AD files, and then convert to SingleCellExperiment.
The resulting SingleCellExperiment
object stores data as a
sparseMatrix
.
The SingleCellExperiment
interface to
zellkonverter
and Seurat
hides the backend
differences from the typical R user. The usage of dreamlet
is the same in both cases. For small to medium datasets, the performance
differences should be minimal. However, for large datasets there can be
a substantial difference in performance. Use of
zellkonverter
and DelayedMatrix
minimize
memory usage at the cost of computationl performance. Use of
Seurat
and sparseMatrix
can be faster if there
is sufficent memory.
aggregateToPseudoBulk()
for computing pseudobulk data
uses different backend implements for on-disk DelayedMatrix
and in-memory sparseMatrix
. Both versions are highly
optimized compared to a naive implementation.