Handling large H5AD datasets

Introduction

As the scale of single cell RNA-seq datasets continues to increase, loading the entire dataset into memory is often impractical. Dreamlet can take advantage of efficient data storage through the Bioconductor’s SingleCellExperiment interface to dramatically reduce memory usage. There are two general methods for storing large single-cell datasets. SingleCellExperiment, and therefore dreamlet, are compatible with both:

  1. On-disk storage: The data remains on the physical disk and is only read into memory when requested. In R there is little data duplication in memory since operations are delayed until the result is requested using DelayedMatrix. Memory usage is minimized at the cost of performance since disk access can be slow.

  2. In-memory storage: Single cell read counts are generally sprase with, say, 20% of read counts being non-zero. The counts can be stored as a sparseMatrix that stores only non-zero entries. Performance is higher since data is in memory, but excessive memory usage can be a problem.

On-disk storage: zellkonverter

zellkonverter takes advantage of the H5AD file format built on the HDF5 format in order to dramatically reduce memory usage while still retaining performance. The zellkonverter package uses a DelayedArray backend to provide a seamless interface to an on-disk H5AD dataset through the interface of the SingleCellExperiment class.

zellkonverter is an interface for python code. Note that the python dependencies are installed the first time readH5AD() is run, rather than when the package is installed. So the first run of readH5AD() can take a few minutes, but subsequent runs will be fast.

Standard usage

library(zellkonverter)
library(SingleCellExperiment)

# Create SingleCellExperiment object that points to on-disk H5AD file
sce <- readH5AD(h5ad_file, use_hdf5 = TRUE)

Merge multiple datasets

# Read a series of H5AD files into a list
# then combine them into a single merged SingleCellExperiment
sce.lst <- lapply(h5ad_files, function(file) {
  readH5AD(file, use_hdf5 = TRUE)
})
sce <- do.call(cbind, sce.list)

Access alternative data

Some software, including pegasus, store normalized data in the default portion of the file and save raw counts in the raw/ field. Here, load a H5AD file generated by pegasus.

# read raw/ from H5AD file
# raw = TRUE tells readH5AD() to read alternative data
# Must use zellkonverter >=1.3.3
sce_in <- readH5AD(h5ad_file, use_hdf5 = TRUE, raw = TRUE)

# use `raw` as counts
sce <- swapAltExp(sce_in, "raw") # use raw as main
counts(sce) <- assay(sce, "X") # set counts assay to data in X
assay(sce, "X") <- NULL # free X

In-memory storage: Seurat

Seurat can also convert and import H5AD files, and then convert to SingleCellExperiment. The resulting SingleCellExperiment object stores data as a sparseMatrix.

library(SingleCellExperiment)
library(SeuratDisk)

# Convert h5ad file to h5seurat format
Convert(h5ad_file, dest = "h5seurat")

# load Seurat file
obj <- LoadH5Seurat(h5seurat_file)

# Convert Seurat object to SingleCellExperiment
sce <- as.SingleCellExperiment(obj)

Comparing interfaces

The SingleCellExperiment interface to zellkonverter and Seurat hides the backend differences from the typical R user. The usage of dreamlet is the same in both cases. For small to medium datasets, the performance differences should be minimal. However, for large datasets there can be a substantial difference in performance. Use of zellkonverter and DelayedMatrix minimize memory usage at the cost of computationl performance. Use of Seurat and sparseMatrix can be faster if there is sufficent memory.

aggregateToPseudoBulk() for computing pseudobulk data uses different backend implements for on-disk DelayedMatrix and in-memory sparseMatrix. Both versions are highly optimized compared to a naive implementation.

Session Info