---
title: "Handling large H5AD datasets"
subtitle: 'On-disk versus in-memory storage'
author: "Developed by [Gabriel Hoffman](http://gabrielhoffman.github.io/)"
documentclass: article
vignette: >
%\VignetteIndexEntry{Loading large-scale H5AD datasets}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\usepackage[utf8]{inputenc}
output:
BiocStyle::html_document:
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
error = FALSE,
eval = FALSE,
tidy = FALSE,
dev = c("png"),
cache = TRUE
)
```
# Introduction
As the scale of single cell RNA-seq datasets continues to increase, loading the entire dataset into memory is often impractical. Dreamlet can take advantage of efficient data storage through the [Bioconductor's](https://bioconductor.org) [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) interface to dramatically reduce memory usage. There are two general methods for storing large single-cell datasets. [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/), and therefore `dreamlet`, are compatible with both:
1) __On-disk storage__: The data remains on the physical disk and is only read into memory when requested. In R there is little data duplication in memory since operations are delayed until the result is requested using `DelayedMatrix`. Memory usage is minimized at the cost of performance since disk access can be slow.
2) __In-memory storage__: Single cell read counts are generally sprase with, say, 20% of read counts being non-zero. The counts can be stored as a `sparseMatrix` that stores only non-zero entries. Performance is higher since data is in memory, but excessive memory usage can be a problem.
# On-disk storage: zellkonverter
[zellkonverter](https://github.com/theislab/zellkonverter) takes advantage of the [H5AD](https://anndata.readthedocs.io/en/latest/fileformat-prose.html) file format built on the [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) format in order to dramatically reduce memory usage while still retaining performance. The [zellkonverter](https://github.com/theislab/zellkonverter) package uses a [DelayedArray](https://bioconductor.org/packages/DelayedArray/) backend to provide a seamless interface to an on-disk H5AD dataset through the interface of the [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) class.
`zellkonverter` is an interface for python code. Note that the python dependencies are installed the first time `readH5AD()` is run, rather than when the package is installed. So the first run of `readH5AD()` can take a few minutes, but subsequent runs will be fast.
## Standard usage
```{r example, eval=FALSE}
library(zellkonverter)
library(SingleCellExperiment)
# Create SingleCellExperiment object that points to on-disk H5AD file
sce <- readH5AD(h5ad_file, use_hdf5 = TRUE)
```
## Merge multiple datasets
```{r example2, eval=FALSE}
# Read a series of H5AD files into a list
# then combine them into a single merged SingleCellExperiment
sce.lst <- lapply(h5ad_files, function(file) {
readH5AD(file, use_hdf5 = TRUE)
})
sce <- do.call(cbind, sce.list)
```
## Access alternative data
Some software, including [pegasus](https://pegasus.readthedocs.io/en/stable/), store normalized data in the default portion of the file and save raw counts in the `raw/` field. Here, load a H5AD file generated by pegasus.
```{r example3, eval=FALSE}
# read raw/ from H5AD file
# raw = TRUE tells readH5AD() to read alternative data
# Must use zellkonverter >=1.3.3
sce_in <- readH5AD(h5ad_file, use_hdf5 = TRUE, raw = TRUE)
# use `raw` as counts
sce <- swapAltExp(sce_in, "raw") # use raw as main
counts(sce) <- assay(sce, "X") # set counts assay to data in X
assay(sce, "X") <- NULL # free X
```
# In-memory storage: Seurat
[Seurat](https://satijalab.org/seurat) can also convert and import H5AD files, and then convert to [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/). The resulting `SingleCellExperiment` object stores data as a `sparseMatrix`.
```{r Seurat, eval=FALSE}
library(SingleCellExperiment)
library(SeuratDisk)
# Convert h5ad file to h5seurat format
Convert(h5ad_file, dest = "h5seurat")
# load Seurat file
obj <- LoadH5Seurat(h5seurat_file)
# Convert Seurat object to SingleCellExperiment
sce <- as.SingleCellExperiment(obj)
```
# Comparing interfaces
The `SingleCellExperiment` interface to `zellkonverter` and `Seurat` hides the backend differences from the typical R user. The usage of `dreamlet` is the same in both cases. For small to medium datasets, the performance differences should be minimal. However, for large datasets there can be a substantial difference in performance. Use of `zellkonverter` and `DelayedMatrix` minimize memory usage at the cost of computationl performance. Use of `Seurat` and `sparseMatrix` can be faster if there is sufficent memory.
`aggregateToPseudoBulk()` for computing pseudobulk data uses different backend implements for on-disk `DelayedMatrix` and in-memory `sparseMatrix`. Both versions are highly optimized compared to a naive implementation.
# Session Info
```{r session, echo=FALSE}
sessionInfo()
```