--- title: "Large-scale single-cell RNA-seq data analysis using GDS files and Seurat" author: "Dr. Xiuwen Zheng (Genomics Research Center, AbbVie)" date: "Feb 2023" output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default vignette: > %\VignetteIndexEntry{scRNA-seq data analysis with GDS files and Seurat} %\VignetteKeywords{scRNAseq, GDS, Seurat} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo=FALSE, results='asis'} BiocStyle::markdown() knitr::opts_chunk$set(warning=FALSE, error=FALSE, message=FALSE) ``` # Introduction Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of gene expression heterogeneity within complex biological systems. As scRNA-seq technology becomes increasingly accessible and cost-effective, experiments are generating data from larger and larger numbers of cells. However, the analysis of large scRNA-seq data remains a challenge, particularly in terms of scalability. While numerous analysis tools have been developed to tackle the complexities of scRNA-seq data, their scalability is often limited, posing a major bottleneck in the analysis of large-scale experiments. In particular, the R package [Seurat](https://cran.r-project.org/package=Seurat) is one of the most widely used tools for exploring and analyzing scRNA-seq data, but its scalability is often limited by available memory. To address this issue, we introduce a new R package called “SCArray.sat” that extends the Seurat classes and functions to support Genomic Data Structure ([GDS](http://www.bioconductor.org/packages/gdsfmt)) files as a [DelayedArray](http://www.bioconductor.org/packages/DelayedArray) backend for data representation. GDS files store multiple dense and sparse array-based data sets in a hierarchical structure. This package defines a new class, called “SCArrayAssay” (derived from the Seurat class “Assay”), which wraps raw counts, normalized expressions, and scaled data matrices based on GDS-specific DelayedMatrix. It is designed to integrate seamlessly with the Seurat package to provide common data analysis in a workflow, with optimized algorithms for GDS data files. ~ ```{r workflow, echo=FALSE, fig.cap="Overview of the SCArray framework.", fig.wide=TRUE} knitr::include_graphics("scarray_sat.svg") ``` ~ The SeuratObject package defines the `Assay` class with three members/slots `counts`, `data` and `scale.data` storing raw counts, normalized expressions and scaled data matrix respectively. However, `counts` and `data` should be either a dense matrix or a sparse matrix defined in the [Matrix](https://cran.r-project.org/package=Matrix) package. The scalability of the sparse matrix is limited by the number of non-zero values (should be < 2^31), since the Matrix package uses 32-bit indices internally. `scale.data` in the `Assay` class is defined as a dense matrix, so it is also limited by the available memory. The new class `SCArrayAssay` is derived from `Assay`, with three additional slots `counts2`, `data2` and `scale.data2` replacing the old ones. These new slots can be DelayedMatrix wrapping an on-disk data matrix, without loading the data in memory. The SCArray.sat package takes advantage of the S3 object-oriented methods defined in the SeuratObject and Seurat packages to reduce code redundancy, by implementing the functions specific to the classes `SCArrayAssay` and `SC_GDSMatrix` (GDS-specific DelayedMatrix). Table 1 shows a list of key S3 methods for data analysis. For example, the function `NormalizeData.SC_GDSMatrix()` will be called when a `SC_GDSMatrix` object is passed to the S3 generic `NormalizeData()`, while `NormalizeData.Assay()` and `NormalizeData.Seurat()` are unchanged. In addition, the SCArray and SCArray.sat packages implement the optimized algorithms for the calculations, by reducing the on-disk data access and taking the GDS data structure into account. ~ **Table 1: Key S3 methods implemented in the SCArray.sat package.** | **Methods** | **Description** | **Note** | |:--------------------------|:--------------------------|:---------------------------| |GetAssayData.SCArrayAssay() |Accessor function for ‘SCArrayAssay’ objects | | |SetAssayData.SCArrayAssay |Setter functions for ‘Assay’ objects | | |NormalizeData.SC_GDSMatrix() |Normalize raw count data | Store a DelayedMatrix | |ScaleData.SC_GDSMatrix() |Scale and center the normalized data | | |FindVariableFeatures.SC_GDSMatrix() |Identifies features | | |RunPCA.SC_GDSMatrix() |Run a PCA dimensionality reduction | | *SC_GDSMatrix: GDS- and single-cell- specific DelayedMatrix.* ~ ~ # Installation * Requires [SCArray](http://www.bioconductor.org/packages/SCArray/) (≥ v1.7.13), [SeuratObject](https://cran.r-project.org/package=SeuratObject) (≥ v4.0), [Seurat](https://cran.r-project.org/package=Seurat) (≥ v4.0) * Bioconductor repository To install this package, start R and enter: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("SCArray.sat") ``` # Examples ## Small Datasets ```{r} # Load the packages suppressPackageStartupMessages({ library(Seurat) library(SCArray) library(SCArray.sat) }) # Input GDS file with raw counts fn <- system.file("extdata", "example.gds", package="SCArray") # show the file contents (f <- scOpen(fn)) scClose(f) # close the file # Create a Seurat object from the GDS file d <- scNewSeuratGDS(fn) class(GetAssay(d)) # SCArrayAssay, derived from Assay d <- NormalizeData(d) d <- FindVariableFeatures(d, nfeatures=500) d <- ScaleData(d) ``` **Let's check the internal data matrices,** ```{r} # get the file name for the on-disk object scGetFiles(d) # raw counts m <- GetAssayData(d, slot="counts") scGetFiles(m) # the file name storing raw count data m # normalized expression # the normalized data does not save in neither the file nor the memory GetAssayData(d, slot="data") # scaled and centered data matrix # in this example, the scaled data does not save in neither the file nor the memory GetAssayData(d, slot="scale.data") ``` **Perform PCA and UMAP:** ```{r fig.wide=TRUE} d <- RunPCA(d, ndims.print=1:2) DimPlot(d, reduction="pca") d <- RunUMAP(d, dims=1:50) # use all PCs calculated by RunPCA() DimPlot(d, reduction="umap") ``` ~ ## Large Datasets Let's download a large single-cell RNA-seq dataset from Bioconductor, and convert it to a GDS file. This step will take a while. If the `TENxBrainData` package is not installed, run: ```R # install a Bioconductor package BiocManager::install("TENxBrainData") ``` Then, ```R library(TENxBrainData) library(SCArray) # scRNA-seq data for 1.3 million brain cells from E18 mice (10X Genomics) # the data will be downloaded automatically at the first time. # raw count data is stored in HDF5 format tenx <- TENxBrainData() rownames(tenx) <- rowData(tenx)$Ensembl # since rownames(tenx)=NULL # save it to a GDS file SCArray::scConvGDS(tenx, "1M_sc_neurons.gds") ``` After the file conversion, users can use this GDS file with Seurat to analyze the data. ~ ~ # Benchmarks ## Test Datasets The datasets used in the benchmark (Table 2) were generated from the 1.3 million brain cells, with the following R codes: ```R library(SCArray) library(SCArray.sat) sce <- scExperiment("1M_sc_neurons.gds") # load the full 1.3M cells # D100 dataset scConvGDS(sce[, 1:1e5], "1M_sc_neurons_d100.gds") # save to a GDS # in-memory Seurat object obj <- scMemory(scNewSeuratGDS("1M_sc_neurons_d100.gds")) saveRDS(obj, "1M_sc_neurons_d100_seuratobj.rds") # save to a RDS # D250 dataset scConvGDS(sce[, 1:2.5e5], "1M_sc_neurons_d250.gds") obj <- scMemory(scNewSeuratGDS("1M_sc_neurons_d250.gds")) saveRDS(obj, "1M_sc_neurons_d250_seuratobj.rds") # D500 dataset scConvGDS(sce[, 1:5e5], "1M_sc_neurons_d500.gds") obj <- scMemory(scNewSeuratGDS("1M_sc_neurons_d500.gds")) saveRDS(obj, "1M_sc_neurons_d500_seuratobj.rds") # Dfull dataset scConvGDS(sce, "1M_sc_neurons_dfull.gds") ``` **Table 2: Datasets in the benchmarks.** | **Dataset** | **# of features** | **# of cells** | **GDS file** | **RDS (Seurat Object)** | |:-------------|:-------------------|:----------------|:--------------|:-------------------------| |D100 |27,998 | 100K | 209MB | 419MB | |D250 |27,998 | 250K | 529MB | 1.1GB | |D500 |27,998 | 500K | 1.1GB | 2.2GB | |Dfull |27,998 | 1.3 million | 2.8GB | Out of the limit of sparse matrix | *the number of non-zeros should be < 2^31 in a sparse matrix.* ~ ## R Codes in the Benchmark The following R script is used in the benchmark for testing GDS files, and the R codes for testing the Seurat Object are similar except the input file. ```R suppressPackageStartupMessages({ library(Seurat) library(SCArray.sat) }) # the input GDS file can be for d250, d500, dfull fn <- "1M_sc_neurons_d100.gds" d <- scNewSeuratGDS(fn) d <- NormalizeData(d) d <- FindVariableFeatures(d, nfeatures=2000) # using the default d <- ScaleData(d) d <- RunPCA(d) d <- RunUMAP(d, dims=1:50) saveRDS(d, "d100.rds") # or d250.rds, d500.rds, dfull.rds q('no') ``` ~ ## Memory Usage and Elapsed Time With large test datasets, the SCArray.sat package significantly reduces the memory usages compared with the Seurat package, while the in-memory implementation in Seurat is only 2 times faster than SCArray.sat. When the full dataset "Dfull" was tested, Seurat failed to load the data since the number of non-zeros is out of the limit of sparse matrix. ```{r benchmark, echo=FALSE, fig.cap="The benchmark on PCA & UMAP with large datasets (CPU: Intel Xeon Gold 6248 @2.50GHz, RAM: 176GB).", fig.wide=TRUE} knitr::include_graphics("benchmark.svg") ``` ~ ~ # Miscellaneous ## Save SCArrayAssay The Seurat object with `SCArrayAssay` can be directly saved to a RDS (R object) file, in which the raw counts in the GDS file is not stored in the RDS file. This can avoid data duplication, and is helpful for faster meta data loading. Please keep the GDS and RDS files in the same directory or the same relative paths. The R object can be reloaded later in another R session, and GDS files are reopened internally when accessing the count data. ```{r} d # the example for the small dataset save_fn <- tempfile(fileext=".rds") # or specify an appropriate location save_fn saveRDS(d, save_fn) # save to a RDS file remove(d) # delete the variable d gc() # trigger a garbage collection d <- readRDS(save_fn) # load from a RDS file d GetAssayData(d, slot="counts") # reopens the GDS file automatically ``` ~ ## Multicore or Multi-process Implementation The multicore and multi-process features are supported by SCArray and SCArray.sat via the Bioconductor package "BiocParallel". To enable the parallel feature, users can use the function `setAutoBPPARAM()` in the DelayedArray package to setup multi-process workers. For examples, ```R library(BiocParallel) DelayedArray::setAutoBPPARAM(MulticoreParam(4)) # 4 child processes ``` ~ ## Downgrade SCArrayAssay The `SCArrayAssay` object can be downgraded to the regular `Assay`. It is useful when the downstream functions or packages do not support DelayedArray. ```{r} is(GetAssay(d)) new_d <- scMemory(d) # downgrade the active assay is(GetAssay(new_d)) ``` If users only want to 'downgrade' the scaled data matrix, then ```{r} is(GetAssayData(d, slot="scale.data")) # it is a DelayedMatrix new_d <- scMemory(d, slot="scale.data") # downgrade "scale.data" in the active assay is(GetAssay(new_d)) # it is still SCArrayAssay is(GetAssayData(new_d, slot="scale.data")) # in-memory matrix ``` ~ ## Conversion from Seurat to SingleCellExperiment A Seurat object with `SCArrayAssay` can be converted to a Bioconductor `SingleCellExperiment` object using `as.SingleCellExperiment()` in the Seurat package. The DelayedMatrix in `SCArrayAssay will be saved in the new SingleCellExperiment object. For example, ```{r} is(d) sce <- as.SingleCellExperiment(d) is(sce) sce counts(sce) # raw counts ``` ~ ## List of supported functions Not all of the functions in the Seurat package can be applied to the `SCArrayAssay` object. Here is the list of currently supported and unsupported functions we have tested. The unsupported methods maybe available on request in the future release of SCArray.sat. Note that the supported states may depend on the package versions of Seurat and SeuratObject, and SeuratObject_4.1.3 and Seurat_4.3.0 were tested here. **Table 3: The states of functions and methods with the support of SCArrayAssay.** |State|Functions |Description |Notes | |:---:|:----------------------|:---------------------------------------|:--------------------------| | ✓ |CreateSeuratObject() | | | | ✓ |FindVariableFeatures() |Identifies the top features | | | ✓ |NormalizeData() |Normalize the count data | | | ✓ |RunPCA() |Principal component analysis | | | ✓ |ScaleData() |Scales and centers features | | | ☑ |FindMarkers() |Differentially expressed genes | data read via blocking | | ☑ |FoldChange() | | | | ☑ |RunICA() |Independent component analysis | | | ☑ |RunSPCA() |Supervised principal component analysis | | | ☑ |RunLDA() |Linear discriminant analysis | | | ☑ |RunSLSI() |Supervised latent semantic indexing | | | ☑ |RunUMAP() |Uniform manifold approx. and projection | | | ⦿ |FindNeighbors |Nearest-neighbor graph construction | | | ⦿ |HVFInfo() |Info for highly variable features | | | ⦿ |RunTSNE() |t-SNE dimensionality reduction | | | ⦿ |ProjectDim() | | | | ⦿ |ProjectUMAP() | | | | ⦿ |SVFInfo() |Info for spatially variable features | | | ⦿ |VariableFeatures() |Get/set variable feature information | | | ✗ |CreateAssayObject() | | use CreateAssayObject2() instead | | ✗ |as.Seurat() | | planned | | ✗ |RunCCA() |Canonical correlation analysis | planned | | ✗ |SCTransform() |Normalization via regularized NB regression | planned | * ✓ Supported (implemented in SCArray.sat, optimized for memory) * ☑ Supported (mainly relying on the implementation in Seurat) * ⦿ Supported (implemented in Seurat, not in SCArray.sat) * ✗ Unsupported (raising an error) ~ ## Debugging `options(SCArray.verbose=TRUE)` is used to enable displaying debug information when calling the functions in the SCArray and SCArray.sat packages. For example, ```{r} options(SCArray.verbose=TRUE) d <- ScaleData(d) ``` ~ ~ # Session Information ```{r} # print version information about R, the OS and attached or loaded packages sessionInfo() ``` ```{r echo=FALSE} unlink("test.rds", force=TRUE) ```