--- title: "Introduction to StatescopeR" author: - name: Mischa Steketee affiliation: - Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands email: m.f.b.steketee@amsterdamumc.nl - name: Yongsoo Kim affiliation: - Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands email: yo.kim@amsterdamumc.nl output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" vignette: > %\VignetteIndexEntry{Introduction to StatescopeR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", ## see https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html crop = NULL ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], StatescopeR = citation("StatescopeR")[1] ) ``` # Introduction StatescopeR is an R wrapper around Statescope, a computational framework designed to discover cell states from cell type-specific gene expression profiles inferred from bulk RNAseq profiles. StatescopeR executes the three following steps consecutively: 1) Single cell RNAseq reference-based deconvolution 2) Refinement of inferred cell type-specific gene expression profiles 3) Cell state discovery. In addition to bulk RNAseq and a single cell RNAseq (scRNAseq) reference, StatescopeR can integrate prior expectations of (groups of) cell fractions in the bulk RNAseq. For example tumor fractions estimated by DNA copy number analysis or immune cell fractions by immunohistochemistry. Below is a manual for running StatescopeR with scRNAseq data from human pancreas cells used for both the reference and pseudobulk which is deconvolved. # Running StatescopeR ## Installation StatescopeR is available through Bioconductor. Install it using the following commands in R: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("StatescopeR") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Prepare scRNAseq data After installing StatescopeR, the next step is to load it along with the scuttle and the scRNAseq packages. From the latter, the SegerstolpePancreas dataset will be used as an example. This dataset will serve two purposes: 1) Creating a scRNAseq reference. Alternatively, premade signatures can be obtained from using the fetch_signature function. 2) Creating pseudobulk for deconvolution, which can also be used for evaluation, as the true cell type fractions and expression are known. Usually, bulk RNA sequencing data, with unknown cell type fractions and expression, would be used for deconvolution. ```{r 'Prepare scRNAseq data'} ## Load StatescopeR & scRNAseq (for example data) suppressMessages(library(StatescopeR)) suppressMessages(library(scRNAseq)) suppressMessages(library(scuttle)) ## Load the SegerstolpePancreas data set scRNAseq <- SegerstolpePancreasData() ## remove duplicate genes scRNAseq <- scRNAseq[!duplicated(rownames(scRNAseq)), ] ## remove cells with no cell type label scRNAseq <- scRNAseq[, !is.na(scRNAseq$`cell type`)] ## remove very rare cell types (<100 cells in total data set) celltypes_to_remove <- names(table(scRNAseq$`cell type`)[(table(scRNAseq$`cell type`) < 100)]) scRNAseq <- scRNAseq[, !scRNAseq$`cell type` %in% celltypes_to_remove] ``` ## Prepare StatescopeR input StatescopeR takes the following inputs: 1) scRNAseq reference/signature 2) bulk RNAseq to be deconvolved 3) selected genes for deconvolution, by default this uses `r BiocStyle::Githubpkg("theislab/AutoGeneS")` (optional) Prior expectation of fractions. See [Group expectations](#group-expectations) for expectations of groups of cell types. (optional) Hyperparameters (i.e. cores for parallelization , nrepfinal for maximum number of optimization iterations etc.) ```{r 'Prepare StatescopeR input'} ## Normalize (cp10k) and logtransform scRNAseq cpm(scRNAseq) <- calculateCPM(scRNAseq) logcounts(scRNAseq) <- log1p(cpm(scRNAseq) / 100) ## Create scRNAseq reference/signature with 50 hvg for quick example signature <- create_signature(scRNAseq, hvg_genes = TRUE, n_hvg_genes = 50L, labels = scRNAseq$`cell type` ) ## select subset of genes for deconvolution (30/50 hvg to make it quick) selected_genes <- select_genes(scRNAseq, 30L, 50L, labels = scRNAseq$`cell type` ) ## Create pseudobulk and normalize to cp10k (logging is done within Statescope) pseudobulk <- aggregateAcrossCells(scRNAseq, ids = scRNAseq$individual) normcounts(pseudobulk) <- calculateCPM(pseudobulk) / 100 pseudobulk <- as(pseudobulk, "SummarizedExperiment") rownames(pseudobulk) <- rownames(scRNAseq) ## (optional) Create prior expectation for ductal cells prior <- gather_true_fractions(scRNAseq, ids = scRNAseq$individual, label_col = "cell type" ) # Use True sc fractions for this prior[rownames(prior) != "ductal cell", ] <- NA # Keep only ductal cells prior <- t(prior) # Transpose it to nSample x nCelltype ``` ## Run StatescopeR Once all input data have been prepared, the (pseudo)bulk RNAseq can be deconvolved to identify transcriptional states. In short the modules work as follows: BLADE_deconvolution employs Bayesian variational inference, modelling the bulk gene expression level of each gene j in sample i by: \begin{equation} y_{ij} = log(\sum_{t} f_{i}^t exp(x_{ij}^t)) + \epsilon_{ij} \end{equation} with hidden variables $f_{i}^t$ and $x_{ij}^t$ denoting the fraction of cell type t for sample i and the expression of gene j for sample i in cell type j respectively. Refinement further optimizes cell type-specific gene expression profiles, $x_{ij}^t$, by fixing estimated cell fractions and putting more emphasis on capturing inter-sample heterogeneity over staying close to the single cell prior knowledge. StateDiscovery uses convex-NMF to cluster inferred cell type-specific gene expression profiles in an unsupervised manner. These modules are wrappers around the original Python functions `r BiocStyle::Githubpkg("tgac-vumc/Statescope")`. ```{r 'Run StatescopeR'} ## Run Deconvolution module Statescope <- BLADE_deconvolution(signature, pseudobulk, selected_genes, prior) ## Run Refinement module Statescope <- Refinement(Statescope, signature, pseudobulk, cores = 2L) ## Run State Discovery module (pick k=2 for quick example) Statescope <- StateDiscovery(Statescope, k = 2L, Ncores = 2L) ``` ## Evaluate Statescope Since pseudobulk data derived from scRNAseq are used in this manual, it is possible to verify the accuracy of the StatescopeR output. ```{r 'Evaluate Statescope'} ## Count fractions of cells per sample in the pseudobulk true_fractions <- gather_true_fractions(scRNAseq, ids = scRNAseq$individual, label_col = "cell type" ) ## Plot correlation and RMSE with true fractions per celltype fraction_eval(Statescope, true_fractions) ``` ## Downstream analysis/visualizations After running StatescopeR, a wide range of analyses can be performed. In this section, we describe how to access the results and show some examples of visualizations with them. ```{r 'Downstream analysis/visualizations'} ## Show predicted fractions metadata(Statescope)$fractions ## Show predicted ductal cell type specific gene expression profiles assay(metadata(Statescope)$ct_specific_gep$`ductal cell`, "weighted_gep") ## Show ductal cell state scores per sample metadata(Statescope)$statescores$`ductal cell` ## Show ductal cell state loadings metadata(Statescope)$stateloadings$`ductal cell` ## Plot heatmap of fractions fraction_heatmap(Statescope) ## Plot barplot of top state loadings barplot_stateloadings(Statescope) ``` ## Group expectations When an expected fraction is available for a group of cell types rather than for individual cell types (i.e. Lymphoid cells instead of T/B cells), StatescopeR allows the use of a group prior. Here we show an example where the expected fraction of the pancreatic islet cells (alpha, beta, gamma & delta cells) together is known, but not their individual contributions. ```{r 'Group expectations'} ## define cell types in one group grouped_cts <- c("alpha cell", "beta cell", "gamma cell", "delta cell") ## initialize group assignment matrix (nCelltype x nGroup) with 0's celltype_names <- colnames(signature$mu) group_names <- c("Group", setdiff(celltype_names, grouped_cts)) nCelltype <- length(celltype_names) nGroup <- length(group_names) group <- matrix(0, nrow = nGroup, ncol = nCelltype, dimnames = list(group_names, celltype_names) ) ## initialize prior fraction matrix (nSamples x nCelltypes) with NA's nSamples <- ncol(pseudobulk) sample_names <- colnames(pseudobulk) prior <- matrix( nrow = nSamples, ncol = nGroup, dimnames = list(sample_names, group_names) ) ## Assign cell types to groups for (ct in celltype_names) { if (ct %in% grouped_cts) { ## assign cell type to group group["Group", ct] <- 1 } else { ## Assign cell type to its own group group[ct, ct] <- 1 } } ## Add grouped prior fractions to prior true_fractions <- gather_true_fractions(scRNAseq, ids = scRNAseq$individual, label_col = "cell type") prior[names(true_fractions), "Group"] <- colSums(as.matrix( true_fractions[grouped_cts, ] )) ## init group prior group_prior <- list("Group" = group, "Expectation" = prior) ## Now you can just run Statescope as with a normal prior. Note we do not give ## a ductal cell prior this time Statescope <- BLADE_deconvolution(signature, pseudobulk, selected_genes, group_prior, cores = 1L, Nrep = 1L ) ``` # Citing `StatescopeR` If you made use of StatescopeR for your research, please use the following information to cite the package and the underlying methodology. ```{r "citation"} ## Citation info citation("StatescopeR") ``` # Reproducibility The `r Biocpkg("StatescopeR")` package `r Citep(bib[["StatescopeR"]])` was made possible thanks to: - R `r Citep(bib[["R"]])` - `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` - `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` - `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` - `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` - `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])` - `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])` This package was developed using `r BiocStyle::Biocpkg("biocthis")`. `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 80) session_info() ```