Retrofit Simulation Vignette

Introduction

RETROFIT is a statistical method for reference-free deconvolution of spatial transcriptomics data to estimate cell type mixtures. In this Vignette, we will estimate cell type composition of a sample synthetic dataset. We will annotate cell types using an annotated single cell reference.

Package Installation

Install and load the package using the following steps:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("retrofit")

The main functionalities of RETROFIT are covered in this tutorial.

Spatial Transcriptomics Data

First load the ST data, using the following command:

data("vignetteSimulationData")
x = vignetteSimulationData$n10m3_x

The ST data matrix will consist of G = 500 genes and S = 1000 spots i.e., a matrix of order G x S.

Deconvolution

Initialize the following parameters for deconvolution: - iterations: Number of iterations (default = 4000) - L: Number of components required

iterations  = 10
L           = 20

After initialization, run RETROFIT on the data (X) as follows:

result  = retrofit::decompose(x, L=L, iterations=iterations, verbose=TRUE)
## [1] "iteration: 1 0.075 Seconds"
## [1] "iteration: 2 0.088 Seconds"
## [1] "iteration: 3 0.072 Seconds"
## [1] "iteration: 4 0.193 Seconds"
## [1] "iteration: 5 0.078 Seconds"
## [1] "iteration: 6 0.072 Seconds"
## [1] "iteration: 7 0.072 Seconds"
## [1] "iteration: 8 0.072 Seconds"
## [1] "iteration: 9 0.072 Seconds"
## [1] "iteration: 10 0.073 Seconds"
## [1] "Iteration mean:  0.087  seconds  total:  0.867  seconds"
H   = result$h
W   = result$w
Th  = result$th

After deconvolution of ST data, we have our estimates of W (a matrix of order G x L), H (a matrix of order L x S) and Theta (a vector of L components). Here, we are using number of iterations as 10 for demonstration purposes. For reproducing results in the paper, we need to run RETROFIT for iterations = 4000. The whole computation is omitted here due to time complexity (> 10min). We will load the results from 4000 iterations for the rest of the analysis.

H   = vignetteSimulationData$results_4k_iterations$decompose$h
W   = vignetteSimulationData$results_4k_iterations$decompose$w
Th  = vignetteSimulationData$results_4k_iterations$decompose$th

The above results are obtained by running the code below.

iterations = 4000
set.seed(12)
result = retrofit::decompose(x, L=L, iterations=iterations)

Cell-type Annotation

Next, we need to annotate the components, to get the proportion of, say K, cell types. We can do this in two ways: (a) using an annotated single cell reference or (b) using the known marker genes. Here, we will annotate using single cell reference.

Get the single cell reference data:

sc_ref_w = vignetteSimulationData$sc_ref_w

This file contains average gene expression values for G = 500 genes in K = 10 cell types i.e., a matrix of order G x K. Run the following command to get K cell-type mixtures from the ST data X:

K             = 10
result        = retrofit::annotateWithCorrelations(sc_ref=sc_ref_w, K=K, 
                                                   decomp_w=W, decomp_h=H)
H_annotated   = result$h
W_annotated   = result$w
ranked_cells  = result$ranked_cells

Above code assigns components to the cell type it has maximum correlation with.

Deconvolution and annotation in one step (Optional)

We can also deconvolve the ST data matrix along with cell-type annotation all in one step, as follows:

iterations  = 10
L           = 20
K           = 10 
result      = retrofit::retrofit(x, 
                                 sc_ref=sc_ref_w,
                                 iterations=iterations, 
                                 L=L, 
                                 K=K)

Results

Visualize the correlation values between each cell type and the chosen component, as follows:

# correlation between true and estimated cell-type proportions
correlations        = stats::cor(sc_ref_w[ranked_cells], W_annotated[,ranked_cells])
ranked_correlations = sort(diag(correlations), decreasing=TRUE)
df                  = data.frame(x=1:length(ranked_correlations), 
                                 y=ranked_correlations, 
                                 label_x1=1, 
                                 label_x2=2, 
                                 label_y=seq(from=0.5, by=-0.05, length.out=10),
                                 label_cell=ranked_cells,
                                 label_corr=format(round(ranked_correlations, digits=4)))

gg <- ggplot2::ggplot(df,ggplot2::aes(x=x, y=y, group=1)) + 
  ggplot2::geom_line(ggplot2::aes(x=x, y=y)) + 
  ggplot2::geom_point(ggplot2::aes(x=x, y=y)) + 
  ggplot2::theme_bw() + 
  ggplot2::theme(axis.text.x=ggplot2::element_blank()) +
  ggplot2::ylim(0, 1.05) +
  ggplot2::ylab(expression(paste("Correlation (",W^0,",",widetilde(W),")"))) +
  ggplot2::geom_text(data=df, ggplot2::aes(x=label_x1, y=label_y, label=label_corr), size=4, hjust=0) +
  ggplot2::geom_text(data=df, ggplot2::aes(x=label_x2, y=label_y, label=label_cell), size=4, hjust=0)
plot(gg)

All mapped cell types have greater than 0.75 correlation with the component they are matched with.

Since this is synthetic data, true cell type proportions are known. We can visualize the RMSE between the true and the estimated cell-type proportions as follows:

H_true  = vignetteSimulationData$sc_ref_h
H_est   = H_annotated
corrH   = sort(diag(stats::cor(H_true,H_est)), decreasing=TRUE, na.last=TRUE)
df      = data.frame(x=seq(0,1,length.out = 1000), 
                     y=corrH)
df_text = data.frame(x=0.2,
                     y=0.6,
                     label = c(paste("RETROFIT:", round(DescTools::AUC(x=seq(0,1,length.out = 1000), y=corrH),digits=3))))

gg   <- ggplot2::ggplot(df, ggplot2::aes(x=x,y=y)) +
  ggplot2::geom_line() + 
  ggplot2::scale_color_manual('gray30') + 
  ggplot2::xlab("Normalized Rank") +
  ggplot2::ylab(expression(paste("Correlation (",H,",",widetilde(H),")"))) +
  ggplot2::theme_bw() +
  ggplot2::geom_text(data = df_text, ggplot2::aes(label = label)) 
plot(gg)

In case of real data, where ground truth is not available, we can do certain diagnostics on the quality of results. For example, compute the correlation between marker expression (if known) and cell type proportion in ST data and visualize the agreement of the spatial pattern between markers and estimated proportion (read the paper for more information).

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] retrofit_1.7.0   Rcpp_1.0.13-1    BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] gld_2.6.6           gtable_0.3.6        xfun_0.49          
##  [4] bslib_0.8.0         ggplot2_3.5.1       corrplot_0.95      
##  [7] lattice_0.22-6      vctrs_0.6.5         tools_4.4.2        
## [10] bitops_1.0-9        tibble_3.2.1        proxy_0.4-27       
## [13] fansi_1.0.6         pkgconfig_2.0.3     Matrix_1.7-1       
## [16] data.table_1.16.2   pals_1.9            readxl_1.4.3       
## [19] lifecycle_1.0.4     rootSolve_1.8.2.4   compiler_4.4.2     
## [22] farver_2.1.2        stringr_1.5.1       Exact_3.3          
## [25] munsell_0.5.1       mapproj_1.2.11      htmltools_0.5.8.1  
## [28] sys_3.4.3           buildtools_1.0.0    DescTools_0.99.58  
## [31] maps_3.4.2.1        class_7.3-22        sass_0.4.9         
## [34] RCurl_1.98-1.16     yaml_2.3.10         pillar_1.9.0       
## [37] jquerylib_0.1.4     MASS_7.3-61         cachem_1.1.0       
## [40] boot_1.3-31         digest_0.6.37       mvtnorm_1.3-2      
## [43] stringi_1.8.4       reshape2_1.4.4      forcats_1.0.0      
## [46] labeling_0.4.3      maketools_1.3.1     fastmap_1.2.0      
## [49] grid_4.4.2          colorspace_2.1-1    lmom_3.2           
## [52] expm_1.0-0          cli_3.6.3           magrittr_2.0.3     
## [55] dichromat_2.0-0.1   utf8_1.2.4          e1071_1.7-16       
## [58] withr_3.0.2         scales_1.3.0        httr_1.4.7         
## [61] rmarkdown_2.29      cellranger_1.1.0    hms_1.1.3          
## [64] png_0.1-8           evaluate_1.0.1      haven_2.5.4        
## [67] knitr_1.49          rlang_1.1.4         glue_1.8.0         
## [70] BiocManager_1.30.25 rstudioapi_0.17.1   jsonlite_1.8.9     
## [73] R6_2.5.1            plyr_1.8.9