---
title: "Example_Analysis"
author: "Ying Ma"
date: "Jun 25, 2024"
output:
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
vignette: >
    %\VignetteIndexEntry{Example_Analysis}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

## Introduction
We developed a statistical method for spatially informed cell type 
deconvolution for spatial transcriptomics. Briefly,CARD is a 
reference-based deconvolution method that estimates cell type composition 
in spatial transcriptomics based on cell type specific expression 
information obtained from a reference scRNA-seq data. A key feature of 
CARD is its ability to accommodate spatial correlation in the cell type 
composition across tissue locations, enabling accurate and spatially 
informed cell type deconvolution as well as refined spatial map 
construction. CARD relies on an efficient optimization algorithm for 
constrained maximum likelihood estimation and is scalable to spatial 
transcriptomics with tens of thousands of spatial locations 
and tens of thousands of genes.

## Installation
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("CARDspa")
```

This tutorial is the example analysis with CARDspa on the human pancreatic 
ductal adenocarcinomas data from Moncada et al, 2020. Please note that we are 
using edited data, see the [tutorial](https://yma-lab.github.io/CARD/) for an 
example using the complete data.

## Required input data
`CARD` requires two types of input data:
- spatial transcriptomics count data, along with spatial location information.  
- single cell RNAseq (scRNA-seq) count data, along with meta information 
indicating the cell type information and the sample (subject) information for 
each cell.  

The example data for running the tutorial is included in the package.
Here are the details about the required data input illustrated by the example 
datasets. 

### 1. spatial transcriptomics data, e.g.,
```{r}
library(CARDspa)
library(RcppML)
library(NMF)
library(RcppArmadillo)
library(SingleCellExperiment)
library(SpatialExperiment)
library(ggplot2)
#### load the example spatial transcriptomics count data,
data(spatial_count)
spatial_count[1:4, 1:4]
```
The spatial transcriptomics count data must be in the format of matrix or 
sparseMatrix, while each row represents a gene and each column represents a 
spatial location. The column names of the spatial data can be in the 
"XcoordxYcoord" (i.e., 10x10) format, but you can also maintain your original 
spot names, for example, barcode names. 

```{r}
#### load the example spatial location data,
data(spatial_location)
spatial_location[1:4, ]
```
The spatial location data must be in the format of data frame while each row 
represents a spatial location, the first column represents the x coordinate and 
the second column represents the y coordinate. The rownames of the spatial 
location data frame should match exactly with the column names of the 
spatial_count.

### 2. single cell RNAseq ((scRNA-seq)) data,  e.g.,
```{r}
data(sc_count)
sc_count[1:4, 1:4]
```
The scRNA-seq count data must be in the format of matrix or sparseMatrix, while 
each row represents a gene and each column represents a cell.

```{r}
data(sc_meta)
sc_meta[1:4, ]
```

The scRNAseq meta data must be in the format of data frame while each row 
represents a cell. The rownames of the scRNAseq meta data should match exactly 
with the column names of the scRNAseq count data. The sc_meta data must contain 
the column indicating the cell type assignment for each cell (e.g., "cellType" 
column in the example sc_meta data). Sample/subject information should be 
provided, if there is only one sample, we can add a column by 
```sc_meta$sampleInfo = "sample1"```.

We suggest the users to check their single cell RNASeq data carefully before 
running CARD. We suggest the users to input the single cell RNAseq data with 
each cell type containing at least 2 cells. i.e. 
print(table(sc_meta$cellType,useNA = "ifany"))

## Cell Type Deconvolution
### 1. Deconvolution using CARD
We can use `CARD_deconvolution` to deconvolute the spatial transcriptomics
data. The essential inputs are:
- sc_count: Matrix or sparse matrix of raw scRNA-seq count data, each row 
represents a gene and each column represents a cell. This sc_count data 
serves as a reference for the cell type deconvolution for spatial 
transcriptomics data. 
- sc_meta: Data frame, with each row representing the cell type and/or sample 
information of a specific cell. The row names of this data frame should match 
exactly with the column names of the sc_count data. The sc_meta data must 
contain the column indicating the cell type assignment for each cell 
(e.g., "cellType" column in the example sc_meta data). 
- spatial_count: Matrix or sparse matrix of raw spatial resolved 
transcriptomics count data, each row represents a gene and each column 
represents a spatial location. 
This is the spatial transcriptomics data that we are interested to deconvolute.
- spatial_location: Data frame, with two columns representing the x and y 
coordinates of the spatial location. The rownames of this data frame should 
match eaxctly with the columns of the spatial_count.
- ct.varname: Caracter, the name of the column in sc_meta that specifies the
cell type assignment. 
- ct.select: Vector of cell type names that you are interested in to 
deconvolute, default as NULL. If NULL, then use all cell types provided by 
single cell dataset.
- sample.varname: Character,the name of the column in sc_meta that
specifies the sample/subject information. If NULL, we just use the 
whole data as one sample/subject.
- minCountGene: Numeric, include spatial locations where at least this
number of counts detected. Default is 100. 
- minCountSpot: Numeric, include genes where at least this number of spatial
locations that have non-zero expression. Default is 5.

This function first create a CARD object and then do deconvolution. Finally 
return a SpatialExperiment object. The results are stored in 
`CARD_obj$Proportion_CARD`


CARD is computationally fast and memory 
efficient. CARD relies on an efficient optimization algorithm for constrained 
maximum likelihood estimation and is scalable to spatial transcriptomics with 
tens of thousands of spatial locations and tens of thousands of genes. 
For the example dataset with the sample size of 428 locations, it takes 
within a minute to finish the deconvolution. 

```{r}
set.seed(seed = 20200107)
CARD_obj <- CARD_deconvolution(
    sc_count = sc_count,
    sc_meta = sc_meta,
    spatial_count = spatial_count,
    spatial_location = spatial_location,
    ct_varname = "cellType",
    ct_select = unique(sc_meta$cellType),
    sample_varname = "sampleInfo",
    mincountgene = 100,
    mincountspot = 5 )
## QC on scRNASeq dataset! ...
## QC on spatially-resolved dataset! ..
## create reference matrix from scRNASeq...
## Select Informative Genes! ...
## Deconvolution Starts! ...
## Deconvolution Finish! ...
```

And `CARDspa` also supports using SingleCellExperiment object and 
SingleCellExperiment object, you can run the following code:

```{r}
## create sce object
sce <- SingleCellExperiment(assay = list(counts = sc_count),
                            colData = sc_meta)
## create spe object
spe <- SpatialExperiment(assay = list(counts = spatial_count),
                        spatialCoords  = as.matrix(spatial_location)
                        )
celltypes <- unique(sc_meta$cellType)
    
set.seed(seed = 20200107)
CARD_obj <- CARD_deconvolution(
    spe = spe,
    sce = sce,
    sc_count = NULL,
    sc_meta = NULL,
    spatial_count = NULL,
    spatial_location = NULL,
    ct_varname = "cellType",
    ct_select = celltypes,
    sample_varname = "sampleInfo",
    mincountgene = 100,
    mincountspot = 5
)
```

The spatial data are stored in `assays(CARD_obj)$spatial_countMat` and 
`spatialCoords(CARD_obj)` while the scRNA-seq data is stored in 
`CARD_obj@metadata$sc_eset` in the format of SingleCellExperiment. 
The results are stored in `CARD_obj$Proportion_CARD`.

```{r}
print(CARD_obj$Proportion_CARD[1:2, ])
```

### 2. Visualize the proportion for each cell type
First, we jointly visualize the cell type proportion matrix through scatterpie 
plot.Note that here because the number of spots is relatively small, so jointly 
visualize the cell type proportion matrix in the scatterpie plot format is 
duable. We do not recommend users to visualize this plot when the number of 
spots is > 500. Instead, we recommend users to visualize the proportion 
directly, i.e., using the function CARD_visualize_prop(). Details of using this 
function see the next example.  

```{r}
## set the colors. Here, I just use the colors in the manuscript, if the color
## is not provided, the function will use default color in the package.
colors <- c(
    "#FFD92F", "#4DAF4A", "#FCCDE5", "#D9D9D9", "#377EB8", "#7FC97F",
    "#BEAED4", "#FDC086", "#FFFF99", "#386CB0", "#F0027F", "#BF5B17",
    "#666666", "#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E",
    "#E6AB02", "#A6761D"
)
p1 <- CARD_visualize_pie(
    proportion = CARD_obj$Proportion_CARD,
    spatial_location = spatialCoords(CARD_obj),
    colors = colors,
    radius = 0.52
) ### You can choose radius = NULL or your own radius number
print(p1)
```

Then, we can select some interested cell types to visualize separately. 

```{r}
## select the cell type that we are interested
ct.visualize <- c(
    "Acinar_cells", "Cancer_clone_A", "Cancer_clone_B",
    "Ductal_terminal_ductal_like",
    "Ductal_CRISP3_high-centroacinar_like",
    "Ductal_MHC_Class_II",
    "Ductal_APOL1_high-hypoxic",
    "Fibroblasts"
)
## visualize the spatial distribution of the cell type proportion
p2 <- CARD_visualize_prop(
    proportion = CARD_obj$Proportion_CARD,
    spatial_location = spatialCoords(CARD_obj),
    ### selected cell types to visualize
    ct_visualize = ct.visualize,
    ### if not provide, we will use the default colors
    colors = c("lightblue", "lightyellow", "red"),
    ### number of columns in the figure panel
    NumCols = 4,
    ### point size in ggplot2 scatterplot
    pointSize = 3.0
)
print(p2)
```

### 3. Visualize the proportion for two cell types
We added a new visualization function to visualize the distribution of two cell 
types on the same post.  

```{r}
## visualize the spatial distribution of two cell types on the same plot
p3 <- CARD_visualize_prop_2CT(
    ### Cell type proportion estimated by CARD
    proportion = CARD_obj$Proportion_CARD,
    ### spatial location information
    spatial_location = spatialCoords(CARD_obj),
    ### two cell types you want to visualize
    ct2_visualize = c("Cancer_clone_A", "Cancer_clone_B"),
    ### two color scales
    colors = list(
        c("lightblue", "lightyellow", "red"),
        c("lightblue", "lightyellow", "black")
    )
)
print(p3)
```

### 5. Visualize the cell type proportion correlation 
```{r}
# if not provide, we will use the default colors
p4 <- CARD_visualize_Cor(CARD_obj$Proportion_CARD, colors = NULL)
print(p4)
```

## Refined spatial map
A unique feature of CARD is its ability to model the spatial 
correlation in cell type composition across tissue locations, thus 
enabling spatially informed cell type deconvolution. Modeling spatial 
correlation allows us to not only accurately infer the cell type 
composition on each spatial location, but also impute cell type 
compositions and gene expression levels on unmeasured tissue locations, 
facilitating the construction of a refined spatial tissue map with a 
resolution much higher than that measured in the original study.
Specifically, CARD constructed a refined spatial map through the 
function `CARD_imputation`. The essential inputs are:

- CARD_object: SpatialExperiment Object created by CARD_deconvolution with 
estimated cell type compositions on the original spatial resolved 
transcriptomics data.
- NumGrids: Initial number of newly grided spatial locations. The final number 
of newly grided spatial locations will be lower than this value since the newly 
grided locations outside the shape of the tissue will be filtered. 
- ineibor: Numeric, number of neighbors used in the imputation on newly grided 
spatial locations, default is 10.  

Briefly, CARD first outlined the shape of the tissue by applying a 
two-dimensional concave hull algorithm on the existing locations, then perform 
imputation on the newly grided spatial locations. We recommend to check the 
exisiting spatial locations to see if there are outliers that are seriously 
affect the detection of the shape. 

### 1. Imputation on the newly grided spatial locations

```{r}
CARD_obj <- CARD_imputation(
    CARD_obj, 
    num_grids = 2000, 
    ineibor = 10, 
    exclude = NULL)
## The rownames of locations are matched ...
## Make grids on new spatial locations ...
```

The results are store in `CARD_obj$refined_prop` and 
`assays(CARD_obj)$refined_expression`

```{r}
## Visualize the newly grided spatial locations to see if the shape is correctly
## detected. If not, the user can provide the row names of the excluded spatial
## location data into the CARD_imputation function
location_imputation <- cbind.data.frame(
    x = as.numeric(sapply(
        strsplit(rownames(CARD_obj$refined_prop), split = "x"), "[", 1
    )),
    y = as.numeric(sapply(
        strsplit(rownames(CARD_obj$refined_prop), split = "x"), "[", 2
    ))
)
rownames(location_imputation) <- rownames(CARD_obj$refined_prop)
library(ggplot2)
p5 <- ggplot(
    location_imputation,
    aes(x = x, y = y)
) +
    geom_point(shape = 22, color = "#7dc7f5") +
    theme(
        plot.margin = margin(0.1, 0.1, 0.1, 0.1, "cm"),
        legend.position = "bottom",
        panel.background = element_blank(),
        plot.background = element_blank(),
        panel.border = element_rect(colour = "grey89", fill = NA, 
                                    linewidth = 0.5)
    )
print(p5)
```


### 2. Visualize the cell type proportion at an enhanced resolution
Now we can use the same `CARD_visualize_prop` function to visualize the cell 
type proportion at the enhanced resolution. But this time, the input of the 
function should be the imputed cell typr propotion and corresponding newly 
grided spatial locations.

```{r}
p6 <- CARD_visualize_prop(
    proportion = CARD_obj$refined_prop,
    spatial_location = location_imputation,
    ct_visualize = ct.visualize,
    colors = c("lightblue", "lightyellow", "red"),
    NumCols = 4
)
print(p6)
```

### 3. Visualize the marker gene expression at an enhanced resolution
After we obtained cell type proportion at the enhanced resolution by CARD, we 
can predict the spatial gene expression at the enhanced resolution. The 
following code is to visualize the marker gene expression at an enhanced 
resolution.

```{r}
p7 <- CARD_visualize_gene(
    spatial_expression = assays(CARD_obj)$refined_expression,
    spatial_location = location_imputation,
    gene_visualize = c("A4GNT", "AAMDC", "CD248"),
    colors = NULL,
    NumCols = 6
)
print(p7)
```

Now, compare with the original resolution, CARD facilitates the construction of 
a refined spatial tissue map with a resolution much higher than that 
measured in the original study.

```{r}
p8 <- CARD_visualize_gene(
    spatial_expression = metadata(CARD_obj)$spatial_countMat,
    spatial_location = metadata(CARD_obj)$spatial_location,
    gene_visualize = c("A4GNT", "AAMDC", "CD248"),
    colors = NULL,
    NumCols = 6
)
print(p8)
```


## Extension of CARD in a reference-free version: CARDfree
We extended CARD to enable reference-free cell type deconvolution and eliminate 
the need for the single-cell reference data. We refer to this extension as the 
reference-free version of CARD, or simply as CARDfree. Different from CARD, 
CARDfree no longer requires an external single-cell reference dataset and only 
needs users to input a list of gene names for previously known cell type markers
We use the same exmple dataset to illustrate the use of CARDfree. 
In addition to the exmple dataset, CARDfree also requires the input of 
marker gene list, which is in a list format with each element of the list 
being the cell type specific gene markers. The example marker list for runing 
the tutorial is included. 

Similar to CARD, we will first need to create a CARDfree object with 
the spatial transcriptomics dataset and the marker gene list 


### 1. Deconvolution using CARDfree
We can use `CARD_refFree` to do reference-free deconvolution. This function 
frist creat a CARDfree object and then do deconvolution. Briefly, 
the essential inputs are the same as the function `CARD_deconvolution`, except 
that this function does not require the single cell count and meta information 
matrix. Instead, it requires a markerList. 

```{r}
## deconvolution using CARDfree
data(markerList)
set.seed(seed = 20200107)
CARDfree_obj <- CARD_refFree(
    markerlist = markerList,
    spatial_count = spatial_count,
    spatial_location = spatial_location,
    mincountgene = 100,
    mincountspot = 5
)
```

Similarly, you can use SpatialExperiment object.

```{r}
data(markerList)
set.seed(seed = 20200107)
CARDfree_obj <- CARD_refFree(
    markerlist = markerList,
    spatial_count = NULL,
    spatial_location = NULL,
    spe = spe,
    mincountgene = 100,
    mincountspot = 5
)
```


The spatial data are stored in `assays(CARDfree_obj)$spatial_countMat` 
and `spatialCoords(CARDfree_obj)` while the marker list is stored in 
`CARDfree_obj@metadata$markerList` in the format of list. 
The results are stored in `CARDfree_obj$Proportion_CARD`. 

```{r}
## One limitation of reference-free version of CARD is that the cell
## types inferred
## from CARDfree do not come with a cell type label. It might be difficult to
## interpret the results.
print(CARDfree_obj$Proportion_CARD[1:2, ])
```

### 3. Visualization of the results of CARDfree
Note that here because the number of spots is relatively small, so jointly 
visualize the cell type proportion matrix in the scatterpie plot format is 
duable. We do not recommend users to visualize this plot when the number of 
spots is > 500. Instead, we recommend users to visualize the proportion 
directly, i.e., using the function CARD_visualize_prop(). 

```{r}
colors <- c(
    "#FFD92F", "#4DAF4A", "#FCCDE5", "#D9D9D9", "#377EB8", "#7FC97F", "#BEAED4",
    "#FDC086", "#FFFF99", "#386CB0", "#F0027F", "#BF5B17", "#666666",
    "#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E", "#E6AB02",
    "#A6761D"
)
### In order to maximumply match with the original results of CARD, we order the
### colors to generally match with the results infered by CARD
current_data <- CARDfree_obj$Proportion_CARD
new_order <- current_data[, c(
    8, 10, 14, 2, 1, 6, 12, 18, 7, 13, 20, 19, 16,
    17, 11, 15, 4, 9, 3, 5
)]
CARDfree_obj$Proportion_CARD <- new_order
colnames(CARDfree_obj$Proportion_CARD) <- paste0("CT", 1:20)
p9 <- CARD_visualize_pie(CARDfree_obj$Proportion_CARD,
    spatialCoords(CARDfree_obj),
    colors = colors
)
print(p9)
```

## Extension of CARD for single cell resolution mapping
We also extended CARD to facilitate the construction of single-cell resolution 
spatial transcriptomics from non-single-cell resolution spatial 
transcriptomics. 
Details of the algorithm see the main text. Briefly, we infer the single cell 
resolution gene expression for each measured spatial location from the 
non-single cell resolution spatial transcriptomics data based on reference 
scRNaseq data we used for deconvolution. The procedure is implemented in the 
function `CARD_SCMapping`. 
The essential inputs are:
- CARD_object: CARD object create by the createCARDObject function. This one 
should be the one after we finish the deconvolution procedure
- shapeSpot: a character indicating whether the sampled spatial coordinates for 
single cells locating in a Square-like region or a Circle-like region. The 
center of this region is the measured spatial location in the non-single cell 
resolution spatial transcriptomics data. The default is "Square", and the other 
option is "Circle"
- numCell: a numeric value indicating the number of cores used to accelerate 
the procedure.
```{r}
#### Note that here the shapeSpot is the user defined variable which
#### indicates the capturing area of single cells. Details see above.
set.seed(seed = 20210107)
scMapping <- CARD_scmapping(CARD_obj, shapeSpot = "Square", numcell = 20, 
                            ncore = 2)
print(scMapping)
### spatial location info and expression count of the single cell resolution
### data
MapCellCords <- as.data.frame(colData(scMapping))
count_SC <- assays(scMapping)$counts
```

The results are stored in a SingleCellExperiment object with mapped single cell 
resolution counts stored in the assays slot and the information of the spatial 
location for each single cell as well as their relashionship to the original 
measured spatial location is stored in the colData slot. 

Next, we visualize the cell type for each single cell with their spatial 
location information 
```{r}
df <- MapCellCords
colors <- c(
    "#8DD3C7", "#CFECBB", "#F4F4B9", "#CFCCCF", "#D1A7B9", "#E9D3DE", "#F4867C",
    "#C0979F", "#D5CFD6", "#86B1CD", "#CEB28B", "#EDBC63", "#C59CC5",
    "#C09CBF", "#C2D567", "#C9DAC3", "#E1EBA0",
    "#FFED6F", "#CDD796", "#F8CDDE"
)
p10 <- ggplot(df, aes(x = x, y = y, colour = CT)) +
    geom_point(size = 3.0) +
    scale_colour_manual(values = colors) +
    # facet_wrap(~Method,ncol = 2,nrow = 3) +
    theme(
        plot.margin = margin(0.1, 0.1, 0.1, 0.1, "cm"),
        panel.background = element_rect(colour = "white", fill = "white"),
        plot.background = element_rect(colour = "white", fill = "white"),
        legend.position = "bottom",
        panel.border = element_rect(
            colour = "grey89", 
            fill = NA, 
            linewidth = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.title = element_text(size = 13, face = "bold"),
        legend.text = element_text(size = 12),
        legend.key = element_rect(colour = "transparent", fill = "white"),
        legend.key.size = unit(0.45, "cm"),
        strip.text = element_text(size = 15, face = "bold")
    ) +
    guides(color = guide_legend(title = "Cell Type"))
print(p10)
```

## Session information
```{r}
sessionInfo()
```

## References
Ma, Y., Zhou, X. Spatially informed cell-type deconvolution for spatial 
transcriptomics. Nat Biotechnol 40, 1349–1359 (2022).
https://doi.org/10.1038/s41587-022-01273-7

Moncada, R., Barkley, D., Wagner, F. et al. Integrating 
microarray-based spatial transcriptomics and single-cell RNA-seq 
reveals tissue architecture in pancreatic ductal adenocarcinomas. 
Nat Biotechnol 38, 333–342 (2020).
https://doi.org/10.1038/s41587-019-0392-8