---
title: "StabMap: Stabilised mosaic single cell data integration using unshared features"
author:
  - Shila Ghazanfar
  - Nick Robertson
  - Aiden Jin
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
package: StabMap
vignette: |
  %\VignetteIndexEntry{Mosaic single cell data integration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = FALSE
)
```


```{r setup, message=FALSE, warning=FALSE}
library(StabMap)
library(magrittr)

library(scater)
library(scran)
library(SingleCellMultiModal)
library(gridExtra)
```

```{r}
set.seed(2024)
```

# Introduction

StabMap is a technique for performing mosaic single cell data integration. 
Mosaic data integration presents the challenge of integration of data where 
only some features or cells are shared across datasets. For example, these 
challenges arise when integrating single-cell datasets that measure different 
molecular profiles, such as chromatin accessibility or RNA expression assays. 
Integrative analysis of such data may provide a more in-depth profile of each 
cell, facilitating downstream analysis. To read more about StabMap please see 
our [paper on Nature Biotechnology](https://www.nature.com/articles/s41587-023-01766-z).  

## Vignette Goals
In this vignette we will elaborate on how mosaic single cell data integration 
is implemented in the `StabMap` package. We address a few key goals: 

  - Mosaic Data integration for 2 datasets
  
  - Demonstrating cell imputation following integration
  
  - Indirect mosaic data integration for 3 datasets, including 2 non-overlapping 
    datasets

# Load data

In this tutorial we will work with a multi-assay single cell dataset, consisting
of ATAC and gene expression data for 10,032 cells. 

```{r, message=FALSE, warning=FALSE}
mae <- scMultiome(
  "pbmc_10x",
  mode = "*", dry.run = FALSE, format = "MTX", verbose = TRUE
)
```

Perform some exploration of this data.

```{r}
mae

upsetSamples(mae)

head(colData(mae))

dim(experiments(mae)[["rna"]])

names(experiments(mae))
```

Keep the first 2,000 cells only. Normalise and select variable features for the 
RNA modality.


```{r}
sce.rna <- experiments(mae)[["rna"]]

# Normalisation
sce.rna <- logNormCounts(sce.rna)

# Feature selection
decomp <- modelGeneVar(sce.rna)
hvgs <- rownames(decomp)[decomp$mean > 0.01 & decomp$p.value <= 0.05]

length(hvgs)

sce.rna <- sce.rna[hvgs, ]
```

Keep the first 2,000 cells only. Normalise and select variable features for the ATAC modality.

```{r}
dim(experiments(mae)[["atac"]])

sce.atac <- experiments(mae)[["atac"]]

# Normalise
sce.atac <- logNormCounts(sce.atac)

# Feature selection using highly variable peaks
# And adding matching peaks to genes
decomp <- modelGeneVar(sce.atac)
hvgs <- rownames(decomp)[decomp$mean > 0.25 &
  decomp$p.value <= 0.05]
length(hvgs)

sce.atac <- sce.atac[hvgs, ]
```

Create a composite full data matrix by concatenating.

```{r}
logcounts_all <- rbind(logcounts(sce.rna), logcounts(sce.atac))
dim(logcounts_all)

assayType <- ifelse(rownames(logcounts_all) %in% rownames(sce.rna),
  "rna", "atac"
)
table(assayType)
```

# Mosaic data integration with StabMap

We will simulate a situation where half of the cells correspond to the Multiome
(RNA + ATAC features) modality, and half of the cells correspond to the RNA modality.
Our goal is to then integrate both datasets by generating a joint embedding of 
the cells using all data, and to impute the missing ATAC cell values from the RNA 
modality cells.

```{r}
dataType <- setNames(
  sample(c("RNA", "Multiome"), ncol(logcounts_all),
    prob = c(0.5, 0.5), replace = TRUE
  ),
  colnames(logcounts_all)
)
table(dataType)

assay_list <- list(
  RNA = logcounts_all[assayType %in% c("rna"), dataType %in% c("RNA")],
  Multiome = logcounts_all[
    assayType %in% c("rna", "atac"), dataType %in% c("Multiome")
  ]
)

lapply(assay_list, dim)
lapply(assay_list, class)
```

Examine the shared features between the two datasets using `mosaicDataUpSet()`.

```{r}
mosaicDataUpSet(assay_list, plot = FALSE)
```

From this we note that there are shared features between the RNA and Multiome
datasets, but there are many features that are observed only in the Multiome
dataset and not the RNA - as we had constructed.

We can understand the `mosaicDataTopology()` of these datasets, which
generates an `igraph` object, which can be inspected and plotted. 
The `mosaicDataTopology()` is a weighted network where nodes represent each 
dataset, and edges connect nodes with at least one overlapping feature.   

```{r}
mdt <- mosaicDataTopology(assay_list)
mdt
plot(mdt)
```

From this we note that the datasets RNA and Multiome share at least some 
features. StabMap requires that the mosaic data topology network be connected,
that is, that there should be a path between every pair of nodes in the network.

We now aim to integrate the data from the RNA and Multiome modality by generating 
a common joint embedding for these data using `stabMap()`. The `stabMap()` integration
approach aims to stabilize integration of single-cell data by exploting the 
non-overlapping features, so that cells with similar biological profiles will 
cluster. Stabilisation using non-overlapping features may be important
when there are limited overlapping features or when the informative features are
unknown. 

<details>
  <summary>**What is `stabMap` doing?**</summary>
`stabMap` generates a joint embedding using 3 steps:

- Identify the `mosaicDataTopology()`
  
- Embed the reference dataset into a lower dimensional space
  
- Project cells from non-reference datasets onto the reference dataset embedding
 by using a model to traverse shortest paths in the `mosaicDataTopology()`   
 
</details>

Since the Multiome data contains all features, we treat this as the reference dataset.
Since we already examined the mosaic data topology, we set `plot = FALSE`.

```{r}
stab <- stabMap(assay_list,
  reference_list = c("Multiome"),
  plot = FALSE
)
dim(stab)
stab[1:5, 1:5]
```

We can reduce the dimension further using non-linear approaches such as UMAP.

```{r}
stab_umap <- calculateUMAP(t(stab))
dim(stab_umap)

plot(stab_umap, pch = 16, cex = 0.3, col = factor(dataType[rownames(stab)]))
```

Here we see that the RNA and Multiome cells are fairly well-mixed.

# Data imputation after StabMap

Given the joint embedding, we can predict the missing ATAC cell values using
`imputeEmbedding()`. We use `imputeEmbedding()` for demonstration purposes as 
for our data both modalities have sufficient sample sizes (cells) and thus 
cellular imputation isn't needed. 

To `imputeEmbedding()` we provide the data list, and the joint embedding as output 
from `stabMap()`. We set the Multiome cells as reference and the RNA cells as
query. This is useful for downstream visualisation or further interpretation.

```{r}
imp <- imputeEmbedding(
  assay_list,
  stab,
  reference = colnames(assay_list[["Multiome"]]),
  query = colnames(assay_list[["RNA"]])
)

class(imp)
names(imp)
lapply(imp, dim)
lapply(assay_list, dim)
imp[["Multiome"]][1:5, 1:5]
```

# Annotating Query Datasets using the StabMap embedding

We can also leverage this joint embedding to annotate the query data.
We will use a k-nearest neighbors (KNN) based algorithm to 
transfer cell type labels from the reference to the query dataset. For our 
demonstration we will treat the Multiome dataset as the reference and the RNA 
dataset as the query.   

The column data of the single cell experiments objects contained in `mae` 
contain cell type annotations for each cell in the `celltype` column. We first
extract cell type annotations for our reference dataset (Multiome). 

```{r}
annotation <- "celltype"
referenceLabels <- colData(
  experiments(mae)[["rna"]]
)[colnames(assay_list$Multiome), annotation]
names(referenceLabels) <- colnames(assay_list$Multiome)

table(referenceLabels)
```

To classify query cells based on a reference dataset we can use the 
`classifyEmbedding()` function. We provide the joint embedding generated by 
`stabMap()` and cell type labels for the reference dataset to the
`classifyEmbedding()` function. `classifyEmbedding()` returns a dataframe with predicted 
labels in the `predicted_labels` column. 

```{r}
knn_out <- classifyEmbedding(
  stab,
  referenceLabels,
)
```

As we have simulated out datasets we have the true label annotations for the RNA
(query) cells. We can evaluate how well our predicted annotations match the true
annotations use a measure such as accuracy. 

```{r}
# Extract query labels
queryLabels <- colData(
  experiments(mae)[["rna"]]
)[colnames(assay_list$RNA), annotation]
names(queryLabels) <- colnames(assay_list$RNA)

acc <- mean(queryLabels == knn_out[names(queryLabels), "predicted_labels"])
acc
```

Since both the reference and query cells are embedded in the same low 
dimensional space we can also visualise their cells together. Here we present a 
UMAP visualisation colour coded by their cell types. 

```{r}
# Extract reference and query cells from UMAP embedding
stab_umap_ref <- stab_umap[colnames(assay_list$Multiome), ]
stab_umap_query <- stab_umap[colnames(assay_list$RNA), ]

# Create UMAP for reference cells
df_umap_ref <- data.frame(
  x = stab_umap_ref[, 1],
  y = stab_umap_ref[, 2],
  cell_type = referenceLabels[rownames(stab_umap_ref)]
)

p_ref <- df_umap_ref %>%
  ggplot() +
  aes(x = x, y = y, colour = cell_type) +
  geom_point(size = 1) +
  ggtitle("Reference cell type annotation")

# Create UMAP for query cells
df_umap_query <- data.frame(
  x = stab_umap_query[, 1],
  y = stab_umap_query[, 2],
  cell_type = queryLabels[rownames(stab_umap_query)]
)

p_query <- df_umap_query %>%
  ggplot() +
  aes(x = x, y = y, colour = cell_type) +
  geom_point(size = 1) +
  ggtitle("Query predicted cell types")

grid.arrange(p_ref, p_query, ncol = 2)
```

# Indirect mosaic data integration with StabMap

StabMap is a flexible framework for mosaic data integration, and can still 
integrate data even when there are pairs of datasets that share no features at 
all. So long as there is a path connecting the datasets along the mosaic data 
topology (and the underlying assumption that the shared features along these
paths contain information), then we can extract meaningful joint embeddings. To
demonstrate this, we will simulate three data sources.

```{r}
dataTypeIndirect <- setNames(
  sample(c("RNA", "Multiome", "ATAC"), ncol(logcounts_all),
    prob = c(0.3, 0.3, 0.3), replace = TRUE
  ),
  colnames(logcounts_all)
)
table(dataTypeIndirect)

assay_list_indirect <- list(
  RNA = logcounts_all[assayType %in% c("rna"), dataTypeIndirect %in% c("RNA")],
  Multiome = logcounts_all[
    assayType %in% c("rna", "atac"), dataTypeIndirect %in% c("Multiome")
  ],
  ATAC = logcounts_all[
    assayType %in% c("atac"), dataTypeIndirect %in% c("ATAC")
  ]
)

lapply(assay_list_indirect, dim)
lapply(assay_list_indirect, class)
```

Using `mosaicDataUpSet()`, we note that there are no shared features between 
the ATAC and RNA datasets. For their integration we might be able to match features by extracting
genomic positions and making the "central dogma assumption", that is, that the
peaks associated with a genomic position overlapping a gene should correspond to
positive gene expression for that gene. However, using `stabMap()` we need not make this 
assumption for the data integration to be performed.

```{r}
mosaicDataUpSet(assay_list_indirect, plot = FALSE)
```

We can understand the `mosaicDataTopology()` of these datasets, which
generates an `igraph` object, which can be inspected and plotted.

```{r}
mdt_indirect <- mosaicDataTopology(assay_list_indirect)
mdt_indirect
plot(mdt_indirect)
```

StabMap only requires that the mosaic data topology network be connected,
that is, that there should be a path between every pair of nodes in the network.
While ATAC and RNA have no overlapping features, since there is a path between 
RNA and ATAC (via Multiome), we can proceed.

We now generate a common joint embedding for these data using `stabMap()`. Since the 
Multiome data contains all features, we again treat this as the reference 
dataset. Since we already examined the mosaic data topology, we set 
`plot = FALSE`.

```{r}
stab_indirect <- stabMap(assay_list_indirect,
  reference_list = c("Multiome"),
  plot = FALSE
)
dim(stab_indirect)
stab_indirect[1:5, 1:5]
```

We can reduce the dimension further using non-linear approaches such as UMAP.

```{r}
stab_indirect_umap <- calculateUMAP(t(stab_indirect))
dim(stab_indirect_umap)

plot(stab_indirect_umap,
  pch = 16, cex = 0.3,
  col = factor(dataTypeIndirect[rownames(stab_indirect)])
)
```

Here we see that the RNA, ATAC and Multiome cells are fairly well-mixed.

Colouring the cells by their original cell type, we can also see that the 
mosaic data integration is meaningful.

```{r}
cellType <- setNames(mae$celltype, colnames(mae[[1]]))

plot(stab_indirect_umap,
  pch = 16, cex = 0.3,
  col = factor(cellType[rownames(stab_indirect)])
)
```

<details>
  <summary>**Session Info**</summary>
  
```{r}
sessionInfo()
```

</details>