---
title: "Single-cell immune repertoire trajectory analysis with dandelionR"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
    number_sections: true
knitr:
  opts_chunk:
    dev: 'png'
    fig.align: 'left'
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Single-cell immune repertoire trajectory analysis with dandelionR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error = FALSE, message = FALSE, warning = FALSE)
library(BiocStyle)
```

# Foreword

Welcome to `dandelionR`!

`dandelionR` is an R package for performing single-cell immune repertoire
trajectory analysis,  based on the original python implementation in 
[dandelion](https://www.github.com/zktuong/dandelion). 

It provides all the necessary tools to interface with
[scRepertoire](https://github.com/ncborcherding/scRepertoire) 
and a custom implementation of absorbing markov chain for pseudotime inference, 
inspired based on the [palantir](https://github.com/dpeerlab/Palantir) python 
package.

This is a work in progress, so please feel free to open an issue if you 
encounter any problems or have any suggestions for improvement.

# Installation

You can install `dandelionR` with:

## Bioconductor
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("dandelionR")
```

## Development version (GitHub)
```{r, eval = FALSE}
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("tuonglab/dandelionR", dependencies = TRUE)
```

In a standard analysis workflow in R, users probably choose to read in their 
VDJ data with [scRepertoire](https://github.com/ncborcherding/scRepertoire). 

In this vignette, we will demonstrate how to perform TCR trajectory analysis 
starting from 'raw' data i.e. just a standard single-cell gene expression data 
(stored in `SingleCellExperiment`) and VDJ data (in AIRR format).

Install `scater` and `scRepertoire` if you haven't already.
```{r, eval = FALSE}
# only for the tutorial
if (!requireNamespace("scater", quietly = TRUE)) {
    BiocManager::install("scater")
}
if (!requireNamespace("scRepertoire", quietly = TRUE)) {
    BiocManager::install("scRepertoire")
}
# or
devtools::install_github("ncborcherding/scRepertoire")
```

# Usage

## Load the required libraries

```{r, message=FALSE, warning=FALSE}
library(dandelionR)
library(scRepertoire)
library(scater)
```

## Load the demo data
Due to size limitations of the package, we have provided a very trimmed down 
version of the demo data to ~2000 cells. The full dataset can be found here 
accordingly: 

GEX (Lymphoid Cells) - https://developmental.cellatlas.io/fetal-immune 

VDJ - https://github.com/zktuong/dandelion-demo-files/ 

The VDJ data is in the dandelion_manuscript/data/dandelion-remap folder.

```{r}
data(demo_sce)
data(demo_airr)
```

Check out the other vignette for an example dataset that starts from the 
original `dandelion` output associated with the original 
[manuscript](https://www.nature.com/articles/s41587-023-01734-7).

```{r}
vignette("vignette_reproduce_original")
```

We will also set the seed so that the plots and results are consistent.
```{r}
set.seed(123)
```

# Use `scRepertoire` to load the VDJ data
For the trajectory analysis work here, we are focusing on the main productive 
TCR chains. Therefore we will flag `filterMulti = TRUE`, which will keep the 
selection of the 2 corresponding chains with the highest expression for a 
single barcode. For more details, refer to `scRepertoire`'s 
[docs](https://www.borch.dev/uploads/screpertoire/reference/combinetcr).

```{r}
contig.list <- loadContigs(input = demo_airr, format = "AIRR")

# Format to `scRepertoire`'s requirements and some light filtering
combined.TCR <- combineTCR(contig.list,
    removeNA = TRUE,
    removeMulti = FALSE,
    filterMulti = TRUE
)
```

## Merging VDJ data with gene expression data
Next we will combine the gene expression data with the VDJ data to create a 
`SingleCellExperiment` object.
```{r}
sce <- combineExpression(combined.TCR, demo_sce)
```

# Initiate `dandelionR` workflow
Here, the data is ready to be used for the pseudobulk and trajectory analysis 
workflow in `dandelionR`.

Because this is a alpha-beta TCR data, we will set the `mode_option` to "abT". 
This will append `abT` to the relevant columns holding the VDJ gene 
information. If you are going to try other types of VDJ data e.g. BCR, you 
should set `mode_option` to "B" instead. And this argument should be 
consistently set with the `vdjPseudobulk` function later.

Since the TCR data is already filtered for productive chains in `combineTCR`, 
we will set `already.productive = TRUE` and can keep `allowed_chain_status` 
as `NULL`.

We will also subset the data to only include the main T-cell types: CD8+T, 
CD4+T, ABT(ENTRY), DP(P)_T, DP(Q)_T.
```{r}
sce <- setupVdjPseudobulk(sce,
    mode_option = "abT",
    already.productive = TRUE,
    subsetby = "anno_lvl_2_final_clean",
    groups = c("CD8+T", "CD4+T", "ABT(ENTRY)", "DP(P)_T", "DP(Q)_T")
)
```

The main output of this function is a `SingleCellExperiment` object with the 
relevant VDJ information appended to the `colData`, particularly the columns 
with the `_main` suffix e.g. `v_call_abT_VJ_main`, `j_call_abT_VJ_main` etc.
```{r}
head(colData(sce))
```

Visualise the UMAP of the filtered data.
```{r}
plotUMAP(sce, color_by = "anno_lvl_2_final_clean")
```


## Milo object and neighbourhood graph construction
We will use miloR to create the pseudobulks based on the gene expression data. 
The goal is to construct a neighbourhood graph with many neighbors with which 
we can sample the representative neighbours to form the objects.

```{r, warning = FALSE}
library(miloR)
milo_object <- Milo(sce)
milo_object <- buildGraph(milo_object, k = 30, d = 20, reduced.dim = "X_scvi")
milo_object <- makeNhoods(milo_object,
    reduced_dims = "X_scvi", d = 20,
    prop = 0.3
)
```
## Construct UMAP on milo neighbor graph

We can visualise this milo object using UMAP.

```{r, warning = FALSE}
milo_object <- miloUmap(milo_object, n_neighbors = 30)
```
```{r}
plotUMAP(milo_object,
    color_by = "anno_lvl_2_final_clean",
    dimred = "UMAP_knngraph"
)
```

# Construct pseudobulked VDJ feature space

Next, we will construct the pseudobulked VDJ feature space using the 
neighbourhood graph constructed above. We will also run PCA on the 
pseudobulked VDJ feature space.
```{r}
pb.milo <- vdjPseudobulk(milo_object,
    mode_option = "abT",
    col_to_take = "anno_lvl_2_final_clean"
)
```

Inspect the newly created `pb.milo` object.
```{r}
pb.milo
```

We can compute and visualise the PCA of the pseudobulked VDJ feature space.
```{r}
pb.milo <- runPCA(pb.milo, assay.type = "Feature_space", ncomponents = 20)
plotPCA(pb.milo, color_by = "anno_lvl_2_final_clean")
```

## TCR trajectory inference using Absorbing Markov Chain

In the original `dandelion` python package, the trajectory inference is done 
using the `palantir` package. Here, we implement the absorbing markov chain 
approach in dandelionR to infer the trajectory, leveraging on `destiny` for 
diffusion map computation.

### Define root and branch tips
```{r}
# extract the PCA matrix
pca <- t(as.matrix(reducedDim(pb.milo, type = "PCA")))
# define the CD8 terminal cell as the top-most cell and CD4 terminal cell as
# the bottom-most cell
branch.tips <- c(which.max(pca[2, ]), which.min(pca[2, ]))
names(branch.tips) <- c("CD8+T", "CD4+T")
# define the start of our trajectory as the right-most cell
root <- which.max(pca[1, ])
```

### Construct diffusion map
```{r, warning = FALSE}
library(destiny)
# Run diffusion map on the PCA
dm <- DiffusionMap(t(pca), n_pcs = 10, n_eigs = 10)
```

### Compute diffusion pseudotime on diffusion map
```{r}
dif.pse <- DPT(dm, tips = c(root, branch.tips), w_width = 0.1)
```

```{r, message=FALSE}
# the root is automatically called DPT + index of the root cell
DPTroot <- paste0("DPT", root)
# store pseudotime in milo object
pb.milo$pseudotime <- dif.pse[[DPTroot]]
# set the colours for pseudotime
pal <- colorRampPalette(rev((RColorBrewer::brewer.pal(9, "RdYlBu"))))(255)
plotPCA(pb.milo, color_by = "pseudotime") +
    scale_colour_gradientn(colours = pal)
```

# Markov chain construction on the pseudobulk VDJ feature space
This step will compute the Markov chain probabilities on the pseudobulk VDJ 
feature space. It will return the branch probabilities in the `colData` and 
the column name corresponds to the branch tips defined earlier.
```{r}
pb.milo <- markovProbability(
    milo = pb.milo,
    diffusionmap = dm,
    terminal_state = branch.tips,
    root_cell = root,
    pseudotime_key = "pseudotime",
    knn = 30
)
```

Inspect the `pb.milo` object to see the newly added columns.
```{r}
head(colData(pb.milo))
```

# Visualising branch probabilities

With the Markov chain probabilities computed, we can visualise the branch 
probabilities towards CD4+ or CD8+ T-cell fate on the PCA plot.
```{r, message=FALSE}
plotPCA(pb.milo, color_by = "CD8+T") + scale_color_gradientn(colors = pal)
plotPCA(pb.milo, color_by = "CD4+T") + scale_color_gradientn(colors = pal)
```

# Transfer

The next step is to project the pseudotime and the branch probability 
information from the pseudobulks back to each cell in the dataset. If the cell 
do not belong to any of the pseudobulk, it will be removed. If a cell belongs 
to multiple pseudobulk samples, its value should be calculated as a weighted 
average of the corresponding values from each pseudobulk, where each weight 
is inverse of the size of the pseudobulk.

## Project pseudobulk data to each cell
```{r}
cdata <- projectPseudotimeToCell(milo_object, pb.milo, branch.tips)
```

## Visualise the trajectory data on a per cell basis
```{r, message=FALSE}
plotUMAP(cdata, color_by = "anno_lvl_2_final_clean", dimred = "UMAP_knngraph")
plotUMAP(cdata, color_by = "pseudotime", dimred = "UMAP_knngraph") +
    scale_color_gradientn(colors = pal)
plotUMAP(cdata, color_by = "CD4+T", dimred = "UMAP_knngraph") +
    scale_color_gradientn(colors = pal)
plotUMAP(cdata, color_by = "CD8+T", dimred = "UMAP_knngraph") +
    scale_color_gradientn(colors = pal)
```

And that's it! We have successfully inferred the trajectory of the T-cells in 
this dataset!

# Session info

```{r, warning = FALSE}
sessionInfo()
```