---
title: "scMitoMut demo: CRC dataset"
author: 
- name: Wenjie Sun
package: scMitoMut
output: 
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
vignette: >
  %\VignetteIndexEntry{CRC_dataset_demo}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r echo=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE, TOC = TRUE)
```

Install {scMitoMut} from Biocondcutor:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scMitoMut")
```

or from github:

```{r, eval=FALSE}
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("wenjie1991/scMitoMut")
```

Load required packages.

```{r}
library(scMitoMut)
library(data.table)
library(ggplot2)
library(rhdf5)
```

# Overview

## Key functions 

Following are the key functions used in the scMitoMut package:

- `parse_mgatk()`: Parses the mgatk output and saves the result in an H5 file.
- `open_h5_file()`: Opens the H5 file and returns a "mtmutObj" object.
- `subset_cell()`: Subsets the cells in the mtmutObj object.
- `run_model_fit()`: Runs the model fitting and saves the results in the H5 file.
- `filter_loc()`: Filters the mutations based on criterias.
- `plot_heatmap()`: Plots a heatmap for p-values, allele frequencies, or binary mutation status.
- `export_df()`, `export_af()`, `export_pval()`, and `export_binary()`: Export the mutation data in `data.frame` format and the allele frequency, p-value, and binary mutation status in `data.matrix` format.

## Key conceptions

- Somatic mutation: the mutation that occurs in cells after fertilizatoin.
- AF or VAF: allele frequency (AF) or variant allele frequency (VAF). It is the ratio of the number of reads supporting the variant allele to the total number of reads covering that locus.

**IMPORTANT**: In this vignette, I used the term "mutation" to refer to the lineage-related somatic mutation. For each mutation, I used the dominant allele as the reference allele. If the reference allele frequency is significantly (FDR < 0.05) lower, I will call the **locus** a mutation.

<!-- TODO: Describing the modeling details.-->

## Background

Single-cell genomics technology serves as a powerful tool for gaining insights into cellular heterogeneity and diversity within complex tissues.

Mitochondrial DNA (mtDNA) is characterized by its small size and the presence of multiple copies within a cell. These attributes contribute to achieving robust mtDNA sequencing coverage and depth in single-cell sequencing data, thereby enhancing the detection of somatic mutations without dropout.


In this vignette, the scMitoMut package is used to identify and visualize lineage-related mtDNA single nucleic somatic mutations.

In the following analysis, scMitoMut was used to analyze the allele count data, which is the output of [mgatk](https://github.com/caleblareau/mgatk). 
Only a few loci have been selected for demonstration purposes to reduce file size and run time. 
The full dataset can be access from {Signac} [vignette](https://github.com/stuart-lab/signac/blob/1.9.0/vignettes/mito.Rmd).

# Loading data

We load the allele count table with the `parse_table` function. The allele count table consists with following columns:

1. `loc`: the locus name
2. `cell_barcode`: the cell barcode in single-cell sequencing data
3. `fwd_depth`: the forward read count of the allele
4. `rev_depth`: the reverse read count of the allele
5. `alt`: the allele name 
6. `coverage`: the read count covering the locus
7. `ref`: the reference allele name

Instead of using the table above as input, the output from `mgatk` can also be directly read using the `parse_mgatk` function.

Using the `parse_table` function or `parse_mgatk` function, the allele count data are read and saved into an `H5` file. 
The `H5` file works as a database, which does not occupy memory, and data can be randomly accessed by querying. 
It helps with better memory usage and faster loading.

The process may take some minutes. The return value is the `H5` file path.

```{r, eval=TRUE}
## Load the allele count table
f <- system.file("extdata", "mini_dataset.tsv.gz", package = "scMitoMut")
f_h5_tmp <- tempfile(fileext = ".h5")
f_h5 <- parse_table(f, h5_file = f_h5_tmp)
```

```{r, eval=TRUE}
f_h5
```

After obtaining the `H5` file, the `open_h5_file` function can be utilized to load it, resulting in an object referred to as "mtmutObj".

**Detail**: On this step, the `mtmutObj` has 6 slots
- `h5f` is the `H5` file handle
- `mut_table` is the allele count table
- `loc_list` is a list of available loci
- `loc_selected` is the selected loci
- `cell_list` is a list of available cell ids
- `cell_selected` is the selected cell ids

```{r}
## Open the h5 file as a scMitoMut object
x <- open_h5_file(f_h5)
str(x)

## Show what's in the h5 file
h5ls(x$h5f, recursive = FALSE)
```

# Selecting cells

We are only interested in annotated good-quality cells.

So we will select the cells with annotation, which are good quality cells.

```{r}
f <- system.file("extdata", "mini_dataset_cell_ann.csv", package = "scMitoMut")
cell_ann <- read.csv(f, row.names = 1)

## Subset the cells, the cell id can be found by colnames() for the Seurat object
x <- subset_cell(x, rownames(cell_ann))
```

After subsetting the cells, the `cell_selected` slot will be updated. 
Only the selected cells will be used in the following p-value calculation. 

```{r}
head(x$cell_selected)
```

Similarly, we can select loci by using the `subset_locus` function. It saves time when we only focus on a few loci.

# Calculating mutation p-value

We built an null-hypothesis that there are not lineage-related mutation for specific locus in all cells.
Then we fit the allele frequency distribution with beta-binomial distribution and calculate the probability of observing allele frequency for a specific locus in a cell.
If the probability is small, we can reject the null hypothesis and conclude that there is a mutation for that locus in the cell.

To calculate the probability value (p-value), we run `run_calling` function, which has 2 arguments: 
- `mtmutObj` is the `scMitoMut` object
- `mc.cores` is the number of CPU threads to be used

The process will take some time.
The output will be stored in the `pval` group of the `H5` file.
The result is stored in the hard drive, instead of in memory.
We don't need to re-run the mutation calling when loading the `H5` file next time. 

The mutation calling is based on beta-binomial distribution. 
The mutation p-value is the probability that with the null hypothesis: there are no mutations for that locus in the cell.

<!--Remove the following lines-->
**Detail**: For each locus, we calculate the p-value using the following steps. 
1. Defining the wild-type allele as the allele with the highest median allele frequency among cells. 
2. Fitting a 2 components binomial-mixture distribution as classifier to select the likely wild-type cells.
We define the likely wild-type cells if it has a probability >= 0.001 to be the wild type. 
3. Using those likely wild-type cells, we fit the beta-binomial model. 
4. At last, based on the model, we calculate the p-value of observing the allele frequency of the wild-type allele in specific cell. 

```{r, eval=TRUE}
## Run the model fitting
run_model_fit(x, mc.cores = 1)

## The p-value is kept in the pval group of H5 file
h5ls(x$h5f, recursive = FALSE)
```

# Filter mutations

Then we will filter the mutations by using the `mut_filter` function with the following criteria:
- The mutation has at least 5 cells mutant.
- The FDR (false discovery rate) adjusted p-value (mutation quality q-value) is less than 0.05.

The output is a `data.frame` with 2 columns
- `loc` is the locus
- `mut_cell_n` is the cell number

We can see that there are 12 loci after filtering.

**Detail**: The `mut_filter` function has 4 arguments: 
- `mtmutObj` is the `mtmutObj` object
- `min_cell` is the minimum number of mutant cells
- `p_adj_method` is the method used to adjust the p-value. 
- `p_threshold` is the adjusted p-value (q-value) threshold 

```{r}
## Filter mutation
x <- filter_loc(
  mtmutObj = x,
  min_cell = 2,
  model = "bb",
  p_threshold = 0.01,
  p_adj_method = "fdr"
)
x$loc_pass
```

# Visualization

We will visualize the mutation by heatmap using the `plot_heatmap` function. 
It can draw a heatmap of q-value, allele frequency, or binarized mutation status.
Its input is the `mtmutObj` object.
It will independently apply all the filters we used in the `mut_filter` function, and select the cells and loci that pass the filter criteria.
In all kinds of figures, the mutation status will be calculated, and the loci and cells are ordered by the mutation status.

<!--TODO: make p value to be q-value-->

**Detail**: The `plot_heatmap` arguments.
- `mtmutObj` is the `scMitoMut` object.
- `pos_list` is the list of loci.
- `cell_ann` is the cell annotation.
- `ann_colors` is the color of the cell annotation.
- `type` is the type of the heatmap which can be `p`, `af`, or `binary`.
- `p_adj_method` is the method used to adjust the p-value.
- `p_threshold` is the adjusted p-value threshold to determine if a cell has mutation when selecting the cells and loci.
- `min_cell_n` is the minimum number of cells that have mutation when selecting the cells and loci.
- `p_binary` is the adjusted p-value threshold to get the binary mutation status.
- `percent_interp` is the percentage overlap threshold between mutations, to determine if two mutations are correlated for interpolating the mutation status
- `n_interp` is the number of overlapped cells to determine if two mutations are correlated for interpolating.

The interpolation is based on the assumption that the mutation are unique, it is rare to have two mutation in the same population.
Therefore, when two mutations are correlated, one of them is likely a subclone of the other one.
The interpolation is utilized primarily for the purpose of ordering cells during visualization.

## Binary heatmap

The binary heatmap displays the mutation status of each cell corresponding to each locus.
The color red suggests the presence of a mutant, whereas blue indicates its absence, and white denotes a missing value.

```{r, fig.width = 12}
## Prepare the color for cell annotation
colors <- c(
  "Cancer Epi" = "#f28482",
  Blood = "#f6bd60"
)
ann_colors <- list("SeuratCellTypes" = colors)
```

```{r, fig.width = 12}
## binary heatmap
plot_heatmap(x,
  cell_ann = cell_ann, ann_colors = ann_colors, type = "binary",
  percent_interp = 0.2, n_interp = 3
)
```

Also we can turn off the interpolation by setting `percent_interp = 1`.

```{r, fig.width = 12}
## binary heatmap
plot_heatmap(x,
  cell_ann = cell_ann, ann_colors = ann_colors, type = "binary",
  percent_interp = 1, n_interp = 3
)
```

## P value heatmap

The p-value heatmap illustrates the adjusted p-values for each cell corresponding to each locus. 
The arrangement of the cells and loci is based on their binary mutation status.

```{r, fig.width = 12}
## p value heatmap
plot_heatmap(x,
  cell_ann = cell_ann, ann_colors = ann_colors, type = "p",
  percent_interp = 0.2, n_interp = 3
)
```

## AF heatmap

The allele frequency heatmap illustrates the allele frequency of each cell at each locus.
The order of the cells and loci are determined by the mutation status too.

```{r, fig.width = 12}
## allele frequency heatmap
plot_heatmap(x,
  cell_ann = cell_ann, ann_colors = ann_colors, type = "af",
  percent_interp = 0.2, n_interp = 3
)
```

# Exporting mutation data

We can export the mutation data by using the following functions:

- `export_df` export the mutation data as a `data.frame`
- `export_af`  export the AF data as a `data.matrix` with loci as row names and cells as column names.
- `export_pval`  export the p-value data as a `data.matrix` with loci as row names and cells as column names.
- `export_binary`  export the mutation status data as a `data.matrix` with loci as row names and cells as column names.

Those functions have the same filtering options as the `plot_heatmap` function.

```{r}
## Export the mutation data as data.frame
m_df <- export_df(x)
m_df[1:10, ]
## Export allele frequency data as data.matrix
export_af(x)[1:5, 1:5]
## Export p-value data as data.matrix
export_pval(x)[1:5, 1:5]
## Export binary mutation status data as data.matrix
export_binary(x)[1:5, 1:5]
```

# Show the p value, af plot versus cell types

Lastly, we try to show the distribution of p value and allele frequency value versus cell types.

```{r, fig.width = 5}
## The `m_df` is exported using the `export_df` function above.
m_dt <- data.table(m_df)
m_dt[, cell_type := cell_ann[as.character(m_dt$cell_barcode), "SeuratCellTypes"]]
m_dt_sub <- m_dt[loc == "chrM.1227"]
m_dt_sub[, sum((pval) < 0.01, na.rm = TRUE), by = cell_type]
m_dt_sub[, sum((1 - af) > 0.05, na.rm = TRUE), by = cell_type]

## The p value versus cell types
ggplot(m_dt_sub) +
  aes(x = cell_type, y = -log10(pval), color = cell_type) +
  geom_jitter() +
  scale_color_manual(values = colors) +
  theme_bw() +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  ylab("-log10(FDR)")

## The allele frequency versus cell types
ggplot(m_dt_sub) +
  aes(x = cell_type, y = 1 - af, color = factor(cell_type)) +
  geom_jitter() +
  scale_color_manual(values = colors) +
  theme_bw() +
  geom_hline(yintercept = 0.05, linetype = "dashed") +
  ylab("1 - Dominant Allele Frequency")

## The p value versus allele frequency
ggplot(m_dt_sub) +
    aes(x = -log10(pval), y = 1 - af, color = factor(cell_type)) +
    geom_point() +
    scale_color_manual(values = colors) +
    theme_bw() +
    geom_hline(yintercept = 0.05, linetype = "dashed") +
    geom_vline(xintercept = -log10(0.01), linetype = "dashed") +
    xlab("-log10(FDR)") +
    ylab("1 - Dominant Allele Frequency")
```

# Session Info

```{r}
sessionInfo()
```