---
title: "A quick start guide to smartid: Scoring and MARker selection method based on modified Tf-IDf"
author:
  - name: Jinjin Chen
    affiliation:
      - Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia
      - Department of Medical Biology, University of Melbourne, Parkville, VIC 3010, Australia
    email: chen.j@wehi.edu.au
date: "`r format(Sys.time(), '%d %b %Y')`"
output: 
  BiocStyle::html_document:
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{smartid: Scoring and MARker selection method based on modified Tf-IDf}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 5
)
```

# Introduction

------------------------------------------------------------------------

**smartid**

`smartid` is a package that enables automated selection of group specific signature genes, especially for rare population. This package is developed for generating lists of specific signature genes based on **Term Frequency-Inverse Document Frequency** (TF-IDF)  modified methods and **expectation maximization** (EM) for labeled data. It can also be used as a new gene-set scoring method or data transformation method for un-labeled data. Multiple visualization functions are implemented in this package.

# Installation

------------------------------------------------------------------------

`smartid` R package can be installed from Bioconductor or [GitHub](https://github.com/DavisLaboratory/smartid).

The most updated version of `smartid` is hosted on GitHub and can be installed using `devtools::install_github()` function provided by [devtools](https://cran.r-project.org/package=devtools).

```{r installation, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("smartid", quietly = TRUE)) {
  BiocManager::install("smartid")
}
```

# Prepare Data

------------------------------------------------------------------------

To show a quick start guide of `smartid`, here we use package `r BiocStyle::Biocpkg("splatter")` to simulate a scRNA-seq data of 1000 genes * 3000 cells. This data consists of 4 groups, each has 2% DEGs except Group 4, which has no DEG as a negative control group.

```{r set up, message=FALSE}
library(smartid)
library(SummarizedExperiment)
library(splatter)
library(ggplot2)
library(scater)

## set seed for reproducibility
set.seed(123)
sim_params <- newSplatParams(
  nGenes = 1000,
  batchCells = 3000,
  group.prob = seq(0.1, 0.4, length.out = 4),
  de.prob = c(0.02, 0.02, 0.02, 0),
  # de.downProb = 0,
  de.facLoc = 0.5,
  de.facScale = 0.4
)

data_sim <- splatSimulate(sim_params, method = "groups")

## get up markers based on fold change
fc <- 1
cols <- paste0("DEFacGroup", seq_along(unique(data_sim$Group)))
defac <- as.data.frame(rowData(data_sim)[, cols])
up <- lapply(cols, \(id)
dplyr::filter(defac, if_all(-!!sym(id), \(x) !!sym(id) / x > fc)) |>
  rownames())
slot(data_sim, "metadata")$up_markers <- setNames(up, cols)
slot(data_sim, "metadata")$up_markers
data_sim
```

# Labeled Data

------------------------------------------------------------------------

`smartid` can be easily used to accurately identify specific marker genes on labeled data. By adapting and modifying TF-IDF approach, `smartid` shows robust power in finding marker genes, especially for rare population which many methods fail in.

**marker identification of `smartid` includes 3 key steps:**

  step 1. score samples  
  step 2. scale and transform scores  
  step 3. identify markers using expectation maximization (EM)  

## Score Samples

The first step is to score all samples/cells by using specified approach. The score can be composed of 3 terms: TF (term/feature frequency), IDF (inverse document/cell frequency) and IAE (inverse average expression of features). Each term has a couple of available choices with different formats to suit labeled or un-labeled data. Users can use function `idf_iae_methods()` to see available methods for IDF/IAE term. More details of each term can be seen in help page of each function, e.g. `?idf`.

```{r}
## show available methods
idf_iae_methods()
```

The basic version of TF, IDF and IAE can be termed as:

$\mathbf{TF_{i,j}}=\frac{N_{i,j}}{\sum_j{N_{i,j}}},$
$\mathbf{IDF_i} = \log(1+\frac{n}{n_i+1}),$
$\mathbf{IAE_i} = \log(1+\frac{n}{\sum_j^n\hat N_{i,j}+1})$

Where $N_{i,j}$ is the counts of feature $i$ in cell $j$; $\hat N_{i,j}$ is $\max(0,N_{i,j}-\mathrm{threshold})$;
$n$ is the total number of documents(cells); $n_i$ is $\sum_{j = 1}^{n} \mathrm{sign}(N_{i,j} > \mathrm{threshold})$.

Here for labeled data, we can choose logTF * IDF_prob * IAE_prob for marker identification:
$$\mathbf{score}=\log \mathbf{TF}*\mathbf{IDF}_{prob}*\mathbf{IAE}_{prob}$$

The probability version of IDF can be termed as:
$$\mathbf{IDF_{i,j}} = \log(1+\frac{\frac{n_{i,j\in D}}{n_{j\in D}}}{\max(\frac{n_{i,j\in \hat D}}{n_{j\in \hat D}})+ e^{-8}}\frac{n_{i,j\in D}}{n_{j\in D}})$$

And the probability version of IAE can be termed as:
$$\mathbf{IAE_{i,j}} = \log(1+\frac{\mathrm{mean}(\hat N_{i,j\in D})}{\max(\mathrm{mean}(\hat N_{i,j\in \hat D}))+ e^{-8}}*\mathrm{mean}(\hat N_{i,j\in D}))$$

Where $D$ is the category of cell $j$; $\hat D$ is the category other than $D$.

TF here stands for gene frequency, which is similar to CPM, while IDF represents the inverse cell/sample frequency for scRNA-seq data, and IAE is the inverse average expression of each gene across all cells or cells in each labeled group.

Another advantage of `smartid` is that it can start with raw counts data, with no need for pre-processed data. And the scoring is quite fast.

```{r}
## compute score
system.time(
  data_sim <- cal_score(
    data_sim,
    tf = "logtf",
    idf = "prob",
    iae = "prob",
    par.idf = list(label = "Group"),
    par.iae = list(label = "Group")
  )
)

## score and tf,idf,iae all saved
assays(data_sim)
names(metadata(data_sim))
```

## Scale and Transform Score

Scaling is needed to find the markers specific to the group, however, standard scaling might fail due to the rare populations. Here `smartid` uses a special scaling strategy `scale_mgm()`, which can scale imbalanced data by given group labels. By doing this, we can avoid the bias towards features with larger numerical ranges during feature selection.

The scale method is depicted as below:

$$z=\frac{x-\frac{\sum_k^{n_D}(\mu_k)}{n_D}}{sd}$$

The score will be transformed using softmax before passing to EM algorithm.

```{r}
top_m <- top_markers(
  data = data_sim,
  label = "Group",
  n = Inf # set Inf to get all features processed score
)
top_m
```
The top n features for each group will be ordered and listed in `top_m`. `smartid` provides easy-to-use functions to visualize top feature scores in each group and compare with actual up-regulated DEGs.

It's clear that the real UP DEGs are popping up to the top n features. And for the negative control "Group 4", the shape of top feature score is totally different from the ones with DEGs, which can provide more insights to help understand the data.

```{r}
score_barplot(
  top_markers = top_m,
  column = ".dot",
  f_list = slot(data_sim, "metadata")$up_markers,
  n = 20
)
```

As we can see, there is an UP DEG 'Gene76' not popping up in Group 2, we can check the relative expression of this gene using violin plot. It is clear that this gene is not significantly highly expressed in Group 2 and the average expression is quite low across all cells.

This can also be confirmed in data simulation information, where the scale factor is higher in Group2, but the GeneMean is too small to be confident. Thus this gene won't be selected by `smartid`.

```{r}
sin_score_boxplot(
  metadata(data_sim)$tf,
  features = "Gene76",
  ref.group = "Group2",
  label = data_sim$Group
)

## sim gene info
SummarizedExperiment::elementMetadata(data_sim)[76, ]
```

## Marker Selection

As we can see from above, there is a distinctly different distribution of feature score between group with DEGs and without DEGs. And there is a clear separation (break point) between the real DEGs and non-DEGs.

To help automatically select real markers for each group, `smartid` used an expectation maximization (EM) approach to identify which genes fall into the real DEGs distribution and which are not.

Regarding the distribution of scores as a mixture model, here we can choose function `markers_mixmdl()` in `smartid` to separate features. There are 2 available mixture model to choose: normal (Gaussian) or gamma. We choose "norm" here as it runs faster.

`smartid` also allows to plot the mixture distribution plot after EM. It's obvious that the top 2 components of Group 4 share quite similar distribution, thus no markers will be selected for this group.

```{r}
set.seed(123)
marker_ls <- markers_mixmdl(
  top_markers = top_m,
  column = ".dot",
  ratio = 2,
  dist = "norm",
  plot = TRUE
)
marker_ls
```

We can also compare our selected markers with real DEGs. As there is no markers or DEG in group 4, only show overlap from Group1-3. It's clear that all the markers identified by `smartid` are real DEGs, with a couple of missing genes in Group 1 and 2. But as what we showed above, those genes only exhibit low mean expression across all cells, thus not confident enough to be selected as markers.

```{r}
library(UpSetR)

upset(fromList(c(slot(data_sim, "metadata")$up_markers, marker_ls)), nsets = 6)
```

`smartid` also provides some other implementation of marker selection. Here is another example using `mclust`. Different from `markers_mixmdl()`, `markers_mclust()` doesn't need a pre-defined number of components (which is default 3 in `markers_mixmdl()`), instead, it will select the number of components by searching a series of potential numbers. This method is sometimes more robust than `markers_mixmdl()` but will also take longer running time.

Similarly, this method also allows to plot the mixture distribution for each component, but separately.


```{r}
set.seed(123)
marker_ls_new <- markers_mclust(
  top_markers = top_m,
  column = ".dot",
  method = "max.one",
  plot = TRUE
)
names(marker_ls_new) <- paste(names(marker_ls_new), "new")
```

We can compare the marker list with the previous one. The overlap result shows 2 methods can achieve almost the same marker list for each group.

```{r}
upset(fromList(c(marker_ls, marker_ls_new)), nsets = 6)
```


# Un-labeled Data

------------------------------------------------------------------------

While for the unlabeled data, `smartid` also provides the score methods with no need for label information.

Here we choose logTF * IDF_sd * IAE_sd for for gene-set scoring as a use case:
$$\mathbf{score}=\log \mathbf{TF}*\mathbf{IDF}_{sd}*\mathbf{IAE}_{sd}$$

Where IDF and IAE can be termed as:
$$\mathbf{IDF_i} = \log(1+\mathrm{SD}(\mathbf{TF}_{i})*\frac{n}{n_i+1})$$
$$\mathbf{IAE_i} = \log(1+\mathrm{SD}(\mathbf{TF}_{i})*\frac{n}{\sum_{j=1}^{n}\hat N_{i,j}+1})$$

## Score Samples

Similarly, the first step is to score samples/cells using the specified method. This step also starts with raw counts data, without need for data pre-processing, which is quite convenient and fast.

```{r}
## compute score without label
system.time(
  data_sim <- cal_score(
    data_sim,
    tf = "logtf",
    idf = "sd",
    iae = "sd",
    new.slot = "score_unlabel"
  )
)

## new score is saved and tf,idf,iae all updated
assays(data_sim)
names(metadata(data_sim))
```

## Compute Overall Score for Gene-set

To compare overall score of the given gene-set, we don't need to scale and transform score this time. Using `gs_score()` can easily compute the overall score for each cell based on the given gene-set list.

```{r}
## compute score for each group marker list
data_sim <- gs_score(
  data = data_sim,
  features = marker_ls[1:3], # group 4 has no markers
  slot = "score_unlabel",
  suffix = "score.unlabel" # specify the suffix of names to save
)

## saved score
colnames(colData(data_sim))
```

Now we get 3 columns of score for each group markers. We can then visualize the score across groups, see how well it can discern the target group.

It's evident that the score can sufficiently separate the target group from all others.

```{r, fig.width=10, fig.height=3}
as.data.frame(colData(data_sim)) |>
  tidyr::pivot_longer("Group1.score.unlabel":"Group3.score.unlabel",
    names_to = "group markers",
    values_to = "score"
  ) |>
  ggplot(aes(x = Group, y = score, fill = Group)) +
  geom_boxplot() +
  facet_wrap(~`group markers`, scales = "free") +
  theme_bw()
```

# SessionInfo

```{r}
sessionInfo()
```