---
title: "User Manual: IgGeneUsage"
author: "SK"
date: "Sep 12, 2023"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{User Manual: IgGeneUsage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(comment = "", 
                      warning = FALSE, 
                      message = FALSE)
```


```{r}
require(IgGeneUsage)
require(rstan)
require(knitr)
require(ggplot2)
require(ggforce)
require(ggrepel)
require(reshape2)
require(patchwork)
```


# Introduction
Decoding the properties of immune receptor repertoires (IRRs) is key to 
understanding how our adaptive immune system responds to challenges, such 
as viral infection or cancer. One important quantitative property of IRRs 
is their immunoglobulin (Ig) gene usage, i.e. how often the different 
Ig genes that make up the immune receptors are used in a given IRR. 
Furthermore, we may ask: is there differential gene usage (DGU) between IRRs 
from different biological conditions (e.g. healthy vs. tumor)?

Both of these questions can be answered quantitatively by 
`r Biocpkg("IgGeneUsage")`.


# Input
The main input of `r Biocpkg("IgGeneUsage")` is a data.frame that has the 
following columns:

  1. **individual_id**: name of the repertoire (e.g. Patient-1)
  2. **condition**: name of the condition to which each repertoire 
  belongs (healthy, tumor_A, tumor_B, ...)
  3. **gene_name**: gene name (e.g. IGHV1-10 or family TRBV1)
  4. **gene_usage_count**: numeric usage count of the gene (column 3) in the 
     repertoire (column 1) under the condition (column 2)
  5. [optional] **replicate**: character/numeric identifier that tags the
     different biological replicates if they are available for a specific
     individual
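
As a minimal sketch of this format, the following chunk builds a toy input 
table by hand. The object name `toy_ud`, the individual names and all counts 
are invented for illustration and are not part of the package data.

```{r}
# toy input: 4 individuals (2 per condition) x 3 genes; all values invented
toy_ud <- data.frame(
  individual_id = rep(c("Patient-1", "Patient-2", 
                        "Patient-3", "Patient-4"), each = 3),
  condition = rep(c("healthy", "tumor_A"), each = 6),
  gene_name = rep(c("IGHV1-2", "IGHV3-23", "IGHV4-34"), times = 4),
  gene_usage_count = c(120, 340, 15, 98, 410, 0, 
                       60, 300, 85, 75, 280, 110),
  stringsAsFactors = FALSE)

# one row per individual x gene; each individual belongs to one condition
str(toy_ud)
```

Genes that were not observed in a repertoire can be listed with a count of 0, 
as done for one gene of `Patient-2` in this toy table.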

# Model
`r Biocpkg("IgGeneUsage")` transforms the input data as follows.

First, given $R$ repertoires with $G$ genes each, `r Biocpkg("IgGeneUsage")` 
generates a gene usage matrix $Y^{R \times G}$. Row sums in $Y$ define the 
total usage ($N$) in each repertoire. 
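
The transformation can be illustrated with base R. This is only a sketch on 
invented counts (`toy_long`, `toy_Y` and `toy_N` are hypothetical names); how 
the package builds $Y$ internally is not shown here.

```{r}
# toy long-format usage data: 3 repertoires x 3 genes (invented counts)
toy_long <- data.frame(
  individual_id = rep(c("P1", "P2", "P3"), each = 3),
  gene_name = rep(c("G1", "G2", "G3"), times = 3),
  gene_usage_count = c(10, 30, 60, 5, 0, 45, 20, 20, 10))

# R x G gene usage matrix Y: rows = repertoires, columns = genes
toy_Y <- tapply(X = toy_long$gene_usage_count,
                INDEX = list(toy_long$individual_id, toy_long$gene_name),
                FUN = sum)

# total usage N of each repertoire = row sums of Y
toy_N <- rowSums(toy_Y)
toy_Y
toy_N
```

Gene and repertoire combinations that are absent from the long table would 
appear as `NA` in `toy_Y` and would have to be set to 0 before summing.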

Second, for the analysis of DGU between biological conditions, we use a 
Bayesian model ($M$) for zero-inflated beta-binomial regression. Empirically,
we know that Ig gene usage data can be noisy and not exhaustive, i.e. some 
Ig genes that are systematically rearranged at low probability might not be 
sampled, and certain Ig genes are not encoded (or dysfunctional) in some 
individuals. $M$ can fit over-dispersed and zero-inflated Ig gene usage data.
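
To make the terms over-dispersed and zero-inflated concrete, the sketch below 
simulates counts of a single gene from a zero-inflated beta-binomial 
distribution in base R. The parameter names and values (`mu`, `phi`, `kappa`) 
and the mean/precision parameterization are assumptions made for this 
illustration and are not necessarily those used inside $M$.

```{r}
set.seed(1)
n_rep <- 1000   # number of simulated repertoires
N_tot <- 500    # total usage per repertoire
mu <- 0.1       # assumed mean usage probability of the gene
phi <- 20       # assumed precision (smaller = more over-dispersion)
kappa <- 0.15   # assumed zero-inflation probability

# beta-binomial: repertoire-specific usage probability, then binomial count
theta_sim <- rbeta(n_rep, shape1 = mu * phi, shape2 = (1 - mu) * phi)
y_bb <- rbinom(n_rep, size = N_tot, prob = theta_sim)

# zero-inflation: with probability kappa the gene is not observed at all
is_zero <- rbinom(n_rep, size = 1, prob = kappa) == 1
y_zibb <- ifelse(is_zero, 0L, y_bb)

# reference: plain binomial counts with the same mean probability
y_bin <- rbinom(n_rep, size = N_tot, prob = mu)
c(var_binomial = var(y_bin), 
  var_zibb = var(y_zibb), 
  frac_zero_zibb = mean(y_zibb == 0))
```

The simulated ZIBB counts have a much larger variance and many more zeros 
than the binomial reference, which are the two features of Ig gene usage data 
described above.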

In the output of `r Biocpkg("IgGeneUsage")`, we report the mean effect 
size (es or $\gamma$) and its 95% highest density interval (HDI). Genes with 
$\gamma \neq 0$ (e.g. if 95% HDI of $\gamma$ excludes 0) are most likely 
to experience differential usage. Additionally, we report the probability of 
differential gene usage ($\pi$):
\begin{align}
\pi = 2 \cdot \max\left(\int_{\gamma = -\infty}^{0} p(\gamma)\mathrm{d}\gamma, 
\int_{\gamma = 0}^{\infty} p(\gamma)\mathrm{d}\gamma\right) - 1
\end{align}
with $\pi = 1$ for genes with strong differential usage, and $\pi = 0$ for 
genes with negligible differential gene usage. Both metrics are computed based
on the posterior distribution of $\gamma$, and are thus related. 
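
With posterior draws of $\gamma$ at hand, the two integrals can be 
approximated by the fractions of draws below and above 0. The sketch below 
uses normally distributed stand-in draws for two hypothetical genes; the 
helper `get_pi` is defined here for illustration and is not a function of the 
package.

```{r}
set.seed(2)
# stand-in posterior draws of gamma for two hypothetical genes
gamma_strong <- rnorm(4000, mean = 1.0, sd = 0.25)
gamma_null <- rnorm(4000, mean = 0.0, sd = 0.25)

# pi = 2 * max(P(gamma < 0), P(gamma > 0)) - 1, estimated from the draws
get_pi <- function(g) {
  2 * max(mean(g < 0), mean(g > 0)) - 1
}

c(pi_strong = get_pi(gamma_strong), pi_null = get_pi(gamma_null))
```

Draws that lie almost entirely on one side of 0 give $\pi$ close to 1, 
whereas draws centered on 0 give $\pi$ close to 0.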

# Case Study A: analyzing IRRs
`r Biocpkg("IgGeneUsage")` has a couple of built-in Ig gene usage datasets. 
Some were obtained from studies and others were simulated.

Let's look into the simulated dataset `d_zibb_3`. This dataset was generated 
by a zero-inflated beta-binomial (ZIBB) model, and `r Biocpkg("IgGeneUsage")` 
was designed to fit ZIBB-distributed data.


```{r}
data("d_zibb_3", package = "IgGeneUsage")
knitr::kable(head(d_zibb_3))
```


We can also visualize `d_zibb_3` with `r CRANpkg("ggplot2")`:

```{r, fig.width=6, fig.height=3.25}
ggplot(data = d_zibb_3)+
  geom_point(aes(x = gene_name, y = gene_usage_count, col = condition),
             position = position_dodge(width = .7), shape = 21)+
  theme_bw(base_size = 11)+
  ylab(label = "Gene usage [count]")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
```


## DGU analysis
The main input of `r Biocpkg("IgGeneUsage")` is a data.frame formatted like 
`d_zibb_3`. Other input parameters allow you to configure specific settings 
of the `r CRANpkg("rstan")` sampler.

In this example, we analyze `d_zibb_3` with 3 MCMC chains, 1500 iterations 
each including 300 warm-ups, using a single CPU core (hint: for parallel 
chain execution set parameter `mcmc_cores = 3`). We report for each model 
parameter its mean and 95% highest density interval (HDI).

**Important remark:** you should run DGU analyses using default 
`r Biocpkg("IgGeneUsage")` parameters. If warnings or errors are reported 
with regard to the MCMC sampling, please consult the Stan manual[^2] and 
adjust the inputs accordingly. If the warnings persist, please submit an 
issue with a reproducible script at the Bioconductor support site or on 
GitHub[^3].


```{r}
M <- DGU(ud = d_zibb_3, # input data
         mcmc_warmup = 300, # how many MCMC warm-ups per chain (default: 500)
         mcmc_steps = 1500, # how many MCMC steps per chain (default: 1,500)
         mcmc_chains = 3, # how many MCMC chains to run (default: 4)
         mcmc_cores = 1, # how many CPU cores to use (e.g. for parallel chains)
         hdi_lvl = 0.95, # highest density interval level (default: 0.95)
         adapt_delta = 0.8, # MCMC target acceptance rate (default: 0.95)
         max_treedepth = 10) # tree depth evaluated at each step (default: 12)
```

## Output format
In the output of DGU, we provide the following objects:

  * `dgu` and `dgu_prob` (main results of `r Biocpkg("IgGeneUsage")`): 
     quantitative DGU summary on a log- and probability-scale, respectively.
  * `gu`: condition-specific relative gene usage (GU) of each gene
  * `theta`: probabilities of gene usage in each sample
  * `ppc`: posterior predictive checks data (see section 'Model checking')
  * `ud`: processed Ig gene usage data 
  * `fit`: rstan ('stanfit') object of the fitted model $\rightarrow$ used 
     for model checks (see section 'Model checking')


```{r}
summary(M)
```


## Model checking
* **Check your model fit**. For this, you can use the objects `fit` and `ppc`.

  * Minimal checklist of successful MCMC sampling[^2]:
      * no divergences
      * no excessive warnings from rstan
      * Rhat < 1.05
      * high Neff
  * Minimal checklist for valid model:
      * posterior predictive checks (PPCs): is model consistent with reality, 
        i.e. is there overlap between simulated and observed data?
      * leave-one-out analysis

[^2]: https://mc-stan.org/misc/warnings.html
[^3]: https://github.com/snaketron/IgGeneUsage/issues
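
To give some intuition for the Rhat criterion, the sketch below computes the 
classic (non-split) Gelman-Rubin statistic for simulated chains in base R. 
The helper `basic_rhat` is written for this illustration only; Stan reports a 
more refined version of the statistic, so the values are not expected to 
match those of `r CRANpkg("rstan")` exactly.

```{r}
# classic Gelman-Rubin Rhat; draws: rows = iterations, columns = chains
basic_rhat <- function(draws) {
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))   # mean within-chain variance
  B <- n * var(colMeans(draws))     # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(3)
# 3 chains with 1000 draws each, all sampling the same distribution
mixed <- matrix(rnorm(3000), ncol = 3)
# same draws, but the third chain is shifted away from the other two
stuck <- mixed + rep(c(0, 0, 3), each = 1000)

c(rhat_mixed = basic_rhat(mixed), rhat_stuck = basic_rhat(stuck))
```

Chains that agree give a value close to 1, while a chain that explores a 
different region inflates the between-chain variance and pushes the value 
well above 1.05.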


### MCMC sampling

  * divergences, tree-depth, energy
  * none found
  
```{r}
rstan::check_hmc_diagnostics(M$fit)
```

  * rhat < 1.05 and n_eff > 0
  

```{r, fig.height = 3, fig.width = 6}
rstan::stan_rhat(object = M$fit)|rstan::stan_ess(object = M$fit)
```


## PPC: posterior predictive checks
### PPCs: repertoire-specific
The model used by `r Biocpkg("IgGeneUsage")` is generative, i.e. with the 
model we can generate the usage of each Ig gene in a given repertoire 
(y-axis). Error bars show the 95% HDI of the mean posterior prediction. The 
predictions can be compared with the observed data (x-axis). Points near the 
diagonal indicate accurate predictions.

```{r, fig.height = 4, fig.width = 7}
ggplot(data = M$ppc$ppc_rep)+
  facet_wrap(facets = ~individual_id, ncol = 5)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_count, y = ppc_mean_count, 
                    ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
  geom_point(aes(x = observed_count, y = ppc_mean_count), size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [counts]")+
  ylab(label = "PPC usage [counts]")
```


### PPCs: overall
Prediction of generalized gene usage within a biological condition is also 
possible. We show the model predictions (y-axis) for each gene (x-axis) and 
condition (color). Error bars are 95% HDIs of the mean.

```{r, fig.height = 3, fig.width = 5}
ggplot(data = M$ppc$ppc_condition)+
  geom_errorbar(aes(x = gene_name, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100, col = condition), 
                position = position_dodge(width = 0.65), width = 0.1)+
  geom_point(aes(x = gene_name, y = ppc_mean_prop*100,col = condition), 
                position = position_dodge(width = 0.65))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = '')+
  ylab(label = "PPC usage [%]")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
```


## Results
Each row of `dgu` summarizes the degree of DGU observed for a specific 
Ig gene. Two metrics are reported: 

  * `es` (also referred to as $\gamma$): effect size on DGU, where `contrast` 
     gives the direction of the effect (e.g. tumor - healthy or healthy - tumor)
  * `pmax` (also referred to as $\pi$): probability of DGU (parameter $\pi$ 
    from model $M$)
  
For `es` we also report the mean, median, standard error (se), standard 
deviation (sd), L (lower bound of the 95% HDI) and H (upper bound of the 
95% HDI).
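
As a rough illustration of how such summaries can be obtained from posterior 
draws, the sketch below computes the mean, median, standard deviation and the 
shortest interval that contains 95% of the draws. The helper `get_hdi` and 
the simulated draws are assumptions made for this example; the package may 
compute its HDIs differently.

```{r}
# shortest interval containing a fraction `lvl` of the draws
get_hdi <- function(g, lvl = 0.95) {
  g <- sort(g)
  n_in <- ceiling(lvl * length(g))    # number of draws inside the interval
  n_win <- length(g) - n_in + 1       # number of candidate intervals
  widths <- g[n_in:length(g)] - g[seq_len(n_win)]
  i <- which.min(widths)              # narrowest candidate
  c(L = g[i], H = g[i + n_in - 1])
}

set.seed(4)
es_draws <- rnorm(4000, mean = 0.8, sd = 0.3)  # stand-in draws of gamma

c(es_mean = mean(es_draws), 
  es_median = median(es_draws), 
  es_sd = sd(es_draws), 
  get_hdi(es_draws))
```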

```{r}
kable(x = head(M$dgu), row.names = FALSE, digits = 2)
```


### DGU: differential gene usage
We know that the values of $\gamma$ and $\pi$ are related to each other. 
Let's visualize them for all genes (each shown as a point). Names are shown 
for genes with $\pi \geq 0.95$. The dashed horizontal line represents the 
null effect ($\gamma = 0$). 

Notice that the gene with $\pi \approx 1$ also has an effect size whose 
95% HDI (error bar) does not overlap the null-effect. The genes with high 
degree of differential usage are easy to detect with this figure.
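
The two criteria can also be checked in tabular form. The sketch below uses a 
toy table with invented values that borrows the column names `gene_name`, 
`es_L`, `es_H` and `pmax` from `M$dgu`; the column `hdi_excludes_0` is added 
here for illustration.

```{r}
# toy DGU summary (invented values)
toy_dgu <- data.frame(
  gene_name = c("G1", "G2", "G3"),
  es_L = c(0.40, -0.30, -0.05),
  es_H = c(1.20, 0.25, 0.60),
  pmax = c(0.99, 0.10, 0.90))

# the HDI excludes the null effect if both bounds have the same sign
toy_dgu$hdi_excludes_0 <- toy_dgu$es_L > 0 | toy_dgu$es_H < 0

# genes that pass the probability threshold used in the figure
toy_dgu[toy_dgu$pmax >= 0.95, ]
```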

```{r, fig.height = 4, fig.width = 5}
# format data
stats <- M$dgu
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = unique(stats$gene_name))


ggplot(data = stats)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), 
                col = "darkgray")+
  geom_point(aes(x = pmax, y = es_mean, col = contrast))+
  geom_text_repel(data = stats[stats$pmax >= 0.95, ],
                  aes(x = pmax, y = es_mean, label = gene_fac),
                  min.segment.length = 0, size = 2.75)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = expression(pi))+
  xlim(c(0, 1))+
  ylab(expression(gamma))
```


### Promising hits
Let's visualize the observed data of the genes with high probability of 
differential gene usage ($\pi \geq 0.95$). Here we show the gene usage in %.

```{r, fig.height = 3, fig.width = 5}
promising_genes <- stats$gene_name[stats$pmax >= 0.95]

ppc_gene <- M$ppc$ppc_condition
ppc_gene <- ppc_gene[ppc_gene$gene_name %in% promising_genes, ]

ppc_rep <- M$ppc$ppc_rep
ppc_rep <- ppc_rep[ppc_rep$gene_name %in% promising_genes, ]


ggplot()+
  geom_point(data = ppc_rep,
             aes(x = gene_name, y = observed_prop*100, col = condition),
             size = 1, fill = "black",
             position = position_jitterdodge(jitter.width = 0.1, 
                                             jitter.height = 0, 
                                             dodge.width = 0.35))+
  geom_errorbar(data = ppc_gene, 
                aes(x = gene_name, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100, group = condition),
                position = position_dodge(width = 0.35), width = 0.15)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
  ylab(label = "PPC usage [%]")+
  xlab(label = '')
```


### Promising hits [count]
Let's also visualize the observed gene usage counts in each repertoire.

```{r, fig.height = 3, fig.width = 5}
ggplot()+
  geom_point(data = ppc_rep,
             aes(x = gene_name, y = observed_count, col = condition),
             size = 1, fill = "black",
             position = position_jitterdodge(jitter.width = 0.1, 
                                             jitter.height = 0, 
                                             dodge.width = 0.5))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "Observed usage [count]")+
  xlab(label = '')+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
```


## GU: gene usage summary
`r Biocpkg("IgGeneUsage")` also reports the inferred gene usage (GU) 
probability of individual genes in each condition. For a given gene we 
report its mean GU (`prob_mean`) and the 95% (for instance) HDI (`prob_L`
and `prob_H`).

```{r, fig.width=5, fig.height=4}
ggplot(data = M$gu)+
  geom_errorbar(aes(x = gene_name, y = prob_mean, ymin = prob_L,
                    ymax = prob_H, col = condition),
                width = 0.1, position = position_dodge(width = 0.4))+
  geom_point(aes(x = gene_name, y = prob_mean, col = condition), size = 1,
             position = position_dodge(width = 0.4))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "GU [probability]")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
```


# Leave-one-out (LOO) analysis
To assess the robustness of the probability of DGU ($\pi$) and the effect 
size ($\gamma$), `r Biocpkg("IgGeneUsage")` has a built-in procedure for 
fully Bayesian leave-one-out (LOO) analysis. 

During each step of LOO, we discard the data of one of the $R$ repertoires, 
and use the remaining data to analyze for DGU. In each step we record 
$\pi$ and $\gamma$ for all genes, including the mean and 95% HDI of 
$\gamma$. We assess the robustness of $\pi$ and $\gamma$ quantitatively 
by evaluating their variability for a specific gene. 

This analysis can be computationally demanding.
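
The data splitting behind LOO can be sketched in base R as follows. The toy 
table and the object names (`toy_loo`, `toy_ids`, `toy_subsets`) are invented 
for illustration; this is not the package's internal implementation, and no 
model is fitted here.

```{r}
# toy data: 4 repertoires x 2 genes (invented counts)
toy_loo <- data.frame(
  individual_id = rep(paste0("P", 1:4), each = 2),
  condition = rep(c("healthy", "tumor"), each = 4),
  gene_name = rep(c("G1", "G2"), times = 4),
  gene_usage_count = c(10, 90, 20, 80, 40, 60, 50, 50))

# one LOO step per repertoire: drop that repertoire, keep the rest
toy_ids <- unique(toy_loo$individual_id)
toy_subsets <- lapply(X = toy_ids, FUN = function(id) {
  toy_loo[toy_loo$individual_id != id, ]
})
names(toy_subsets) <- toy_ids

# each subset would then be analyzed for DGU separately
sapply(X = toy_subsets, FUN = nrow)
```

With $R$ repertoires this results in $R$ separate model fits, which explains 
the computational cost.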


```{r}
L <- LOO(ud = d_zibb_3, # input data
         mcmc_warmup = 500, # how many MCMC warm-ups per chain (default: 500)
         mcmc_steps = 1000, # how many MCMC steps per chain (default: 1,500)
         mcmc_chains = 1, # how many MCMC chains to run (default: 4)
         mcmc_cores = 1, # how many CPU cores to use (e.g. for parallel chains)
         hdi_lvl = 0.95, # highest density interval level (default: 0.95)
         adapt_delta = 0.8, # MCMC target acceptance rate (default: 0.95)
         max_treedepth = 10) # tree depth evaluated at each step (default: 12)
```


Next, we collect the results (GU and DGU) from each LOO iteration:

```{r}
L_gu <- do.call(rbind, lapply(X = L, FUN = function(x){return(x$gu)}))
L_dgu <- do.call(rbind, lapply(X = L, FUN = function(x){return(x$dgu)}))
```


... and plot them:

## LOO-DGU: variability of effect size $\gamma$

```{r, fig.width=6, fig.height=5}
ggplot(data = L_dgu)+
  facet_wrap(facets = ~contrast, ncol = 1)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = gene_name, y = es_mean, ymin = es_L,
                    ymax = es_H, col = contrast, group = loo_id),
                width = 0.1, position = position_dodge(width = 0.75))+
  geom_point(aes(x = gene_name, y = es_mean, col = contrast,
                 group = loo_id), size = 1,
             position = position_dodge(width = 0.75))+
  theme_bw(base_size = 11)+
  theme(legend.position = "none")+
  ylab(expression(gamma))
```


## LOO-DGU: variability of $\pi$

```{r, fig.width=6, fig.height=5}
ggplot(data = L_dgu)+
  facet_wrap(facets = ~contrast, ncol = 1)+
  geom_point(aes(x = gene_name, y = pmax, col = contrast,
                 group = loo_id), size = 1,
             position = position_dodge(width = 0.5))+
  theme_bw(base_size = 11)+
  theme(legend.position = "none")+
  ylab(expression(pi))
```


## LOO-GU: variability of the gene usage

```{r, fig.width=6, fig.height=4}
ggplot(data = L_gu)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = gene_name, y = prob_mean, ymin = prob_L,
                    ymax = prob_H, col = condition, 
                    group = interaction(loo_id, condition)),
                width = 0.1, position = position_dodge(width = 1))+
  geom_point(aes(x = gene_name, y = prob_mean, col = condition,
                 group = interaction(loo_id, condition)), size = 1,
             position = position_dodge(width = 1))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab("GU [probability]")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
```


# Case Study B: analyzing IRRs containing biological replicates

```{r}
data("d_zibb_4", package = "IgGeneUsage")
knitr::kable(head(d_zibb_4))
```


We can also visualize `d_zibb_4` with `r CRANpkg("ggplot2")`:

```{r, fig.width=6.5, fig.height=3.25}
ggplot(data = d_zibb_4)+
  geom_point(aes(x = gene_name, y = gene_usage_count, col = condition, 
                 shape = replicate), position = position_dodge(width = 0.8))+
  theme_bw(base_size = 11)+
  ylab(label = "Gene usage [count]")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
```


## Modeling

```{r}
M <- DGU(ud = d_zibb_4, # input data
         mcmc_warmup = 500, # how many MCMC warm-ups per chain (default: 500)
         mcmc_steps = 1500, # how many MCMC steps per chain (default: 1,500)
         mcmc_chains = 2, # how many MCMC chains to run (default: 4)
         mcmc_cores = 1, # how many CPU cores to use (e.g. for parallel chains)
         hdi_lvl = 0.95, # highest density interval level (default: 0.95)
         adapt_delta = 0.8, # MCMC target acceptance rate (default: 0.95)
         max_treedepth = 10) # tree depth evaluated at each step (default: 12)
```


## Posterior predictive checks

```{r, fig.height = 6, fig.width = 6}
ggplot(data = M$ppc$ppc_rep)+
  facet_wrap(facets = ~individual_id, ncol = 3)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_count, y = ppc_mean_count, 
                    ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
  geom_point(aes(x = observed_count, y = ppc_mean_count), size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [counts]")+
  ylab(label = "PPC usage [counts]")
```


## Analysis of estimated effect sizes
The top panel shows the average gene usage (GU) in different biological 
conditions. The bottom panels show the differential gene usage (DGU) 
between pairs of biological conditions.


```{r, fig.width = 7, fig.height = 4}
g1 <- ggplot(data = M$gu)+
  geom_errorbar(aes(x = gene_name, y = prob_mean, ymin = prob_L,
                    ymax = prob_H, col = condition),
                width = 0.1, position = position_dodge(width = 0.4))+
  geom_point(aes(x = gene_name, y = prob_mean, col = condition), size = 1,
             position = position_dodge(width = 0.4))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "GU [probability]")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))


stats <- M$dgu
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = unique(stats$gene_name))

g2 <- ggplot(data = stats)+
  facet_wrap(facets = ~contrast)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), 
                col = "darkgray")+
  geom_point(aes(x = pmax, y = es_mean, col = contrast))+
  geom_text_repel(data = stats[stats$pmax >= 0.95, ],
                  aes(x = pmax, y = es_mean, label = gene_fac),
                  min.segment.length = 0, size = 2.75)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = expression(pi))+
  xlim(c(0, 1))+
  ylab(expression(gamma))
```


```{r, fig.height = 6, fig.width = 7}
(g1/g2)
```


# Session

```{r}
sessionInfo()
```