---
title: "cellmig: Quantifying Cell Migration Velocity with 
Hierarchical Bayesian Models"
author: "Simo Kitanovski (simo.kitanovski@uni-due.de)"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{User Manual: cellmig}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(comment = FALSE, 
                      warning = FALSE, 
                      message = FALSE)
```

```{r}
library(cellmig)
library(ggplot2)
library(ggforce)
library(rstan)
ggplot2::theme_set(new = theme_bw(base_size = 10))
```

# Introduction

High-throughput time-lapse microscopy allows us to track thousands of
cells across many wells and plates. However, these experiments generate
complex data with multiple sources of variability:

1.  **Biological variability:** Cells on different experimental plates
    (biological replicates) behave differently even under the same
    conditions.

2.  **Technical variability:** Differences between wells on the same
    plate.

3.  **Batch effects:** Systematic differences between plates (e.g.,
    imaging days, reagent batches).

Standard statistical tests often ignore this hierarchical structure
(cells $\rightarrow$ wells $\rightarrow$ plates), which can lead to
false positives or missed discoveries.

The Bioconductor package `r Biocpkg("cellmig")` addresses this by using
**Bayesian hierarchical models**. It separates biological signals from
technical noise and provides uncertainty-aware estimates (credible
intervals) for treatment effects. This vignette guides you through
installing `cellmig`, formatting your data, fitting a model, and
interpreting the results.

# Installation

To install this package, start R (version "4.5") and enter:

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("cellmig")
```

# Data Structure

## Required Columns

| Column     | Description                            | Example                |
|:-----------------|:--------------------------------|:--------------------|
| `well`     | Unique well ID                         | "w1", "w2", ...        |
| `plate`    | Unique plate ID (Biological Replicate) | "p1", "p2", ...        |
| `compound` | Name of the compound/drug              | "DMSO", "DrugA", ...   |
| `dose`     | Concentration or level                 | "0", "10", "high", ... |
| `v`        | Observed migration velocity (numeric)  | 0.54 (µm/min)          |
| `offset`   | Batch correction flag (0 or 1)         | 0 or 1                 |

### Important: The `offset` Column

To correct for plate-to-plate batch effects, you must identify a
**control treatment** that appears on every plate (e.g., DMSO).

- Set `offset = 1` for cells in this control group.

- Set `offset = 0` for all other treatments.

- *Note:* The model uses this group to normalize velocities across
  plates

## Example Data

We will use the simulated dataset included in the package. It contains:

- **3 Plates** (Biological replicates)

- **378 Wells** (Technical replicates nested in plates)

- **6 Compounds** at **7 Doses** (42 Treatment Groups)

```{r load-data}
data("d", package = "cellmig")
str(d)
head(d)
```

# Exploratory Data Analysis

Before modeling, it is good practice to visualize the raw data to check
for obvious batch effects, outliers, multimodal distributions, or other
imaging- related effects.

## Raw Cell Velocities

Each dot represents a single cell. Facets show different compounds.
Colors indicate plates. If plate colors cluster separately within
facets, you likely have batch effects.

```{r, fig.width=7, fig.height=6}
ggplot(data = d)+
  facet_wrap(facets = ~paste0("compound=", compound), 
             scales = "free_y", ncol = 2)+
  geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), 
            size = 0.5)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  ylab(label = "migration velocity")+
  xlab(label = '')+
  scale_color_grey()+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  guides(shape = guide_legend(override.aes = list(size = 3)))+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```

## Mean Velocity per Well

Aggregating to the well level often makes batch effects clearer. Here,
we see the mean velocity per well.

```{r, fig.width=7, fig.height=6}
dm <- aggregate(v~well+plate+compound+dose, data = d, FUN = mean)
ggplot(data = dm)+
  facet_wrap(facets = ~paste0("compound=", compound), 
             scales = "free_y", ncol = 2)+
  geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), 
            size = 1.5, alpha = 0.7)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  ylab(label = "migration velocity")+
  xlab(label = '')+
  scale_color_grey()+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  guides(shape = guide_legend(override.aes = list(size = 3)))+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```

# Model Fitting

Now we fit the hierarchical Bayesian model. The `cellmig()` function
handles the complex Stan modeling behind the scenes.

## Control Parameters

You can adjust the MCMC (Markov Chain Monte Carlo) settings via the
`control` list.

- `mcmc_warmup`: Steps to tune the sampler.

- `mcmc_steps`: Actual sampling steps used for inference.

- `mcmc_chains`: Number of independent chains (recommend $\ge$ 2 for
  convergence checks).

*Note: For this vignette, we use fewer steps for speed. For real
analysis, increase `mcmc_steps` to 2000+.*

```{r fit-model, fig.width=7, fig.height=3.5}
o <- cellmig(x = d,
             control = list(mcmc_warmup = 300,  # Warmup iterations
                            mcmc_steps = 1000,  # Sampling iterations
                            mcmc_chains = 2,    # Number of chains
                            mcmc_cores = 2))    # Parallel cores
```

# Interpreting Results

The model output `o` contains posterior distributions for all
parameters. We focus on the **overall treatment effects** ($\delta_t$),
which represent the change in velocity relative to the control (offset).

## Overall Treatment Effects ($\delta_t$)

The `delta_t` data frame contains the mean, median, and 95% Highest
Density Intervals (HDI) for each treatment.

```{r view-delta}
knitr::kable(o$posteriors$delta_t, digits = 2)
```

### Visualizing Effects

- **Dot:** Posterior mean effect.
- **Error Bar:** 95% HDI (Uncertainty range).
- **Y-axis:** Log-fold change ($\delta$).
  - If the error bar **crosses 0**, there is no clear evidence of an
    effect.
  - **Positive values:** Increased migration.
  - **Negative values:** Decreased migration.

```{r plot-delta, fig.width=6, fig.height=4}
ggplot(data = o$posteriors$delta_t) +
  geom_line(aes(x = dose, y = mean, col = compound, group = compound)) +
  geom_point(aes(x = dose, y = mean, col = compound)) +
  geom_errorbar(aes(x = dose, y = mean, ymin = X2.5., ymax = X97.5., 
                    col = compound), width = 0.1) +
  ylab(label = expression("Log-Fold Change ("*delta*")")) +
  xlab("Dose") +
  theme(legend.position = "top")
```

## From Log-Fold-Change to Fold-Change

Log-scale values can be hard to interpret biologically. We often prefer
**Fold-Change**, where 1 = no effect, \>1 = increase, \<1 = decrease. We
calculate this by exponentiating the values ($\exp(\delta)$).

```{r plot-fold-change, fig.width=6, fig.height=4}
ggplot(data = o$posteriors$delta_t) +
  geom_line(aes(x = dose, y = exp(mean), col = compound, group = compound)) +
  geom_point(aes(x = dose, y = exp(mean), col = compound)) +
  geom_errorbar(aes(x = dose, y = exp(mean), ymin = exp(X2.5.), 
                    ymax = exp(X97.5.), col = compound), width = 0.1) +
  ylab(label = expression("Fold Change ("*delta*"')")) +
  xlab("Dose") +
  theme(legend.position = "top")
```

# Pairwise Comparisons

Often, you want to compare specific treatments against each other (e.g.,
Drug A vs. Drug B), not just against the control.

## Comparison Matrix

The `get_pairs()` function computes the difference between all treatment
groups.

- $\rho$: Log-fold change difference between Treatment X and Treatment
  Y.  
- $\pi$: Probability that the effect is truly different (0 to 1). Closer
  to 1 means strong evidence.

```{r}
# Get pairwise comparisons (log-scale)
u <- get_pairs(x = o, exponentiate = FALSE)
```

### Visualize $\rho$

```{r}
# vislualize matrix of rhos
u$plot_rho
```

### Visualize $\pi$

```{r}
# visualize matrix of pis
u$plot_pi
```

### Visualize $\rho$ vs. $\pi$ with a volcano plot

```{r}
# visualize volcano plot
u$plot_volcano
```

*Tip: Set `exponentiate = TRUE` to get fold-change ratios instead of
log-differences.*

## Violin Plots for Specific Comparisons

To inspect the distribution of differences between a specific target
group and all others, use `get_violins()`.

```{r get-groups}
# View available group labels
groups <- get_groups(x = o)
head(groups)
```

```{r plot-violins, fig.width=7, fig.height=3}
# Compare all groups against Compound 2, Dose 1
u_violin <- get_violins(x = o, 
                        from_groups = groups$group,
                        to_group = "C2|D1",
                        exponentiate = FALSE)
u_violin$plot
```

- **Violins:** Show the posterior distribution of the difference.
- **Label:** Probability ($\pi$) that the difference is non-zero.

# Model Diagnostics and Calibration

Reliable uncertainty quantification requires verifying that the model
fits the data well and that probability metrics ($\pi$) are calibrated
to your experimental design. `cellmig` provides tools for both.

## Posterior Predictive Checks (PPC)

PPCs compare observed data to data simulated from the fitted model. If
the model is well-specified, the simulated data (pink) should overlap
the observed data (black).

It is crucial to verify that the model fits your data well. `cellmig`
provides Posterior Predictive Checks (PPC).

**Interpretation:**

- **Good Fit**: Observed points fall within the pink violin densities;
  well-means cluster near the diagonal.

- **Bad Fit**: Systematic deviations suggest the model may not be valid
  and the outputs (parameter values) should not be trusted!

### Cell-Level Check

Do the simulated velocities (pink) match the observed velocities
(black)? If they overlap well, the model captures the data distribution.

```{r ppc-cell, fig.width=6, fig.height=9}
g <- get_ppc_violins(x = o, wrap = TRUE, ncol = 3)
g + scale_y_log10()
```

### Well-Level Check

Does the model predict the mean velocity per well accurately? Points
should lie close to the diagonal line.

```{r ppc-well, fig.width=5, fig.height=5}
g <- get_ppc_means(x = o)
g
```


## Leave-One-Out (LOO) diagnostics

To detect influential observations or model misspecification, cellmig 
supports Pareto k diagnostics via the rstan or loo packages:
https://mc-stan.org/loo/reference/pareto-k-diagnostic.html). 

High k values (>0.7) indicate highly influential observations that the model 
struggles to predict, e.g. outlying cell velocities generated by tracking or
segmentation artifacts or experimental errors. 

```{r}
# run loo
loo_o <- rstan::loo(x = o$fit)

# Print diagnostic table
print(loo_o)

# Plot diagnostic estimates for each cell
plot(loo_o)

# Which cells have k>0.7? Inspect these cells
which(loo_o$pointwise[,"influence_pareto_k"]>0.7)
```


## MCMC diagnostics
Bayesian uncertainty estimates (HDI, $\pi$) are only valid if the Markov 
Chain Monte Carlo (MCMC) sampler has successfully explored the posterior 
distribution. `cellmig` provides tools to verify convergence.

### Check $\hat{R}$ and ESS
The potential scale reduction factor ($\hat{R}$) should be close to 1.0 
(typically $< 1.01$) for all parameters. Additionally, the Effective Sample 
Size (ESS) should be sufficiently large to ensure stable HDI estimates.

```{r rhat, fig.width=3, fig.height=3}
rstan::stan_rhat(o$fit)
```


```{r mcmc-summary}
summary_stats <- summary(o$fit)$summary
hist(summary_stats[,"n_eff"])
```

### Check for divergent transitions and other issues

Divergent transitions indicate that the sampler encountered regions of 
high curvature it could not explore. Valid uncertainty quantification 
**requires 0 divergent transitions**.

```{r}
rstan::check_hmc_diagnostics(object = o$fit)
```

Is any warnings are produced by this check, please consult the guidelines
provided in the following vignette:

https://mc-stan.org/learn-stan/diagnostics-warnings.html


# Variance Components

You can inspect how much variability comes from different sources
(plates, wells, treatments). This helps in planning future experiments.

```{r plot-variance, fig.height=3, fig.width=7}
# Plate-specific baseline effects
g_alpha_p <- ggplot(data = o$posteriors$alpha_p) +
  geom_errorbarh(aes(y = plate, x = mean, xmin = X2.5., xmax = X97.5.),
                 height = 0.2) +
  geom_point(aes(y = plate, x = mean)) +
  xlab("Plate Effect (log-scale)")

# Variance parameters (Biological vs Technical)
g_sigma <- ggplot() +
  geom_errorbarh(data = o$posteriors$sigma_bio,
                 aes(y = "Biological (Plate)",
                     x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) +
  geom_errorbarh(data = o$posteriors$sigma_tech,
                 aes(y = "Technical (Well)",
                     x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) +
  geom_errorbarh(data = o$posteriors$sigma_delta,
                 aes(y = "Treatment Variation",
                     x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) +
  geom_point(data = o$posteriors$sigma_bio,
             aes(y = "Biological (Plate)", x = mean)) +
  geom_point(data = o$posteriors$sigma_tech,
             aes(y = "Technical (Well)", x = mean)) +
  geom_point(data = o$posteriors$sigma_delta,
             aes(y = "Treatment Variation", x = mean)) +
  xlab("Standard Deviation")

g_alpha_p | g_sigma
```

# Advanced: Dose-Response Profiles

For screens with multiple compounds and overlapping doses, `cellmig` can
cluster compounds based on their dose-response similarity, accounting
for uncertainty.

- **Left:** Dendrogram showing similarity between compounds.
- **Middle/Right:** Dose-response curves for overall ($\delta$) and
  plate-specific ($\gamma$) effects.

```{r dose-response, fig.width=8, fig.height=5}
get_dose_response_profile(x = o, exponentiate = TRUE) +
  patchwork::plot_layout(widths = c(.7, 1, 2))
```

# Session Info

```{r}
sessionInfo()
```