--- title: "cellmig: Quantifying Cell Migration Velocity with Hierarchical Bayesian Models" author: "Simo Kitanovski (simo.kitanovski@uni-due.de)" output: BiocStyle::html_document vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{User Manual: cellmig} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include = FALSE, warning = FALSE} knitr::opts_chunk$set(comment = FALSE, warning = FALSE, message = FALSE) ``` ```{r} library(cellmig) library(ggplot2) library(ggforce) library(rstan) ggplot2::theme_set(new = theme_bw(base_size = 10)) ``` # Introduction High-throughput time-lapse microscopy allows us to track thousands of cells across many wells and plates. However, these experiments generate complex data with multiple sources of variability: 1. **Biological variability:** Cells on different experimental plates (biological replicates) behave differently even under the same conditions. 2. **Technical variability:** Differences between wells on the same plate. 3. **Batch effects:** Systematic differences between plates (e.g., imaging days, reagent batches). Standard statistical tests often ignore this hierarchical structure (cells $\rightarrow$ wells $\rightarrow$ plates), which can lead to false positives or missed discoveries. The Bioconductor package `r Biocpkg("cellmig")` addresses this by using **Bayesian hierarchical models**. It separates biological signals from technical noise and provides uncertainty-aware estimates (credible intervals) for treatment effects. This vignette guides you through installing `cellmig`, formatting your data, fitting a model, and interpreting the results. # Installation To install this package, start R (version "4.5") and enter: ```{r, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("cellmig") ``` # Data Structure ## Required Columns | Column | Description | Example | |:-----------------|:--------------------------------|:--------------------| | `well` | Unique well ID | "w1", "w2", ... | | `plate` | Unique plate ID (Biological Replicate) | "p1", "p2", ... | | `compound` | Name of the compound/drug | "DMSO", "DrugA", ... | | `dose` | Concentration or level | "0", "10", "high", ... | | `v` | Observed migration velocity (numeric) | 0.54 (µm/min) | | `offset` | Batch correction flag (0 or 1) | 0 or 1 | ### Important: The `offset` Column To correct for plate-to-plate batch effects, you must identify a **control treatment** that appears on every plate (e.g., DMSO). - Set `offset = 1` for cells in this control group. - Set `offset = 0` for all other treatments. - *Note:* The model uses this group to normalize velocities across plates ## Example Data We will use the simulated dataset included in the package. It contains: - **3 Plates** (Biological replicates) - **378 Wells** (Technical replicates nested in plates) - **6 Compounds** at **7 Doses** (42 Treatment Groups) ```{r load-data} data("d", package = "cellmig") str(d) head(d) ``` # Exploratory Data Analysis Before modeling, it is good practice to visualize the raw data to check for obvious batch effects, outliers, multimodal distributions, or other imaging- related effects. ## Raw Cell Velocities Each dot represents a single cell. Facets show different compounds. Colors indicate plates. If plate colors cluster separately within facets, you likely have batch effects. ```{r, fig.width=7, fig.height=6} ggplot(data = d)+ facet_wrap(facets = ~paste0("compound=", compound), scales = "free_y", ncol = 2)+ geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), size = 0.5)+ theme_bw()+ theme(legend.position = "top", strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+ ylab(label = "migration velocity")+ xlab(label = '')+ scale_color_grey()+ guides(color = guide_legend(override.aes = list(size = 3)))+ guides(shape = guide_legend(override.aes = list(size = 3)))+ scale_y_log10()+ annotation_logticks(base = 10, sides = "l") ``` ## Mean Velocity per Well Aggregating to the well level often makes batch effects clearer. Here, we see the mean velocity per well. ```{r, fig.width=7, fig.height=6} dm <- aggregate(v~well+plate+compound+dose, data = d, FUN = mean) ggplot(data = dm)+ facet_wrap(facets = ~paste0("compound=", compound), scales = "free_y", ncol = 2)+ geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), size = 1.5, alpha = 0.7)+ theme_bw()+ theme(legend.position = "top", strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+ ylab(label = "migration velocity")+ xlab(label = '')+ scale_color_grey()+ guides(color = guide_legend(override.aes = list(size = 3)))+ guides(shape = guide_legend(override.aes = list(size = 3)))+ scale_y_log10()+ annotation_logticks(base = 10, sides = "l") ``` # Model Fitting Now we fit the hierarchical Bayesian model. The `cellmig()` function handles the complex Stan modeling behind the scenes. ## Control Parameters You can adjust the MCMC (Markov Chain Monte Carlo) settings via the `control` list. - `mcmc_warmup`: Steps to tune the sampler. - `mcmc_steps`: Actual sampling steps used for inference. - `mcmc_chains`: Number of independent chains (recommend $\ge$ 2 for convergence checks). *Note: For this vignette, we use fewer steps for speed. For real analysis, increase `mcmc_steps` to 2000+.* ```{r fit-model, fig.width=7, fig.height=3.5} o <- cellmig(x = d, control = list(mcmc_warmup = 300, # Warmup iterations mcmc_steps = 1000, # Sampling iterations mcmc_chains = 2, # Number of chains mcmc_cores = 2)) # Parallel cores ``` # Interpreting Results The model output `o` contains posterior distributions for all parameters. We focus on the **overall treatment effects** ($\delta_t$), which represent the change in velocity relative to the control (offset). ## Overall Treatment Effects ($\delta_t$) The `delta_t` data frame contains the mean, median, and 95% Highest Density Intervals (HDI) for each treatment. ```{r view-delta} knitr::kable(o$posteriors$delta_t, digits = 2) ``` ### Visualizing Effects - **Dot:** Posterior mean effect. - **Error Bar:** 95% HDI (Uncertainty range). - **Y-axis:** Log-fold change ($\delta$). - If the error bar **crosses 0**, there is no clear evidence of an effect. - **Positive values:** Increased migration. - **Negative values:** Decreased migration. ```{r plot-delta, fig.width=6, fig.height=4} ggplot(data = o$posteriors$delta_t) + geom_line(aes(x = dose, y = mean, col = compound, group = compound)) + geom_point(aes(x = dose, y = mean, col = compound)) + geom_errorbar(aes(x = dose, y = mean, ymin = X2.5., ymax = X97.5., col = compound), width = 0.1) + ylab(label = expression("Log-Fold Change ("*delta*")")) + xlab("Dose") + theme(legend.position = "top") ``` ## From Log-Fold-Change to Fold-Change Log-scale values can be hard to interpret biologically. We often prefer **Fold-Change**, where 1 = no effect, \>1 = increase, \<1 = decrease. We calculate this by exponentiating the values ($\exp(\delta)$). ```{r plot-fold-change, fig.width=6, fig.height=4} ggplot(data = o$posteriors$delta_t) + geom_line(aes(x = dose, y = exp(mean), col = compound, group = compound)) + geom_point(aes(x = dose, y = exp(mean), col = compound)) + geom_errorbar(aes(x = dose, y = exp(mean), ymin = exp(X2.5.), ymax = exp(X97.5.), col = compound), width = 0.1) + ylab(label = expression("Fold Change ("*delta*"')")) + xlab("Dose") + theme(legend.position = "top") ``` # Pairwise Comparisons Often, you want to compare specific treatments against each other (e.g., Drug A vs. Drug B), not just against the control. ## Comparison Matrix The `get_pairs()` function computes the difference between all treatment groups. - $\rho$: Log-fold change difference between Treatment X and Treatment Y. - $\pi$: Probability that the effect is truly different (0 to 1). Closer to 1 means strong evidence. ```{r} # Get pairwise comparisons (log-scale) u <- get_pairs(x = o, exponentiate = FALSE) ``` ### Visualize $\rho$ ```{r} # vislualize matrix of rhos u$plot_rho ``` ### Visualize $\pi$ ```{r} # visualize matrix of pis u$plot_pi ``` ### Visualize $\rho$ vs. $\pi$ with a volcano plot ```{r} # visualize volcano plot u$plot_volcano ``` *Tip: Set `exponentiate = TRUE` to get fold-change ratios instead of log-differences.* ## Violin Plots for Specific Comparisons To inspect the distribution of differences between a specific target group and all others, use `get_violins()`. ```{r get-groups} # View available group labels groups <- get_groups(x = o) head(groups) ``` ```{r plot-violins, fig.width=7, fig.height=3} # Compare all groups against Compound 2, Dose 1 u_violin <- get_violins(x = o, from_groups = groups$group, to_group = "C2|D1", exponentiate = FALSE) u_violin$plot ``` - **Violins:** Show the posterior distribution of the difference. - **Label:** Probability ($\pi$) that the difference is non-zero. # Model Diagnostics and Calibration Reliable uncertainty quantification requires verifying that the model fits the data well and that probability metrics ($\pi$) are calibrated to your experimental design. `cellmig` provides tools for both. ## Posterior Predictive Checks (PPC) PPCs compare observed data to data simulated from the fitted model. If the model is well-specified, the simulated data (pink) should overlap the observed data (black). It is crucial to verify that the model fits your data well. `cellmig` provides Posterior Predictive Checks (PPC). **Interpretation:** - **Good Fit**: Observed points fall within the pink violin densities; well-means cluster near the diagonal. - **Bad Fit**: Systematic deviations suggest the model may not be valid and the outputs (parameter values) should not be trusted! ### Cell-Level Check Do the simulated velocities (pink) match the observed velocities (black)? If they overlap well, the model captures the data distribution. ```{r ppc-cell, fig.width=6, fig.height=9} g <- get_ppc_violins(x = o, wrap = TRUE, ncol = 3) g + scale_y_log10() ``` ### Well-Level Check Does the model predict the mean velocity per well accurately? Points should lie close to the diagonal line. ```{r ppc-well, fig.width=5, fig.height=5} g <- get_ppc_means(x = o) g ``` ## Leave-One-Out (LOO) diagnostics To detect influential observations or model misspecification, cellmig supports Pareto k diagnostics via the rstan or loo packages: https://mc-stan.org/loo/reference/pareto-k-diagnostic.html). High k values (>0.7) indicate highly influential observations that the model struggles to predict, e.g. outlying cell velocities generated by tracking or segmentation artifacts or experimental errors. ```{r} # run loo loo_o <- rstan::loo(x = o$fit) # Print diagnostic table print(loo_o) # Plot diagnostic estimates for each cell plot(loo_o) # Which cells have k>0.7? Inspect these cells which(loo_o$pointwise[,"influence_pareto_k"]>0.7) ``` ## MCMC diagnostics Bayesian uncertainty estimates (HDI, $\pi$) are only valid if the Markov Chain Monte Carlo (MCMC) sampler has successfully explored the posterior distribution. `cellmig` provides tools to verify convergence. ### Check $\hat{R}$ and ESS The potential scale reduction factor ($\hat{R}$) should be close to 1.0 (typically $< 1.01$) for all parameters. Additionally, the Effective Sample Size (ESS) should be sufficiently large to ensure stable HDI estimates. ```{r rhat, fig.width=3, fig.height=3} rstan::stan_rhat(o$fit) ``` ```{r mcmc-summary} summary_stats <- summary(o$fit)$summary hist(summary_stats[,"n_eff"]) ``` ### Check for divergent transitions and other issues Divergent transitions indicate that the sampler encountered regions of high curvature it could not explore. Valid uncertainty quantification **requires 0 divergent transitions**. ```{r} rstan::check_hmc_diagnostics(object = o$fit) ``` Is any warnings are produced by this check, please consult the guidelines provided in the following vignette: https://mc-stan.org/learn-stan/diagnostics-warnings.html # Variance Components You can inspect how much variability comes from different sources (plates, wells, treatments). This helps in planning future experiments. ```{r plot-variance, fig.height=3, fig.width=7} # Plate-specific baseline effects g_alpha_p <- ggplot(data = o$posteriors$alpha_p) + geom_errorbarh(aes(y = plate, x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_point(aes(y = plate, x = mean)) + xlab("Plate Effect (log-scale)") # Variance parameters (Biological vs Technical) g_sigma <- ggplot() + geom_errorbarh(data = o$posteriors$sigma_bio, aes(y = "Biological (Plate)", x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_errorbarh(data = o$posteriors$sigma_tech, aes(y = "Technical (Well)", x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_errorbarh(data = o$posteriors$sigma_delta, aes(y = "Treatment Variation", x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_point(data = o$posteriors$sigma_bio, aes(y = "Biological (Plate)", x = mean)) + geom_point(data = o$posteriors$sigma_tech, aes(y = "Technical (Well)", x = mean)) + geom_point(data = o$posteriors$sigma_delta, aes(y = "Treatment Variation", x = mean)) + xlab("Standard Deviation") g_alpha_p | g_sigma ``` # Advanced: Dose-Response Profiles For screens with multiple compounds and overlapping doses, `cellmig` can cluster compounds based on their dose-response similarity, accounting for uncertainty. - **Left:** Dendrogram showing similarity between compounds. - **Middle/Right:** Dose-response curves for overall ($\delta$) and plate-specific ($\gamma$) effects. ```{r dose-response, fig.width=8, fig.height=5} get_dose_response_profile(x = o, exponentiate = TRUE) + patchwork::plot_layout(widths = c(.7, 1, 2)) ``` # Session Info ```{r} sessionInfo() ```