---
title: "Modeling the origin of mutations identified in a liquid biopsy: cancer or clonal hematopoiesis?"
author: "Adith S. Arun and Robert B. Scharpf"
date: "`r format(Sys.Date())`"
output: BiocStyle::html_document
abstract: >
  Mutation-based approaches for detection of cancer from cell-free DNA (cfDNA) using liquid biopsies have the potential to track a patient's response to treatment, enabling effective and timely decisions on therapy.  However, mutations arising from clonal hematopoeisis (CH) are common and tumor biopsies for definitive identification of the origin of these mutations is not always available. Sequencing of matched cells from buffy coat and the absence of mutations in these cells has been used as a test to rule-out CH, but uneven sequencing depths between matched cell-free DNA and buffy coat samples and the potential for contamination of buffy coat with circulating tumor cells (CTCs) are not captured by rule-based analyses.  This package estimates Bayes factors that weigh the evidence of competing CH- and tumor-origin models of cfDNA mutations detected in cfDNA, requiring only the allele frequencies of high quality alignments available from standard mutation callers.
vignette: >
  %\VignetteIndexEntry{Modeling the origin of mutations in a liquid biopsy: cancer or clonal hematopoiesis?}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography:  arun2023.bib
---

# Motivating example

Assume mutation analyses identify four fragments with a *TP53* variant out of 1000 fragments overlapping that position.
In matched WBC sequencing, we observed no fragments with this mutation out of 600 distinct fragments spanning this position.
While *TP53* is a well-known tumor suppressor, mutations in *TP53* are also common in CH.
Given the mutant allele frequencies in cfDNA and matched buffy coat sequencing and prior studies, how strong is the evidence that the *TP53* mutation is tumor-derived?

# Installation

Install the plasmut package from Bioconductor

```{r installation, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("plasmut")
```

# Data organization

```{r setup, message=FALSE}
library(magrittr)
library(tidyverse)
library(plasmut)
knitr::opts_chunk$set(cache=TRUE)
lo <- function(p) log(p/(1-p))
```

We assume the following minimal representation for the mutation data such that each row uniquely identifies a mutation within a sample.

```{r data_wrangling}
## sample data
p53 <- tibble(y=c(4, 0),
              n=c(1000, 600),
              analyte=c("plasma", "buffy coat"),
              mutation="TP53",
              sample_id="id1")
dat <- p53 %>%
    unite("uid", c(sample_id, mutation), remove=FALSE) %>%
    group_by(uid) %>%
    nest()
## required format for plasmut
dat
```


# Approach and implementation

## Bayesian model

Let ($y_p$, $n_p$) denote the number of mutant reads and total number of distinct reads in plasma and ($y_w$, $n_w$) the corresponding frequencies from the buffy coat.
Assuming that the mutation is either tumor- or CH-derived, the posterior odds is given by
$$\frac{p(S | y_p, y_w, n_p, n_w)}{p(H | y_p, y_w, n_p, n_w)} = \frac{p(y_w, y_p | n_p, n_w, S)}{p(y_w, y_p | n_p, n_w, H)} \cdot \frac{p(S)}{p(H)}, $$
where *S* indicates the somatic (tumor-derived) model and H denotes the hematopietic (CH-derived) model.
The term $\frac{p(y_w, y_p | n_p, n_w, S)}{p(y_w, y_p | n_p, n_w, H)}$ is the Bayes factor and is a ratio of the marginal likelihoods.
For the denominator, we assume that the unobserved true mutant allele frequency ($\theta$) in plasma is the same as the mutant allele frequency in buffy coat and rewrite the marginal likelihood as $\int_{\theta} p(y_p | \theta, n_p) p(y_w | \theta,  n_w) p(\theta | H)d\theta$.
We suggest a diffuse prior for $\theta$ to provide support for both rare and common CH mutations. 

For model *S*, the marginal likelihood is factored as $\int_{\theta_p} p(y_p | n_p, \theta_p) p(\theta_p | S)d\theta_p\int_{\theta_w} p(y_w | n_w, \theta_w) p(\theta_w | S) d\theta_w$.
A diffuse prior for $\theta_p$ allows support for both rare and common somatic mutations.
Under model $S$, $\theta_W$ is $> 0$ if circulating tumor cells (CTCs) are inter-mixed with white blood cells in the buffy coat.
As CTCs tend to be uncommon even in patients with late-stage disease, we suggest a prior distribution that concentrates most of the mass near zero (i.e., Beta(1, $10^3$)).

## Implementation

We can compute marginal likelihoods for model $S$ and $H$ by simply drawing Monte Carlo samples from the priors and approximating the above integrals by the mean. 
However, when the likelihood is much more concentrated than the prior, this sampling approach can be inefficient and numerically unstable. 
To mitigate these issues, we implemented an importance sampling approach where the target distribution for the importance sampler is a weighted average of the prior and posterior. 
The resulting mixture distribution has a shape similar to the posterior but with fatter tails.
As the target distribution ensures that we sample values of $\theta$ where the posterior has a positive likelihood, approximation of the marginal likelihoods is more accurate with fewer Monte Carlo simulations.

In the following code block, we assume that
(1) the probability that CTCs are mixed in with the WBCs is small (e.g., 1 CTC per 1,000 cells) using a Beta(1, 10^3) prior,
(2) the prior for $\theta_p$ is relatively diffuse (Beta(1, 9)), and
(3) the prior for germline or CHIP variants in WBCs is also relatively diffuse (Beta(1, 9)).
A mixture weight of 0.1 for the prior (`prior.weight`) concentrates most of the mass of the target distribution on the posterior:
```{r parameters}
## Parameters
param.list <- list(ctc=list(a=1, b=10e3),
                   ctdna=list(a=1, b=9),
                   chip=list(a=1, b=9),
                   montecarlo.samples=50e3,
                   prior.weight=0.1)
```

Next, we estimate the marginal likelihood for the mutation frequencies under the $S$ and $H$ models and return all Monte Carlo samples in a list.
For running this model on datasets with a large number of candidate tumor-derived mutations, we recommend saving only the marginal likelihoods and Bayes factors by setting  `save_montecarlo=FALSE`.

```{r montecarlo}
stats <- importance_sampler(dat$data[[1]], param.list)
## Just the mutation-level summary statistics (marginal likelihoods and bayes factors)
importance_sampler(dat$data[[1]], param.list, save_montecarlo=FALSE)
```

We view the plasma MAF of 4/1000 and buffy coat MAF of 0/600 as weak evidence that the mutation is tumor derived (Bayes factor  = `r round(exp(stats$bayesfactor$bayesfactor), 2)`).
As previous studies have demonstrated that *TP53* mutations are common in CH, our prior odds is 1 and so the posterior odds for the tumor-origin model is the same as the Bayes factor, `r round(exp(stats$bayesfactor$bayesfactor), 2)`.

## Efficiency of importance sampler

As long as `montecarlo.samples` is big enough, we should obtain a similar estimate of the marginal likelihood without importance sampling.
Since our target distribution $g$ is a mixture of the prior and posterior with weight `prior.weight`, setting `prior.weight=1` just samples $\theta$'s from our prior (i.e., importance sampling is not implemented).  Below, we compare the stability of the Bayes factor estimate as a function of the Monte Carlo sample size and prior weight:

```{r prior.weight, cache=TRUE}
fun <- function(montecarlo.samples, data,
                param.list, prior.weight=0.1){
    param.list$montecarlo.samples <- montecarlo.samples
    param.list$prior.weight <- prior.weight
    res <- importance_sampler(data, param.list,
                              save_montecarlo=FALSE) %>%
        mutate(montecarlo.samples=montecarlo.samples,
               prior.weight=param.list$prior.weight)
    res
}
fun2 <- function(montecarlo.samples, data,
                 param.list, prior.weight=0.1,
                 nreps=100){
    res <- replicate(nreps, fun(montecarlo.samples, data,
                                param.list,
                                prior.weight=prior.weight),
                     simplify=FALSE) %>%
        do.call(bind_rows, .) %>%
        group_by(montecarlo.samples, prior.weight) %>%
        summarize(mean_bf=mean(bayesfactor),
                  sd_bf=sd(bayesfactor),
                  .groups="drop")
    res
}
S <- c(100, 1000, seq(10e3, 50e3, by=10000))
results <- S %>%
    map_dfr(fun2, data=dat$data[[1]], param.list=param.list)
results2 <- S %>%
    map_dfr(fun2, data=dat$data[[1]], param.list=param.list,
            prior.weight=1)
combined <- bind_rows(results, results2)
```


```{r standardev, fig.width=8, fig.height=5}
combined %>%
    mutate(prior.weight=factor(prior.weight)) %>%
    ggplot(aes(montecarlo.samples, sd_bf)) +
    geom_point(aes(color=prior.weight)) +
    geom_line(aes(group=prior.weight, color=prior.weight)) +
    scale_y_log10() +
    theme_bw(base_size=16) +
    xlab("Monte Carlo samples") +
    ylab("Standard deviation of\n log Bayes Factor")
```

Note that with importance sampling, relatively stable estimates for the Bayes factor are obtained with as few as 10,000 Monte Carlo samples while sampling from the prior distribution is very unstable for small sample sizes.

```{r means, fig.width=8, fig.height=5}
combined %>%
    mutate(prior.weight=factor(prior.weight)) %>%
    filter(montecarlo.samples > 100) %>%
    ggplot(aes(prior.weight, mean_bf)) +
    geom_point() +
    theme_bw(base_size=16) +
    ylab("Mean log Bayes factor") +
    xlab("Prior weight")
```


# Application to van't Erve et al.

We illustrate this approach on a dataset of cfDNA and matched buffy coat sequencing for patients with metastatic colorectal cancer [@vanterve2023].
Below, we select four mutations and run the importance sampler for these candidate mutations independently.

```{r vanterve}
data(crcseq, package="plasmut")
crcseq %>% select(-position)
```

```{r vanterve-importance-sampling}
params <- list(ctdna = list(a = 1, b = 9), 
               ctc = list(a = 1, b = 10^3), 
               chip = list(a= 1, b = 9), 
               montecarlo.samples = 50e3, 
               prior.weight = 0.1)
muts <- unite(crcseq, "uid", c(patient, gene), remove = FALSE) %>% 
        group_by(uid) %>% nest()
#Each element of the data column contains a table with the variant and total allele counts in plasma and buffy coat. 
#Run the importance sampler
muts$IS <- muts$data %>% map(importance_sampler, params)
fun <- function(x){
    result <- x$data %>%
        select(-position) %>%
        mutate(bayes_factor = x$IS$bayesfactor$bayesfactor)
    return(result)
}
bf <- apply(muts, 1, fun) 
bf %>% do.call(rbind, .) %>%
    as_tibble() %>%
    select(patient, gene, aa, bayes_factor) %>%
    rename(log_bf=bayes_factor) %>%
    distinct()
```

For the E1306\* *APC* mutation,  395 of 1750 fragments in cfDNA contain the mutation while zero of 963 fragments from buffy coat contain this mutation.
The evidence that E1306\* is tumor-derived is definitive (log Bayes factor = 190).
For the M237K *TP53* mutation, we observe 5 mutations out of 1495 fragments in buffy coat and 15 mutations out of 2969 fragments in cfDNA.
The observed mutant read rate is roughly equal in WBC and cfDNA, providing further evidence that the variant likely originates from CH.
As indicated by our prior, we feel that a high CTC fraction in buffy coat is very unlikely given the rarity of CTCs relative to white blood cells in buffy coat.
The log Bayes factor (equivalent to the posterior log odds assuming a prior odds of 1) is -1.14.
The probability that the mutation is tumor-derived is only 0.25 ($\frac{exp(-1.14)}{exp(-1.14) + 1}$ or 0.24).

# Session information

```{r session}
sessionInfo()
```