Assume mutation analyses identify four fragments with a TP53 variant out of 1000 fragments overlapping that position. In matched WBC sequencing, we observed no fragments with this mutation out of 600 distinct fragments spanning this position. While TP53 is a well-known tumor suppressor, mutations in TP53 are also common in CH. Given the mutant allele frequencies in cfDNA and matched buffy coat sequencing and prior studies, how strong is the evidence that the TP53 mutation is tumor-derived?
Install the plasmut package from Bioconductor
library(magrittr)
library(tidyverse)
library(plasmut)
knitr::opts_chunk$set(cache=TRUE)
lo <- function(p) log(p/(1-p))
We assume the following minimal representation for the mutation data such that each row uniquely identifies a mutation within a sample.
## sample data
p53 <- tibble(y=c(4, 0),
n=c(1000, 600),
analyte=c("plasma", "buffy coat"),
mutation="TP53",
sample_id="id1")
dat <- p53 %>%
unite("uid", c(sample_id, mutation), remove=FALSE) %>%
group_by(uid) %>%
nest()
## required format for plasmut
dat
## # A tibble: 1 × 2
## # Groups: uid [1]
## uid data
## <chr> <list>
## 1 id1_TP53 <tibble [2 × 5]>
Let (yp, np) denote the number of mutant reads and total number of distinct reads in plasma and (yw, nw) the corresponding frequencies from the buffy coat. Assuming that the mutation is either tumor- or CH-derived, the posterior odds is given by $$\frac{p(S | y_p, y_w, n_p, n_w)}{p(H | y_p, y_w, n_p, n_w)} = \frac{p(y_w, y_p | n_p, n_w, S)}{p(y_w, y_p | n_p, n_w, H)} \cdot \frac{p(S)}{p(H)}, $$ where S indicates the somatic (tumor-derived) model and H denotes the hematopietic (CH-derived) model. The term $\frac{p(y_w, y_p | n_p, n_w, S)}{p(y_w, y_p | n_p, n_w, H)}$ is the Bayes factor and is a ratio of the marginal likelihoods. For the denominator, we assume that the unobserved true mutant allele frequency (θ) in plasma is the same as the mutant allele frequency in buffy coat and rewrite the marginal likelihood as ∫θp(yp|θ, np)p(yw|θ, nw)p(θ|H)dθ. We suggest a diffuse prior for θ to provide support for both rare and common CH mutations.
For model S, the marginal likelihood is factored as ∫θpp(yp|np, θp)p(θp|S)dθp∫θwp(yw|nw, θw)p(θw|S)dθw. A diffuse prior for θp allows support for both rare and common somatic mutations. Under model S, θW is > 0 if circulating tumor cells (CTCs) are inter-mixed with white blood cells in the buffy coat. As CTCs tend to be uncommon even in patients with late-stage disease, we suggest a prior distribution that concentrates most of the mass near zero (i.e., Beta(1, 103)).
We can compute marginal likelihoods for model S and H by simply drawing Monte Carlo samples from the priors and approximating the above integrals by the mean. However, when the likelihood is much more concentrated than the prior, this sampling approach can be inefficient and numerically unstable. To mitigate these issues, we implemented an importance sampling approach where the target distribution for the importance sampler is a weighted average of the prior and posterior. The resulting mixture distribution has a shape similar to the posterior but with fatter tails. As the target distribution ensures that we sample values of θ where the posterior has a positive likelihood, approximation of the marginal likelihoods is more accurate with fewer Monte Carlo simulations.
In the following code block, we assume that (1) the probability that
CTCs are mixed in with the WBCs is small (e.g., 1 CTC per 1,000 cells)
using a Beta(1, 10^3) prior, (2) the prior for θp is relatively
diffuse (Beta(1, 9)), and (3) the prior for germline or CHIP variants in
WBCs is also relatively diffuse (Beta(1, 9)). A mixture weight of 0.1
for the prior (prior.weight
) concentrates most of the mass
of the target distribution on the posterior:
## Parameters
param.list <- list(ctc=list(a=1, b=10e3),
ctdna=list(a=1, b=9),
chip=list(a=1, b=9),
montecarlo.samples=50e3,
prior.weight=0.1)
Next, we estimate the marginal likelihood for the mutation
frequencies under the S and
H models and return all Monte
Carlo samples in a list. For running this model on datasets with a large
number of candidate tumor-derived mutations, we recommend saving only
the marginal likelihoods and Bayes factors by setting
save_montecarlo=FALSE
.
stats <- importance_sampler(dat$data[[1]], param.list)
## Just the mutation-level summary statistics (marginal likelihoods and bayes factors)
importance_sampler(dat$data[[1]], param.list, save_montecarlo=FALSE)
## # A tibble: 1 × 4
## ctc ctdna chip bayesfactor
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.0583 -4.75 -7.09 2.28
We view the plasma MAF of 4/1000 and buffy coat MAF of 0/600 as weak evidence that the mutation is tumor derived (Bayes factor = 9.76). As previous studies have demonstrated that TP53 mutations are common in CH, our prior odds is 1 and so the posterior odds for the tumor-origin model is the same as the Bayes factor, 9.76.
As long as montecarlo.samples
is big enough, we should
obtain a similar estimate of the marginal likelihood without importance
sampling. Since our target distribution g is a mixture of the prior and
posterior with weight prior.weight
, setting
prior.weight=1
just samples θ’s from our prior (i.e., importance
sampling is not implemented). Below, we compare the stability of the
Bayes factor estimate as a function of the Monte Carlo sample size and
prior weight:
fun <- function(montecarlo.samples, data,
param.list, prior.weight=0.1){
param.list$montecarlo.samples <- montecarlo.samples
param.list$prior.weight <- prior.weight
res <- importance_sampler(data, param.list,
save_montecarlo=FALSE) %>%
mutate(montecarlo.samples=montecarlo.samples,
prior.weight=param.list$prior.weight)
res
}
fun2 <- function(montecarlo.samples, data,
param.list, prior.weight=0.1,
nreps=100){
res <- replicate(nreps, fun(montecarlo.samples, data,
param.list,
prior.weight=prior.weight),
simplify=FALSE) %>%
do.call(bind_rows, .) %>%
group_by(montecarlo.samples, prior.weight) %>%
summarize(mean_bf=mean(bayesfactor),
sd_bf=sd(bayesfactor),
.groups="drop")
res
}
S <- c(100, 1000, seq(10e3, 50e3, by=10000))
results <- S %>%
map_dfr(fun2, data=dat$data[[1]], param.list=param.list)
results2 <- S %>%
map_dfr(fun2, data=dat$data[[1]], param.list=param.list,
prior.weight=1)
combined <- bind_rows(results, results2)
combined %>%
mutate(prior.weight=factor(prior.weight)) %>%
ggplot(aes(montecarlo.samples, sd_bf)) +
geom_point(aes(color=prior.weight)) +
geom_line(aes(group=prior.weight, color=prior.weight)) +
scale_y_log10() +
theme_bw(base_size=16) +
xlab("Monte Carlo samples") +
ylab("Standard deviation of\n log Bayes Factor")
Note that with importance sampling, relatively stable estimates for the Bayes factor are obtained with as few as 10,000 Monte Carlo samples while sampling from the prior distribution is very unstable for small sample sizes.
We illustrate this approach on a dataset of cfDNA and matched buffy coat sequencing for patients with metastatic colorectal cancer (van ’t Erve et al. 2023). Below, we select four mutations and run the importance sampler for these candidate mutations independently.
## # A tibble: 8 × 6
## patient gene aa analyte y n
## <int> <chr> <chr> <chr> <int> <int>
## 1 12 APC E1306* plasma 395 1750
## 2 12 APC E1306* buffy coat 0 963
## 3 12 HRAS Y96F plasma 4 1541
## 4 12 HRAS Y96F buffy coat 0 946
## 5 13 FGFR2 428V>E plasma 4 2400
## 6 13 FGFR2 428V>E buffy coat 0 1297
## 7 157 TP53 M237K plasma 15 2969
## 8 157 TP53 M237K buffy coat 5 1495
params <- list(ctdna = list(a = 1, b = 9),
ctc = list(a = 1, b = 10^3),
chip = list(a= 1, b = 9),
montecarlo.samples = 50e3,
prior.weight = 0.1)
muts <- unite(crcseq, "uid", c(patient, gene), remove = FALSE) %>%
group_by(uid) %>% nest()
#Each element of the data column contains a table with the variant and total allele counts in plasma and buffy coat.
#Run the importance sampler
muts$IS <- muts$data %>% map(importance_sampler, params)
fun <- function(x){
result <- x$data %>%
select(-position) %>%
mutate(bayes_factor = x$IS$bayesfactor$bayesfactor)
return(result)
}
bf <- apply(muts, 1, fun)
bf %>% do.call(rbind, .) %>%
as_tibble() %>%
select(patient, gene, aa, bayes_factor) %>%
rename(log_bf=bayes_factor) %>%
distinct()
## # A tibble: 4 × 4
## patient gene aa log_bf
## <int> <chr> <chr> <dbl>
## 1 12 APC E1306* 190.
## 2 12 HRAS Y96F 1.72
## 3 13 FGFR2 428V>E 1.33
## 4 157 TP53 M237K -1.14
For the E1306* APC mutation, 395 of 1750 fragments in cfDNA contain the mutation while zero of 963 fragments from buffy coat contain this mutation. The evidence that E1306* is tumor-derived is definitive (log Bayes factor = 190). For the M237K TP53 mutation, we observe 5 mutations out of 1495 fragments in buffy coat and 15 mutations out of 2969 fragments in cfDNA. The observed mutant read rate is roughly equal in WBC and cfDNA, providing further evidence that the variant likely originates from CH. As indicated by our prior, we feel that a high CTC fraction in buffy coat is very unlikely given the rarity of CTCs relative to white blood cells in buffy coat. The log Bayes factor (equivalent to the posterior log odds assuming a prior odds of 1) is -1.14. The probability that the mutation is tumor-derived is only 0.25 ($\frac{exp(-1.14)}{exp(-1.14) + 1}$ or 0.24).
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plasmut_1.5.0 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
## [5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
## [9] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 magrittr_2.0.3
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] stringi_1.8.4 hms_1.1.3 digest_0.6.37
## [7] evaluate_1.0.1 grid_4.4.2 timechange_0.3.0
## [10] fastmap_1.2.0 jsonlite_1.8.9 BiocManager_1.30.25
## [13] scales_1.3.0 codetools_0.2-20 jquerylib_0.1.4
## [16] cli_3.6.3 rlang_1.1.4 munsell_0.5.1
## [19] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [22] tools_4.4.2 tzdb_0.4.0 colorspace_2.1-1
## [25] buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1
## [28] lifecycle_1.0.4 pkgconfig_2.0.3 pillar_1.10.0
## [31] bslib_0.8.0 gtable_0.3.6 glue_1.8.0
## [34] xfun_0.49 tidyselect_1.2.1 sys_3.4.3
## [37] knitr_1.49 farver_2.1.2 htmltools_0.5.8.1
## [40] rmarkdown_2.29 maketools_1.3.1 labeling_0.4.3
## [43] compiler_4.4.2