Phage immuno-precipitation sequencing (PhIP-seq) is a high-throughput
approach for characterizing antibody responses to a variety of target
antigens. A typical component of PhIP-seq analyses involves identifying
which peptides elicit enriched antibody responses. beer
provides two approaches for identifying peptide enrichments.
The first approach is based on edgeR
’s
standard pipeline for identifying differential expression from read
count data123. Though edgeR
is remarkably effective at quickly identifying enriched antibody
responses, it is less likely to pick up enriched peptides at the lower
fold-change range.
The second approach, Bayesian Estimation in R (BEER) was developed specifically for the PhIP-seq setting and implements a Bayesian model to identify peptide enrichments as described in Chen et. al4. Though BEER is more likely to identify enriched peptides at the lower fold-change range, it tends to take much longer to run.
Along with beer
, we will use the following packages in
this vignette:
rjags
For Bayesian MCMC modeling, beer
relies on rjags to
interface Just Another Gibbs
Sampler (JAGS). JAGS can be downloaded from this link.
Homebrew users can install JAGS
using,
brew install jags
For M1 Mac users using Rosetta emulation of intel, Homebrew installation of JAGS will likely work. However, we recommend installing JAGS from source for all other M1 Mac users.
Once JAGS has been installed, rjags can be
installed in R
via
install.packages("rjags")
.
To demonstrate beer
, we simulate a small toy data set of
10 samples, each with 50 peptides. Four of the ten samples were
beads-only samples, and each of the six remaining samples had 5 enriched
peptides. Note that since there are rather few beads-only samples, each
with very few peptides, the inference is likely to be very poor.
Parameters for non-enriched peptides were derived from beads-only
samples run with samples from HIV elite controllers. For the code to
generate the data set, see
file.path(package = "beer", "script/sim_data.R")
.
data_path <- system.file("extdata/sim_data.rds", package = "beer")
sim_data <- readRDS(data_path)
sim_data
#> class: PhIPData
#> dim: 10 10
#> metadata(8): seed a_pi ... b_c fc
#> assays(8): counts logfc ... true_b true_theta
#> rownames(10): 1 2 ... 9 10
#> rowData names(2): a_0 b_0
#> colnames(10): 1 2 ... 9 10
#> colData names(5): group n_init n true_c true_pi
#> beads-only name(4): beads
Differentially enriched peptides between a particular serum sample and all beads-only samples indicate enriched antibody responses to those peptides. Thus, to identify enriched peptides, we can run the standard edgeR pipeline for differential expression.
The runEdgeR()
function estimates peptide-specific
dispersion parameters then tests identifies differentially expressed
peptides using either the exact test proposed by Robinson and Smyth5 (default,
also specified with de.method = "exactTest"
), or the GLM
quasi-likelihood F-test6 (specified with
de.method = "glmQLFTest"
). Since peptides are enriched only
if average proportion of reads pulled in the serum sample is higher than
the average proportion of reads pulled in a beads-only samples,
two-sided p-values are converted to one-sided p-values.
edgeR log10 p-values and log2 estimated fold-changes are returned in
the assays specified by assay.names
.
Using BH correction to adjust for multiple testing, enriched peptides are given by the matrix,
BEER uses a Bayesian hierarchical model to derive posterior probabilities of enrichment and estimated fold-changes. Briefly, each sample is run individually in comparison to all beads-only samples as follows:
BEER can be easily run with brew()
. Like with
runEdgeR()
, results are stored in the locations specified
by assay.names
. We can add the existing results to our
edgeR output as follows.
## Named vector specifying where we want to store the summarized MCMC output
## NULL indicates that the output should not be stored.
assay_locations <- c(
phi = "beer_fc_marg",
phi_Z = "beer_fc_cond",
Z = "beer_prob",
c = "sampleInfo",
pi = "sampleInfo"
)
beer_out <- brew(edgeR_out, assay.names = assay_locations)
Thus, supposing peptides with posterior probability above 0.5 are enriched and noting that super enriched peptides were not run (and thus are missing entries in the posterior probability matrix), the matrix of enriched peptides is given by,
## Define matrix of peptides that were run in BEER
was_run <- matrix(rep(beer_out$group != "beads", each = nrow(beer_out)),
nrow = nrow(beer_out))
## Identify super-enriched peptides
## These peptides were in samples that were run, but have missing posterior
## probabilities
are_se <- was_run & is.na(assay(beer_out, "beer_prob"))
## Enriched peptides are peptides with:
## - posterior probability > 0.5, OR
## - super-enriched peptides
assay(beer_out, "beer_hits") <- assay(beer_out, "beer_prob") > 0.5 | are_se
colSums(assay(beer_out, "beer_hits"))
#> 1 2 3 4 5 6 7 8 9 10
#> NA NA NA NA 3 1 1 1 0 1
Each of the steps above are controlled by arguments within
brew()
and are described in a bit more detail in the
following sections.
The model relies on the following prior parameters:
a_pi
, b_pi
: shape parameters for a beta
distribution that describes the proportion of peptides expected to be
enriched in a given sample. The defaults are a_pi = 2
,
b_pi = 300
.a_phi
, b_phi
: shape parameters for a gamma
distribution that describes fold-change for enriched peptides. This
distribution is shifted by fc
(see below). By default,
a_phi = 1.25
, b_phi = 0.1
.a_c
, b_c
: shape parameters for the
attenuation constant. The defaults are a_c = 80
,
b_c = 20
.fc
: minimum fold-change for an enriched peptide,
defaults to 1.a_0j
, b0j
: peptide-specific shape
parameters. For peptide j,
a_0j
, b_0j
are the shape parameters for the
beta distribution that describes the proportion of reads pulled
presuming the peptide is not enriched.The default prior distributions for the proportion of enriched peptides in a serum sample, the fold-change for enriched peptides, and the attenuation constants are shown below.
Prior parameters are specified in the
prior.params
argument of
brew()
. Rather than supplying a_0j
and
b_0j
, the user can alternatively specify a method of
deriving a_0j
and b_0j
from the beads-only
samples. Currently, beer
supports MOM, MLE, and edgeR
estimates for a_0j
, b_0j
. Additional
parameters for these methods can be supplied using the
beads.args
parameter of
brew()
. Alternatively the user can supply custom
a_0j
, b_0j
to prior.params
. Note
that when a_0j
and b_0j
are supplied, the
prior parameters are not re-calculated after tossing out clearly
enriched peptides.
For convenience, beta parameters can be estimated using MLE or MOM
given a vector of proportions using getAB()
. For any beta
distribution, the MCMC sampler may get stuck for shape parameters less
than one, so if possible it is safer for beta shape parameters to be
greater than one.
Super enriched peptides are peptides with MLE or edgeR estimated
fold-changes in comparison to beads-only samples above a given
threshold, which is defaulted to 15. These peptides should always have
posterior probability 1 of being enriched. The more peptides that are
identified as super enriched, the faster BEER runs. However, the
threshold should be conservatively set such that all peptides are
guaranteed to be enriched. In brew()
, the argument
se.params
controls how super enriched
peptides are identified.
JAGS parameters, such as the number of chains, number of iteration,
thinning parameters, and more are defined in
jags.params
and directly passed to
rjags::jags.model()
and
rjags::coda.samples()
.
By default, if sample.dir = NULL
in
brew()
, the samples from each MCMC run are saved in the
R
temporary directory given by tempdir()
.
Samples can be saved by specifying a non-null sample directory. The
posterior samples for each serum sample are saved in an RDS files named
after the serum sample ID.
Is is important to note that MCMC parameters are indexed by row and
column. Since super enriched peptides are tossed out before
running the MCMC, the row index of the parameter does not necessarily
correspond to the row of the PhIPData object. For example parameter
Z[5, 1]
could in reality correspond to peptide 8 in the
PhIPData object if there were three peptides before peptide 8 that were
labeled as super-enriched.
To approximate the false positive rate, we often run each of the beads-only samples against all other beads-only samples. This beads-only round robin also provides a sense of how similar the beads-only samples are to each other.
The beads-only round robin can be included in brew()
and
runEdgeR()
by specifying beadsRR = TRUE
.
## edgeR with beadsRR
edgeR_beadsRR <- runEdgeR(sim_data, beadsRR = TRUE,
assay.names = c(logfc = "edgeR_logfc",
prob = "edgeR_logpval"))
## Calculate hits
assay(edgeR_beadsRR, "edgeR_hits") <- apply(
assay(edgeR_beadsRR, "edgeR_logpval"), 2,
function(sample){
pval <- 10^(-sample)
p.adjust(pval, method = "BH") < 0.05
})
## Note samples 1-4 have 0 instead of NA now
colSums(assay(edgeR_beadsRR, "edgeR_hits"))
#> 1 2 3 4 5 6 7 8 9 10
#> 0 0 0 0 1 1 1 0 0 1
## BEER with beadsRR added to edgeR output
beer_beadsRR <- brew(edgeR_beadsRR, beadsRR = TRUE,
assay.names = assay_locations)
## Check BEER hits like before
was_run <- matrix(rep(beer_beadsRR$group != "beads", each = nrow(beer_beadsRR)),
nrow = nrow(beer_beadsRR))
are_se <- was_run & is.na(assay(beer_beadsRR, "beer_prob"))
beer_hits <- assay(beer_beadsRR, "beer_prob") > 0.5 | are_se
## Note again that samples 1-4 are not NA
colSums(beer_hits)
#> 1 2 3 4 5 6 7 8 9 10
#> 0 0 0 0 3 1 1 1 0 1
Alternatively, one can run beadsRR()
separately,
## edgeR with beadsRR
edgeR_beadsRR <- beadsRR(sim_data, method = "edgeR",
assay.names = c(logfc = "edgeR_logfc",
prob = "edgeR_logpval"))
## Calculate hits
assay(edgeR_beadsRR, "edgeR_hits") <- apply(
assay(edgeR_beadsRR, "edgeR_logpval"), 2,
function(sample){
pval <- 10^(-sample)
p.adjust(pval, method = "BH") < 0.05
})
## Note samples 5-10 are NA now
colSums(assay(edgeR_beadsRR, "edgeR_hits"))
#> 1 2 3 4 5 6 7 8 9 10
#> 0 0 0 0 NA NA NA NA NA NA
## BEER with beadsRR added to edgeR output
beer_beadsRR <- beadsRR(edgeR_beadsRR, method = "beer",
assay.names = assay_locations)
## Check BEER hits like before
was_run <- matrix(rep(beer_beadsRR$group == "beads", each = nrow(beer_beadsRR)),
nrow = nrow(beer_beadsRR))
are_se <- was_run & is.na(assay(beer_beadsRR, "beer_prob"))
beer_hits <- assay(beer_beadsRR, "beer_prob") > 0.5 | are_se
## Note again that samples 5-10 are now NA
colSums(beer_hits)
#> 1 2 3 4 5 6 7 8 9 10
#> 0 0 0 0 NA NA NA NA NA NA
By default, beer
runs using the first registered
back-end for parallelization as returned by
BiocParallel::bpparam()
. Other parallel evaluation
environments are supported via the BiocParallel
package (see the BiocParallel
vignette for a list of possible parallelization environments). In
both brew()
and runEdgeR
, the parallel
environments are passed via the BPPARAM
argument which
takes a BiocParallelParam
object.
## Run edgeR using different parallel environments
runEdgeR(sim_data, BPPARAM = BiocParallel::SerialParam())
#> class: PhIPData
#> dim: 10 10
#> metadata(8): seed a_pi ... b_c fc
#> assays(8): counts logfc ... true_b true_theta
#> rownames(10): 1 2 ... 9 10
#> rowData names(2): a_0 b_0
#> colnames(10): 1 2 ... 9 10
#> colData names(5): group n_init n true_c true_pi
#> beads-only name(4): beads
runEdgeR(sim_data, BPPARAM = BiocParallel::SnowParam())
#> class: PhIPData
#> dim: 10 10
#> metadata(8): seed a_pi ... b_c fc
#> assays(8): counts logfc ... true_b true_theta
#> rownames(10): 1 2 ... 9 10
#> rowData names(2): a_0 b_0
#> colnames(10): 1 2 ... 9 10
#> colData names(5): group n_init n true_c true_pi
#> beads-only name(4): beads
## Run beer in parallel
brew(sim_data, BPPARAM = BiocParallel::SerialParam())
#> class: PhIPData
#> dim: 10 10
#> metadata(8): seed a_pi ... b_c fc
#> assays(8): counts logfc ... true_b true_theta
#> rownames(10): 1 2 ... 9 10
#> rowData names(2): a_0 b_0
#> colnames(10): 1 2 ... 9 10
#> colData names(7): group n_init ... c pi
#> beads-only name(4): beads
brew(sim_data, BPPARAM = BiocParallel::SnowParam())
#> class: PhIPData
#> dim: 10 10
#> metadata(8): seed a_pi ... b_c fc
#> assays(8): counts logfc ... true_b true_theta
#> rownames(10): 1 2 ... 9 10
#> rowData names(2): a_0 b_0
#> colnames(10): 1 2 ... 9 10
#> colData names(7): group n_init ... c pi
#> beads-only name(4): beads
To facilitate visualization of PhIP-Seq analyses, beer
includes plot helpers that calculate values of interest and stores these
values as a new assays in the PhIPData
object. Currently,
these functions include:
getExpected()
returns expected reads/proportions pulled
by a peptide based on the average proportion of reads pulled across all
beads-only samples,getBF()
: returns Bayes factors for the probability of
enriched antibody responses for each peptide.getExpected()
To calculate the expected read counts and proportion of reads, we can
specify type = c("rc", "prop")
in
getExpected()
. This resulting PhIPData
object
can be converted to a DataFrame
/tibble
which
can be plotted using any plotting method of choice. For example, we can
visualize the simulated data set by plotting the observed read counts to
the expected read counts as follows,
We can visualize the simulated data by plotting the observed versus the expected number of reads.
getBF()
Once brew()
has been run on a PhIPData
object, we can plot Bayes factors for peptide enrichment using the
getBF()
plot helper. getBF()
returns a
PhIPData
object with Bayes factors stored as an additional
assay. For example, on the simulated data set, a plot of Bayes factors
by peptide can be generated as follows,
sessionInfo()
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] beer_1.11.0 rjags_4-16
#> [3] coda_0.19-4.1 PhIPData_1.13.0
#> [5] SummarizedExperiment_1.35.5 Biobase_2.65.1
#> [7] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
#> [9] IRanges_2.39.2 S4Vectors_0.43.2
#> [11] BiocGenerics_0.51.3 MatrixGenerics_1.17.1
#> [13] matrixStats_1.4.1 dplyr_1.1.4
#> [15] ggplot2_3.5.1 BiocStyle_2.33.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 farver_2.1.2 blob_1.2.4
#> [4] filelock_1.0.3 fastmap_1.2.0 BiocFileCache_2.13.2
#> [7] digest_0.6.37 lifecycle_1.0.4 statmod_1.5.0
#> [10] RSQLite_2.3.7 magrittr_2.0.3 compiler_4.4.1
#> [13] rlang_1.1.4 sass_0.4.9 tools_4.4.1
#> [16] utf8_1.2.4 yaml_2.3.10 knitr_1.48
#> [19] S4Arrays_1.5.11 labeling_0.4.3 bit_4.5.0
#> [22] curl_5.2.3 DelayedArray_0.31.14 abind_1.4-8
#> [25] BiocParallel_1.39.0 withr_3.0.2 purrr_1.0.2
#> [28] sys_3.4.3 grid_4.4.1 fansi_1.0.6
#> [31] colorspace_2.1-1 progressr_0.15.0 edgeR_4.3.21
#> [34] scales_1.3.0 cli_3.6.3 rmarkdown_2.28
#> [37] crayon_1.5.3 generics_0.1.3 httr_1.4.7
#> [40] DBI_1.2.3 cachem_1.1.0 zlibbioc_1.51.2
#> [43] splines_4.4.1 parallel_4.4.1 BiocManager_1.30.25
#> [46] XVector_0.45.0 vctrs_0.6.5 Matrix_1.7-1
#> [49] jsonlite_1.8.9 bit64_4.5.2 maketools_1.3.1
#> [52] locfit_1.5-9.10 limma_3.61.12 jquerylib_0.1.4
#> [55] snow_0.4-4 glue_1.8.0 codetools_0.2-20
#> [58] gtable_0.3.6 UCSC.utils_1.1.0 munsell_0.5.1
#> [61] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
#> [64] GenomeInfoDbData_1.2.13 R6_2.5.1 dbplyr_2.5.0
#> [67] evaluate_1.0.1 lattice_0.22-6 highr_0.11
#> [70] memoise_2.0.1 bslib_0.8.0 SparseArray_1.5.45
#> [73] xfun_0.48 buildtools_1.0.0 pkgconfig_2.0.3
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140↩︎
McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297↩︎
Chen Y, Lun ATL, Smyth GK (2016). From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438↩︎
Chen A, Kammers K, Larman HB, Scharpf R, Ruczinski I. Detecting antibody reactivities in phage immunoprecipitation sequencing data (2022). bioRxiv. https://www.biorxiv.org/content/10.1101/2022.01.19.476926v1↩︎
Robinson MD and Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data (2008). Biostatistics, 9, 321-332. https://doi.org/10.1093/biostatistics/kxm030↩︎
Lun, ATL, Chen, Y, and Smyth, GK. It’s DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR (2016). Methods in Molecular Biology, 1418, 391–416.↩︎