Droplet Digital PCR (ddPCR) is a system from Bio-Rad for estimating the number of genomic fragments in samples. ddPCR attaches fluorochromes to targets of interest, for example, mutant and wild type KRAS. Each sample is then divided into 20,000 droplets and qPCR is run on each droplet. The brightness of these droplets is measured in two channels, each corresponding to our targets. The amplitudes of the droplets can be plotted and the results analysed to see whether droplets can be called as:
There are variations in the brightnesses of the droplets; this can be particularly evident in and around the PP cluster, where there may be some crosstalk due the presence of both fluorochromes. The classification of droplets is therefore not necessarily as simple as deciding on brightness thresholds for each channel, above which a positive reading is called in that channel.
This vignette demonstrates how the twoddpcr
package may
be used to load data, classified and how to quickly create
summaries.
twoddpcr
packageThe package can be installed from Bioconductor using:
Alternatively, it can be installed from GitHub using:
Another alternative is to install the package from source:
twoddpcr
packageOnce the package has been installed, it can be loaded in the usual way:
Our example uses the KRASdata
dataset, which comes as
part of the package. This dataset was created as a triplicate four-fold
serial dilution with 5% A549 mutant KRAS cell lines and 95% H1048 wild
type KRAS cells as starting material.
To follow along with this dataset, create a ddpcrPlate
object using:
To follow along with your own dataset, see the Appendix.
All of the droplets can be plotted to see how they tend cluster:
This might not be particularly informative; for example, a density plot may be more appropriate:
It can be seen here that most of the droplets are concentrated in the bottom-left and bottom-right clusters.
To take a different view, all of the wells could be plotted side-by-side:
Since the droplet amplitudes were extracted from Bio-Rad’s
QuantaSoft, there may already be some kind of classification.
This can be checked with the commonClassificationMethod
method, which retrieves the classification methods that exist for all of
the wells in the plate:
## [1] "None" "Cluster"
The Cluster
classification can be plotted:
Again, these are all of the wells in the plate superimposed onto the same plot. This gives a good overall picture, but the detection of rare alleles in individual wells is where ddPCR is particularly useful. The wells that used in the plate are:
## [1] "E03" "F03" "G03" "H03" "A04" "B04" "C04" "D04" "E04" "F04" "G04" "H04"
Individual wells can be selected using the [[...]]
syntax (as with lists in R). These can be plotted in the same
way using dropletPlot
. For example:
thresholdClassify
)This section illustrates how the Cluster
classification
was obtained, although the original classification was found using
QuantaSoft. The classification here involves setting linear
gates (thresholds) for the two channels Ch1.Amplitude
and
Ch2.Amplitude
, above each of which we will call a positive
reading for that channel.
The commonClassificationMethod
method shows that there
is now a new classification method:
## [1] "None" "Cluster" "thresholds"
The thresholds
classification can be plotted using
dropletPlot
but changing the cMethod
parameter:
kmeansClassify
)Visually, it appears that the classification in the previous section
does not accurately classify a region between the main NP
and PP
clusters. There are a number of algorithms that
could be used to better classify the clusters; one such example is the
k-means clustering algorithm. The k-means algorithm is relatively fast
but requires that we know how many clusters there are. With this in
mind, it helps to classify all of the wells together so that human
intervention is not required to judge whether some clusters in
individual wells are empty. To run the algorithm on
ddpcrPlate
objects, the kmeansClassify
method
is used:
## [1] "None" "Cluster" "thresholds" "kmeans"
Notice how the PP
cluster incorporates more of the
droplets when compared to the thresholds
case. Visually, it
appears that k-means captures the clustering behaviour of the droplets
more accurately.
Using the same wells chosen before, it is interesting to see how the individual wells classify:
There are regions between clusters where the classification is
ambiguous, e.g. above and to the left of the NP cluster. These regions
can be labelled as “Rain” and removed from the droplet counts in each of
the clusters. To achieve this, the mahalanobisRain
method
can be used.
The classification methods are now:
## [1] "None" "Cluster" "thresholds" "kmeans"
## [5] "kmeansMahRain"
Whenever droplets are relabelled as Rain
using the
mahalanobisRain
method, the character string “MahRain” is
appended to the classification name to distinguish it from the original.
This classification is plotted as:
This does not look particularly good; a lot of droplets that should
be classified have been labelled as “Rain” instead. To remedy this, the
maxDistances
parameter can be adjusted to control the
maximum (Mahalanobis) distance that droplets can be from the cluster
centres. Some fine-tuning of this parameter gives:
plate <- mahalanobisRain(plate, cMethod="kmeans",
maxDistances=list(NN=35, NP=35, PN=35, PP=35))
commonClassificationMethod(plate)
## [1] "None" "Cluster" "thresholds" "kmeans"
## [5] "kmeansMahRain"
The plot now looks slightly different:
Using the number of droplets in each classification, the Poisson distribution can be used to estimate the number of fragments/molecules in the starting sample. For the k-means classification with rain, this gives the summary:
## PP PN NP NN AcceptedDroplets MtPositives MtNegatives WtPositives
## E03 292 273 5775 6229 12569 565 12004 6067
## F03 305 256 5840 5946 12347 561 11786 6145
## G03 236 222 4877 4860 10195 458 9737 5113
## H03 24 95 1630 9931 11680 119 11561 1654
## A04 22 101 1844 10840 12807 123 12684 1866
## B04 19 112 1924 10998 13053 131 12922 1943
## WtNegatives MtConcentration WtConcentration MtCopiesPer20uLWell
## E03 6502 54.110 775.440 1082.201
## F03 6202 54.707 810.049 1094.135
## G03 5082 54.076 819.050 1081.514
## H03 10026 12.048 179.643 240.956
## A04 10941 11.354 185.264 227.072
## B04 11110 11.867 189.615 237.334
## WtCopiesPer20uLWell TotalCopiesPer20uLWell Ratio FracAbun
## E03 15508.792 16590.992 0.0698 6.523
## F03 16200.972 17295.107 0.0675 6.326
## G03 16381.000 17462.514 0.0660 6.193
## H03 3592.853 3833.809 0.0671 6.285
## A04 3705.287 3932.358 0.0613 5.774
## B04 3792.292 4029.626 0.0626 5.890
The first few columns PP
, PN
,
NP
and NN
are the numbers of droplets in each
class, whereas AcceptedDroplets
is the sum of these.
MtPositives
is the number of droplets where a mutant has
been called and conversely MtNegatives
is the number of
droplets with no mutants called. The MtConcentration
is the
Poisson estimate of how many mutant fragments there are per 1uL, while
the MtCopiesPer20uLWell
is the same figure multiplied by
20. There are Wt
(wild type) analogues of all of these
Mt
figures. Finally, Ratio
is the figure
MtConcentration/WtConcentration
and FracAbun
is the fractional abundance of mutants in the sample,
i.e. 100 * MtConcentration/(MtConcentration + WtConcentration)
.
The summaries for other classifications can still be produced by
changing the cMethod
parameter to one of those that exist
in commonClassificationMethod(plate)
.
This concludes the main walkthrough of this vignette.
As mentioned above, the KRASdata
dataset was created as
a triplicate four-fold serial dilution with 5% mutant and 95% wild type
starting material. A data frame can be created to reflect this along
with the mutant concentration values of each well.
inputNg <- c(rep(64, 3), rep(16, 3), rep(4, 3), rep(1, 3))
mtConcentrations <-
data.frame(
x=inputNg,
Cluster=plateSummary(plate, cMethod="Cluster")$MtConcentration,
kmeans=plateSummary(plate, cMethod="kmeans")$MtConcentration,
kmeansMahRain=kmeansMahRainSummary$MtConcentration)
knitr::kable(mtConcentrations)
x | Cluster | kmeans | kmeansMahRain |
---|---|---|---|
64 | 51.028 | 54.152 | 54.110 |
64 | 48.540 | 54.689 | 54.707 |
64 | 49.161 | 54.204 | 54.076 |
16 | 12.036 | 12.036 | 12.048 |
16 | 11.244 | 11.337 | 11.354 |
16 | 11.848 | 11.848 | 11.867 |
4 | 3.340 | 3.435 | 3.436 |
4 | 3.358 | 3.271 | 3.272 |
4 | 3.042 | 3.042 | 3.043 |
1 | 0.547 | 0.547 | 0.547 |
1 | 0.637 | 0.637 | 0.637 |
1 | 0.470 | 0.470 | 0.470 |
The mutant concentration values can be plotted and the various classification methods compared against each other:
library(ggplot2)
library(reshape2)
mtConcentrationsLong <- melt(mtConcentrations, id.vars=c("x"))
ggplot(mtConcentrationsLong, aes_string("x", "value")) +
geom_point() + geom_smooth(method="lm") + facet_wrap(~variable)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
Numerically, the regression lines have coefficients of determination (R2 values):
bioradLmSummary <- summary(lm(x ~ Cluster, data=mtConcentrations))
kmLmSummary <- summary(lm(x ~ kmeans, data=mtConcentrations))
kmMahRainLmSummary <- summary(lm(x ~ kmeansMahRain, data=mtConcentrations))
knitr::kable(c("Cluster"=bioradLmSummary$r.squared,
"kmeans"=kmLmSummary$r.squared,
"kmeansMahRain"=kmMahRainLmSummary$r.squared))
x | |
---|---|
Cluster | 0.9989393 |
kmeans | 0.9988022 |
kmeansMahRain | 0.9988212 |
As shown above, the regression lines fit the data from all of the
classification methods very well. Moreover, the R2 values are
all similar and very close to 1. Therefore all of the approaches are
very good and nothing can be said about which of the methods is better
or worse. The density plot created by heatPlot
above shows
that the number of droplets in the PP
cluster is relatively
small compared to the other clusters, particularly at the bottom of the
PP
cluster. This explains why the regression lines are very
similar.
An advantage of the twoddpcr
package’s k-means based
approach is that setting thresholds manually can be subjective. In
addition, the k-means clustering algorithm is more appropriate for
finding clusters when the PN
cluster ‘leans’ and the
NP
cluster ‘lifts’.
Setting rain using standard deviation is a commonly used approach in
ddPCR analysis to remove false positives. It involves setting Ch1 and
Ch2 thresholds for each cluster and removes droplets that are too far
from the cluster centres. This method was introduced in (Jones et al. 2014). However, it is not possible
to set such thresholds for the NP
cluster above because,
for example, setting a low Ch1 threshold would exclude too much from the
top-right of the cluster, whereas setting a high Ch1 threshold would
exclude nothing at all. The twoddpcr
package’s Mahalanobis
rain method allows this kind of approach to be used, while still
respecting the shapes of the clusters.
This section explains how other classification methods can be used. The methods already described above should suffice, but there are some droplet patterns that prevent them from working as well as we would like. The following methods are alternative techniques that can be used.
knnClassify
)Another classification algorithm is the k-nearest neighbour (k-NN) algorithm. The algorithm is very simple: For each droplet in the plate, look at the classifications of its nearest k-neighbours in a training set. Assign the majority classification to the droplet.
The challenge now is to find a good dataset that is not too large, since this would slow the algorithm considerably for marginal gains. A training set should also have minimal noise.
To start, two wells E03
and A04
are chosen
that reflect the clustering pattern of the plate without too much noise.
We create a new (virtual) plate with the amplitudes from these
wells.
To create the training classification, the k-means algorithm is useful:
We see that k-means has worked quite well here, since the training
set is relatively noise-free. However, it is clear that there is a
PN
droplet that is much closer to other PP
droplets than the rest of the PN
cluster. Noise can be
removed by adding rain:
trainPlate <- mahalanobisRain(trainPlate, cMethod="kmeans", maxDistances=3)
dropletPlot(trainPlate, cMethod="kmeansMahRain")
This is a much less noisy classification to use. The training data
needs to be a data frame and should also ignore the droplets classified
as Rain
; the removeDropletClasses
method
removes these droplets:
trainSet <- removeDropletClasses(trainPlate, cMethod="kmeansMahRain")
trainSet <- do.call(rbind, trainSet)
colnames(trainSet)
## [1] "Ch1.Amplitude" "Ch2.Amplitude" "kmeansMahRain"
We can check that the Rain
droplets have been
removed:
##
## NN NP PN PP Rain N/A
## 14866 6418 329 246 0 0
Next, we use this classification as the training set for the k-NN algorithm:
trainAmplitudes <- trainSet[, c("Ch1.Amplitude", "Ch2.Amplitude")]
trainCl <- trainSet$kmeansMahRain
plate <- knnClassify(plate, trainData=trainAmplitudes, cl=trainCl, k=3)
Again, it can be checked that there is a new classification method:
## [1] "None" "Cluster" "thresholds" "kmeans"
## [5] "kmeansMahRain" "knn"
This classification can be plotted in the same way as before:
gridClassify
)There may be some datasets where the above classification techniques
do not work satisfactorily. As long as the main clusters have good
separation from each other, the gridClassify
method may be
used. This method defines four ‘corner’ regions with linear cut-offs in
each channel; the remaining droplets are labelled as “Rain”. To see how
this works, consider the following (crude) example:
plate <- gridClassify(plate,
ch1NNThreshold=6500, ch2NNThreshold=2110,
ch1NPThreshold=5765, ch2NPThreshold=5150,
ch1PNThreshold=8550, ch2PNThreshold=2450,
ch1PPThreshold=6700, ch2PPThreshold=3870)
dropletPlot(plate, cMethod="grid")
This is not a particularly great classification, but this option exists should it be required. It is tedious to set the parameters above, so it may be helpful to use the Shiny app to aid in this process.
sdRain
Since droplets tend to cluster into ellipse-like structures, the
mahalanobisRain
method should usually suffice for labelling
ambiguous droplets as “Rain”. An alternative way is to use the mean and
standard deviation of each of the clusters (in both channels). To do
this, use the sdRain
method:
As is the case with Mahalanobis rain, the rain levels could be tweaked a little:
If you wish to use your own classification methods, the droplet
information would need to be extracted and can also be added to the
ddpcrPlate
object. The basic workflow would be:
Retrieve the droplet amplitudes using amplitudes
and
combine them in a single data frame:
Classify the droplets using your own method:
Add the classification to plate
:
The ddpcrPlate
class only understands classifications if
it is a factor with levels
c("NN", "NP", "PN", "PP", "Rain", "N/A")
. If the result of
your custom classification method returns a vector/factor with four
classes (with maybe some “Rain” or “N/A”), then the vector/factor may be
relabelled by:
If there are fewer than four classes, relabelClasses
will try to guess which of the classes are present. To help the method
correctly label the clusters, set the presentClasses
parameter:
A Shiny app is included in the package, which provides a GUI that allows interactive use of the package for ddPCR analysis. This can be run from an interactive R session using:
This can also be accessed at http://shiny.cruk.manchester.ac.uk/twoddpcr/.
To run on your own Shiny server, a file called app.R
should be created with the following code:
If you have run your own two channel ddPCR experiments that have
produced a QuantaSoft Plate (.qlp
) file, then the
raw droplet amplitudes can be extracted for use with the
twoddpcr
package. To do this:
.qlp
file).Ctrl
and/or
Shift
key with the mouse.Options
in the top-right.Export Amplitude and Cluster Data
.The amplitudes will be exported to a number of CSV files in the
chosen location, with one file for each well. Each file is named
<PlateName>_<WellNumber>_Amplitude.csv
, where
<PlateName>
is the name of the .qlp
file
without the extension and <WellNumber>
is the
position in the plate, e.g. B03
. These amplitude files are
now ready to be loaded using the twoddpcr
package.
The example in this vignette can be followed using a different dataset, such as those from your own ddPCR experiments. To load a dataset:
Follow the instructions in the exporting droplet amplitudes section.
The droplets can be imported using:
Here, data/amplitudes
should be changed to the directory
containing the droplet amplitude files.
While loading data, the following error message may appear:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
Possible solution: The number of columns in the header row might differ from the number of columns in the other rows. For example, there may be extra commas/tabs at the end of some lines. In such cases, the removal of ‘empty’ columns should fix the problem.
twoddpcr
If you use the twoddpcr
package in your work, please
cite the Bioinformatics
paper:
## To cite twoddpcr in publications, please use:
##
## Anthony Chiu, Mahmood Ayub, Caroline Dive, Ged Brady, Crispin J
## Miller; twoddpcr: An R/Bioconductor package and Shiny app for Droplet
## Digital PCR analysis. Bioinformatics 2017 btx308. doi:
## 10.1093/bioinformatics/btx308
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## author = {Anthony Chiu and Mahmood Ayub and Caroline Dive and Ged Brady and Crispin Miller},
## title = {twoddpcr: An R/Bioconductor package and Shiny app for Droplet Digital PCR analysis},
## journal = {Bioinformatics},
## publisher = {Oxford University Press},
## year = {2017},
## }
(Rödiger et al. 2015) describes how to
use R in order to analyse ddPCR data using the dpcR
package.
Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.4 ggplot2_3.5.1 twoddpcr_1.31.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] class_7.3-22 stringi_1.8.4 lattice_0.22-6
## [7] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
## [10] grid_4.4.2 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] Matrix_1.7-1 plyr_1.8.9 jsonlite_1.8.9
## [16] promises_1.3.0 BiocManager_1.30.25 mgcv_1.9-1
## [19] fansi_1.0.6 scales_1.3.0 jquerylib_0.1.4
## [22] cli_3.6.3 shiny_1.9.1 crayon_1.5.3
## [25] rlang_1.1.4 splines_4.4.2 munsell_0.5.1
## [28] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [31] tools_4.4.2 colorspace_2.1-1 httpuv_1.6.15
## [34] BiocGenerics_0.53.3 buildtools_1.0.0 vctrs_0.6.5
## [37] R6_2.5.1 mime_0.12 stats4_4.4.2
## [40] lifecycle_1.0.4 stringr_1.5.1 S4Vectors_0.45.2
## [43] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.8.0
## [46] hexbin_1.28.5 later_1.3.2 gtable_0.3.6
## [49] glue_1.8.0 Rcpp_1.0.13-1 xfun_0.49
## [52] tibble_3.2.1 sys_3.4.3 knitr_1.49
## [55] farver_2.1.2 xtable_1.8-4 nlme_3.1-166
## [58] htmltools_0.5.8.1 rmarkdown_2.29 maketools_1.3.1
## [61] labeling_0.4.3 compiler_4.4.2
twoddpcr
packagetwoddpcr
packagethresholdClassify
)kmeansClassify
)twoddpcr