A guide to use the Polytect: an automatic clustering and labeling method for multi-color digital PCR data

Introduction

Digital PCR (dPCR) is a state-of-the-art quantification method of target nucleic acids. The technology is based on massive partitioning of a reaction mixture into individual PCR reactions. This results in partition-level end-point fluorescence intensities that are subsequently used to classify partitions as positive or negative, i.e., containing or not containing the target nucleic acid(s). Many automatic dPCR partition classification methods have been proposed, but they are limited to the analysis of single or dual color dPCR data. While general-purpose or flow cytometry clustering methods can be directly applied to multi-color dPCR data, these methods do not exploit the approximate prior knowledge on cluster center locations available in dPCR data.

We developed a novel method, Polytect, that relies on crude cluster results from flowPeaks, previously shown to offer good partition classification performance, and subsequently refines flowPeaks’ results by cluster merging and automatic cluster labeling, exploiting the prior knowledge on cluster center locations.

Once the package is accepted and added to Bioconductor, you will be able to install it with BiocManager:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Polytect")

Examples

To demonstrate the core functions of this package, we will use the HR digital PCR dataset, which can be accessed using the command data(HR). The HR dataset is a data frame consisting of 18,233 rows and 2 columns. The dataset has 4 clusters. The clustering analysis can be performed using the following commands.

library(Polytect)
library(ggplot2)
library(flowPeaks)
library(ddPCRclust)

data(HR)
head(HR)
#>   channel1  channel2
#> 1 32.25462 10622.230
#> 2 32.55316 10081.106
#> 3 37.41653 10363.010
#> 4 39.65380 10140.198
#> 5 40.31481 10319.560
#> 6 42.58688  9864.532
ggplot(data=HR, aes(channel1, channel2))+ 
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))

If we perform flowPeaks only, there will be more clusters than expected.

data_scaled<-apply(HR,2,function(x) (x-min(x))/(max(x)-min(x)))
data_input<-as.matrix(data_scaled)
fp<-flowPeaks(data_input)
#>         step 0, set the intial seeds, tot.wss=0.0767292
#>         step 1, do the rough EM, tot.wss=0.0519279 at 0.065455 sec
#>         step 2, do the fine transfer of Hartigan-Wong Algorithm
#>                  tot.wss=0.0483346 at 0.124875 sec

ggplot(data=HR, aes(channel1, channel2,colour = factor(fp$peaks.cluster)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))

The main function that performs flowPeaks first, then merges the excess clusters.

result<-polytect_clust(data=HR,cluster_num=4)
#>         step 0, set the intial seeds, tot.wss=0.0767292
#>         step 1, do the rough EM, tot.wss=0.0519279 at 0.063664 sec
#>         step 2, do the fine transfer of Hartigan-Wong Algorithm
#>                  tot.wss=0.0483346 at 0.12303 sec
print(head(result))
#>   channel1  channel2 cluster
#> 1 32.25462 10622.230       3
#> 2 32.55316 10081.106       3
#> 3 37.41653 10363.010       3
#> 4 39.65380 10140.198       3
#> 5 40.31481 10319.560       3
#> 6 42.58688  9864.532       3

ggplot(data=HR, aes(channel1, channel2,colour = factor(result$cluster)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))

Or you can use any initial clustering results as an input to the polytect_merge function

## it is advised to standardize the data
dist_matrix <- dist(data_input)
hc <- hclust(dist_matrix, method = "ward.D2")
# the number of clusters is specified at 6, which is larger than 4
hc_clusters <- cutree(hc, k = 6)

ggplot(data=HR, aes(channel1, channel2,colour = factor(hc_clusters)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20,  r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))


hc_parse<-list()
hc_parse$cluster<-hc_clusters

result<-polytect_merge(data=HR,cluster_num=4,base_clust=hc_parse)
print(head(result))
#>   channel1  channel2 cluster
#> 1 32.25462 10622.230       3
#> 2 32.55316 10081.106       3
#> 3 37.41653 10363.010       3
#> 4 39.65380 10140.198       3
#> 5 40.31481 10319.560       3
#> 6 42.58688  9864.532       3

The clustering results can be visualized by 2-d plots.

polytect_plot(result,cluster_num=4)

You can also summarise the results, which will give you cluster centers, group sizes and silhouette coefficients for each group

result_summary<-polytect_summary(result)
print(result_summary)
#>   cluster mean_channel1 mean_channel2 cluster_size silhouette_coef
#> 1       1      155.2730      1249.602        17342            0.98
#> 2       2     1330.4427      1216.940          259            0.87
#> 3       3      122.9398     10046.944          291            0.88
#> 4       4      988.5556     10044.658          341            0.65

You can also plot the individual silhouette coefficient in each cluster

sil_plot(result)

There is a function to calculate the concentration of the targets.

target_conc<-conc_cal(result,cluster_num=4,sampvol=0.91,volmix=20,voltemp=20)
print(target_conc)
#>   target concentration
#> 1      1      36.77032
#> 2      2      38.76640

This package can also handle 3-up to 6-color dPCR data. We first perform flowPeaks only.

data(BPV)
data_scaled<-apply(BPV,2,function(x) (x-min(x))/(max(x)-min(x)))
data_input<-as.matrix(data_scaled)
fp<-flowPeaks(data_input)
#>         step 0, set the intial seeds, tot.wss=1.00728
#>         step 1, do the rough EM, tot.wss=0.721981 at 0.220232 sec
#>         step 2, do the fine transfer of Hartigan-Wong Algorithm
#>                  tot.wss=0.708295 at 0.344272 sec
table(fp$peaks.cluster)
#> 
#>     1     2     3     4     5     6     7 
#> 24401   482   323   245     6     3     1
df_data<-as.data.frame(cbind(BPV,cluster=fp$peaks.cluster))
polytect_plot(df_data,cluster_num=8)

Then the main function.

result<-polytect_clust(data=BPV,cluster_num=8)
#>         step 0, set the intial seeds, tot.wss=1.00728
#>         step 1, do the rough EM, tot.wss=0.721981 at 0.232187 sec
#>         step 2, do the fine transfer of Hartigan-Wong Algorithm
#>                  tot.wss=0.708295 at 0.355961 sec
table(result$cluster)
#> 
#>     1     2     3     4     5     6     7 
#> 24401   482   245   323     3     6     1
polytect_plot(result,cluster_num = 8)

Polytect does not change the clustering result of flowPeaks. What it did is relabeling.

Comparison to the existing ddpcr package ddpcrclust

exampleFiles <- list.files(paste0(find.package('ddPCRclust'), '/extdata'), full.names = TRUE)
template <- readTemplate(exampleFiles[9])
template$template[1,3]<-2
template$template[1,c(6,7)]<-''
data_list<-list()
data_list$ids<-'B01'
data_list$data$B01<-data.frame(Ch1.Amplitude=HR[,1],Ch2.Amplitude=HR[,2])  

result <- ddPCRclust(data_list,template = template)
#>         step 0, set the intial seeds, tot.wss=1.22273e+06
#>         step 1, do the rough EM, tot.wss=848836 at 0.042462 sec
#>         step 2, do the fine transfer of Hartigan-Wong Algorithm
#>                  tot.wss=804148 at 0.091847 sec

ggplot(data=HR, 
       aes(channel1, channel2,colour = factor(result$B01$data$Cluster)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))

## Session Information

The following information was obtained from the R session that generated this vignette:

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ddPCRclust_1.27.0 flowPeaks_1.53.2  ggplot2_3.5.1     Polytect_0.99.5  
#> [5] rmarkdown_2.29   
#> 
#> loaded via a namespace (and not attached):
#>   [1] tidyselect_1.2.1    farver_2.1.2        dplyr_1.1.4        
#>   [4] R.utils_2.12.3      bitops_1.0-9        fastmap_1.2.0      
#>   [7] ParamHelpers_1.14.2 digest_0.6.37       lifecycle_1.0.4    
#>  [10] cluster_2.1.8       survival_3.8-3      parallelMap_1.5.1  
#>  [13] magrittr_2.0.3      smoof_1.6.0.3       compiler_4.4.2     
#>  [16] rlang_1.1.5         sass_0.4.9          tools_4.4.2        
#>  [19] plotrix_3.8-4       yaml_2.3.10         data.table_1.16.4  
#>  [22] knitr_1.49          sn_2.1.1            labeling_0.4.3     
#>  [25] interp_1.1-6        mnormt_2.1.1        RColorBrewer_1.1-3 
#>  [28] mlrMBO_1.1.5.1      abind_1.4-8         KernSmooth_2.23-26 
#>  [31] withr_3.0.2         RProtoBufLib_2.19.0 numDeriv_2016.8-1.1
#>  [34] R.oo_1.27.0         BiocGenerics_0.53.6 sys_3.4.3          
#>  [37] polyclip_1.10-7     grid_4.4.2          rgenoud_5.9-0.11   
#>  [40] stats4_4.4.2        latticeExtra_0.6-30 mlr_2.19.2         
#>  [43] caTools_1.18.3      SamSPECTRAL_1.60.0  colorspace_2.1-1   
#>  [46] MASS_7.3-64         scales_1.3.0        gtools_3.9.5       
#>  [49] BBmisc_1.13         cli_3.6.3           mvtnorm_1.3-3      
#>  [52] generics_0.1.3      IDPmisc_1.1.21      cachem_1.1.0       
#>  [55] flowCore_2.19.0     splines_4.4.2       parallel_4.4.2     
#>  [58] BiocManager_1.30.25 matrixStats_1.5.0   vctrs_0.6.5        
#>  [61] Matrix_1.7-2        jsonlite_1.8.9      carData_3.0-5      
#>  [64] cytolib_2.19.3      car_3.1-3           S4Vectors_0.45.2   
#>  [67] Formula_1.2-5       clue_0.3-66         jpeg_0.1-10        
#>  [70] maketools_1.3.1     DiceKriging_1.6.0   flowDensity_1.41.0 
#>  [73] hexbin_1.28.5       jquerylib_0.1.4     glue_1.8.0         
#>  [76] cowplot_1.1.3       stringi_1.8.4       gtable_0.3.6       
#>  [79] deldir_2.0-4        munsell_0.5.1       tibble_3.2.1       
#>  [82] pillar_1.10.1       htmltools_0.5.8.1   gplots_3.2.0       
#>  [85] R6_2.5.1            lhs_1.2.0           evaluate_1.0.3     
#>  [88] tidyverse_2.0.0     lattice_0.22-6      Biobase_2.67.0     
#>  [91] R.methodsS3_1.8.2   png_0.1-8           backports_1.5.0    
#>  [94] flowViz_1.71.0      bslib_0.9.0         Rcpp_1.0.14        
#>  [97] fastmatch_1.1-6     checkmate_2.3.2     xfun_0.50          
#> [100] buildtools_1.0.0    pkgconfig_2.0.3

References

Ge Y, Sealfon S (2012). “flowPeaks: a fast unsupervised clustering for flow cytometry data via K-means and density peak finding.” Bioinformatics. R package version 4.4.0.

Lun A (2024). bluster: Clustering Algorithms for Bioconductor. R package version 1.14.0.