---
title: "A guide to use the Polytect: an automatic clustering and labeling method for multi-color digital PCR data"
author: "Yao Chen"
data: "June 14, 2024"
output: rmarkdown::pdf_document
vignette: >
  %\VignetteIndexEntry{Polytect Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteDepends{Polytect, flowPeaks, ggplot2}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = 'png',
  dpi = 100,
  crop = NULL
)
```

## Introduction
Digital PCR (dPCR) is a state-of-the-art quantification method of target nucleic acids. The technology is based on massive partitioning of a reaction mixture into individual PCR reactions. This results in partition-level end-point fluorescence intensities that are subsequently used to classify partitions as positive or negative, i.e., containing or not containing the target nucleic acid(s). Many automatic dPCR partition classification methods have been proposed, but they are limited to the analysis of single or dual color dPCR data. While general-purpose or flow cytometry clustering methods can be directly applied to multi-color dPCR data, these methods do not exploit the approximate prior knowledge on cluster center locations available in dPCR data. 

We developed a novel method, `Polytect`, that relies on crude cluster results from flowPeaks, previously shown to offer good partition classification performance, and subsequently refines flowPeaks' results by cluster merging and automatic cluster labeling, exploiting the prior knowledge on cluster center locations. 

Once the package is accepted and added to Bioconductor, you will be able to install it with `BiocManager`:
```R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Polytect")
```

## Examples
To demonstrate the core functions of this package, we will use the HR digital PCR dataset, which can be accessed using the command data(HR). The HR dataset is a data frame consisting of 18,233 rows and 2 columns. The dataset has 4 clusters. The clustering analysis can be performed using the following commands.
```{r load package and data,warning=FALSE, message=FALSE}
library(Polytect)
library(ggplot2)
library(flowPeaks)
library(ddPCRclust)

data(HR)
head(HR)
ggplot(data=HR, aes(channel1, channel2))+ 
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))

```
If we perform flowPeaks only, there will be more clusters than expected.
```{r flowpeaks,warning=FALSE, message=FALSE}
data_scaled<-apply(HR,2,function(x) (x-min(x))/(max(x)-min(x)))
data_input<-as.matrix(data_scaled)
fp<-flowPeaks(data_input)

ggplot(data=HR, aes(channel1, channel2,colour = factor(fp$peaks.cluster)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))
```

The main function that performs flowPeaks first, then merges the excess clusters.
```{r flowpeaks and merge,warning=FALSE, message=FALSE}
result<-polytect_clust(data=HR,cluster_num=4)
print(head(result))

ggplot(data=HR, aes(channel1, channel2,colour = factor(result$cluster)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))
```

Or you can use any initial clustering results as an input to the polytect_merge function
```{r merge the data,warning=FALSE, message=FALSE}
## it is advised to standardize the data
dist_matrix <- dist(data_input)
hc <- hclust(dist_matrix, method = "ward.D2")
# the number of clusters is specified at 6, which is larger than 4
hc_clusters <- cutree(hc, k = 6)

ggplot(data=HR, aes(channel1, channel2,colour = factor(hc_clusters)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20,  r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))

hc_parse<-list()
hc_parse$cluster<-hc_clusters

result<-polytect_merge(data=HR,cluster_num=4,base_clust=hc_parse)
print(head(result))
```


The clustering results can be visualized by 2-d plots.
```{r plot the data,warning=FALSE, message=FALSE}
polytect_plot(result,cluster_num=4)
```
You can also summarise the results, which will give you cluster centers, group sizes and silhouette coefficients for each group
```{r summarise the results,warning=FALSE, message=FALSE}
result_summary<-polytect_summary(result)
print(result_summary)
```

You can also plot the individual silhouette coefficient in each cluster
```{r plot sil coefs,warning=FALSE, message=FALSE}
sil_plot(result)
```

There is a function to calculate the concentration of the targets.
```{r calculate the conc, warning=FALSE, message=FALSE}
target_conc<-conc_cal(result,cluster_num=4,sampvol=0.91,volmix=20,voltemp=20)
print(target_conc)
```

This package can also handle 3-up to 6-color dPCR data. We first perform flowPeaks only.
```{r 3d example, warning=FALSE, message=FALSE}
data(BPV)
data_scaled<-apply(BPV,2,function(x) (x-min(x))/(max(x)-min(x)))
data_input<-as.matrix(data_scaled)
fp<-flowPeaks(data_input)
table(fp$peaks.cluster)
df_data<-as.data.frame(cbind(BPV,cluster=fp$peaks.cluster))
polytect_plot(df_data,cluster_num=8)
```
Then the main function.
```{r 3d example polytect clust, warning=FALSE, message=FALSE}
result<-polytect_clust(data=BPV,cluster_num=8)
table(result$cluster)
polytect_plot(result,cluster_num = 8)
```
Polytect does not change the clustering result of flowPeaks. What it did is relabeling.

Comparison to the existing ddpcr package ddpcrclust
```{r ddpcrclust, warning=FALSE, message=FALSE}
exampleFiles <- list.files(paste0(find.package('ddPCRclust'), '/extdata'), full.names = TRUE)
template <- readTemplate(exampleFiles[9])
template$template[1,3]<-2
template$template[1,c(6,7)]<-''
data_list<-list()
data_list$ids<-'B01'
data_list$data$B01<-data.frame(Ch1.Amplitude=HR[,1],Ch2.Amplitude=HR[,2])  

result <- ddPCRclust(data_list,template = template)

ggplot(data=HR, 
       aes(channel1, channel2,colour = factor(result$B01$data$Cluster)))+
    geom_point(size=0.9,show.legend = FALSE) +
    labs(x = "color 1", y = "color 2") +
    theme(text = element_text(size = 15), 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"),
          plot.margin = margin(t = 20, r = 20, b = 20, l = 20),
          axis.title.x = element_text(margin = margin(t = 20)),
          axis.title.y = element_text(margin = margin(r = 20)))

```
## Session Information

The following information was obtained from the R session that generated this vignette:

```{r session_info, warning=FALSE, message=FALSE}
sessionInfo()
```

## References

Ge Y, Sealfon S (2012). “flowPeaks: a fast unsupervised clustering for flow cytometry data via K-means and density peak finding.” Bioinformatics. R package version 4.4.0.

Lun A (2024). bluster: Clustering Algorithms for Bioconductor. R package version 1.14.0.