---
title: "Analyzing High Density Peptide Array Data using HERON"
output: 
    rmarkdown::html_vignette:
        toc: true
        number_sections: true
vignette: >
    %\VignetteIndexEntry{Analyzing High Density Peptide Array Data using HERON}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# HERON

<!-- badges: start -->
<!-- badges: end -->

The goal of HERON (**H**ierarchical **E**pitope p**RO**tein bi**N**ding) is to
analyze peptide binding array data measured from nimblegen or any high density
peptide binding array data.

## Installation

You can install the released version of HERON from 
[bioconductor](https://www.bioconductor.org/) with:

``` r
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("HERON")
```

And the development version from 
[GitHub](https://github.com/Ong-Research/HERON) with:

``` r
# install.packages("devtools")
devtools::install_github("Ong-Research/HERON")
```


```{r setup}
options(warn=2)
suppressPackageStartupMessages(library(HERON))
```

## Example

These are examples which shows you how to interact with the code.  We will be
using a subset of the COVID-19 peptide binding array dataset from the 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8245122/ publication.

### The *HERONSequenceData* object

The *HERONSequenceDataSet* object is the first object needed for running HERON.
The *HERONSequenceDataSet* is a subclass of *SummarizedExperiment* from 
base Bioconductor. The components are a sequence *data.frame* or *matrix* where 
rows are amino acid sequences of the peptide binding probe and the columns are
samples, a colData *DFrame*, and a metadata *data.frame* that maps 
sequences to probes.

#### Sequence matrix

```{r, example_seq_mat}
data(heffron2021_wuhan)
knitr::kable(assay(heffron2021_wuhan)[1:5,1:5])
```

#### colData

The *colData* data frame describes the experimental setup of the samples.  
For the *colData* data frame, the important columns are *ptid*, *visit*, 
and *SampleName*.  The *SampleName* is the column name of the sequence table, 
*ptid* is the name of the sample, and  *visit* is either pre or post.  
The *ptid* can the same value for one pre and post sample, which would 
indicate a paired design. A *condition* column can also be provided which 
can indicate multiple conditions for the post samples.

```{r example_colData}
knitr::kable(head(colData(heffron2021_wuhan)))
```

#### Probe metadata

The probe_meta data frame describes the mapping the amino acid sequence 
(*PROBE_SEQUENCE*) of the probes to the probe identifier (*PROBE_ID*).  The
PROBE_ID contains the name of the protein and the position within the
protein of the first acid of the probe sequence separated by a semicolon. 


```{r example_probe_meta}
probe_meta <- metadata(heffron2021_wuhan)$probe_meta
knitr::kable(head(probe_meta[,c("PROBE_SEQUENCE", "PROBE_ID")]))
```

### Creating the *HERONSequenceData* object

The *heffron2021_wuhan* data object is already a *HERONSequenceDataSet*
object. We can create the same object using the *HERONSequenceDataSet*
constructor as an example.

```{r, create_object}
seq_mat <- assay(heffron2021_wuhan)
sds <- HERONSequenceDataSet(seq_mat)
colData(sds) <- colData(heffron2021_wuhan)
metadata(sds)$probe_meta <- probe_meta
```

### Pre-process data

The data is pre-processed by first quantile normalizing on sequence-level data.

```{r, quantile_normalize}
seq_ds_qn <- quantileNormalize(sds)
knitr::kable(head(assay(seq_ds_qn)[,1:5]))
```

### Calculate probe-level p-values

The quantile normalized data is then used with the *colData* and *probe_meta*
data frames to calculate probe-level p-values.  

The *calcCombPValues* call will first calculate the probe level
p-values using the colData, adjust the probe p-values on a 
column-by-column basis frame. 

The combined p-values can be calculated on the sequence data set, then
converted to a probe data set through *convertSequenceDSToProbeDS*, or 
calculated on the probe data set.  If the data was smoothed prior using 
consecutive probes on the same protein, then the p-values 
should calculated on the smoothed probe data set.

```{r example_probe_pvalues}
seq_pval_res <- calcCombPValues(seq_ds_qn)
probe_pval_res <- convertSequenceDSToProbeDS(seq_pval_res)
knitr::kable(head(assay(probe_pval_res, "padj")[,1:5]))
```

### Obtain probe-level calls

Once the p-values have been calculated, the *makeProbeCalls* will make the
calls using the adjusted p-values calculated.  *makeProbeCalls* also includes
a filter to remove one-hit probes, which are probes that were called only
in one sample and do not have a consecutive probe called in a single sample.
The return value is *HERONProbeDataSet* with the calls included as an 
additional assay.

```{r, example_probe_calls}
probe_calls_res <- makeProbeCalls(probe_pval_res)
knitr::kable(head(assay(probe_calls_res, "calls")[,1:5]))
```

K of N, Fraction, and Percent of calls made per sequence, probe, epitope, 
or protein can be determined using the *getKofN* function.

```{r}
probe_calls_k_of_n <- getKofN(probe_calls_res)
knitr::kable(head(probe_calls_k_of_n))
```

### Find Epitope Segments using the unique method

After calculating probe p-values and calls, the HERON can find epitope segments,
i.e. blocks of consecutive probes that have been called within a sample or
cluster of samples.  The unique method finds all consecutive probes for
each sample, then returns the unique set of all epitopic regions.


```{r, example_epitope_finding_unique}

epi_segments_uniq_res <- findEpitopeSegments(
    probe_calls_res,
    segment_method = "unique"
)

knitr::kable(head(epi_segments_uniq_res))


```

The format of the epitope segments are SEQID_Begin_End, where
the *SEQID* is the protein name, *Begin* is the starting position of the first 
probe within the consecutive block, and the *End* is the starting position of 
the last probe within the consecutive block.

### Calculate Epitope-level p-values

With the epitope segments and the probe p-values, HERON can calculate a 
significance value for the segments using a meta p-value method.  Here,
we are calculating epitope p-values using Wilkinson's max meta p-value method.

```{r, example_epitope_pvalues}

epi_padj_uniq <- calcEpitopePValues(
    probe_calls_res,
    epitope_ids = epi_segments_uniq_res,
    metap_method = "wmax1"
)

```

You can add sequence annotations to the row data of the HERONEpitopeDataSet
object.

```{r}
epi_padj_uniq <- addSequenceAnnotations(epi_padj_uniq)
```

### Obtain Epitope-level calls

The *makeEpitopeCalls* method will work on the epitope adjusted p-values to 
make epitope-level sample. *getKofN* can also get the K of N results.

```{r, example_epitope_calls}

epi_calls_uniq <- makeEpitopeCalls(epi_padj_uniq, one_hit_filter = TRUE)
epi_calls_k_of_n_uniq <- getKofN(epi_calls_uniq)
knitr::kable(head(epi_calls_k_of_n_uniq))

```

### Calculate Protein-level p-values

Calculate protein p-values using Tippett's (Wilkinson's Min) method.

```{r, example_protein_pvalues}
prot_padj_uniq <- calcProteinPValues(
    epi_padj_uniq,
    metap_method = "tippetts"
)
```

### Obtain Protein-level calls

```{r, example_protein_calls}
prot_calls_uniq <- makeProteinCalls(prot_padj_uniq)
prot_calls_k_of_n_uniq <- getKofN(prot_calls_uniq)
knitr::kable(head(prot_calls_k_of_n_uniq))
```

## Other 

The are other functions to utilize depending upon what you would like to do.
For example, there are different methods for finding the epitope
segments that involve clustering across samples.  There are two main
methods, one using hierarchical clustering and another using
the *skater* method from the *spdep* package.  In addition to the
two methods, how the distance matrix is calculated can be modified.  The
following subsections demonstrate how to find epitope segment blocks using
*hclust* or *skater* and using a binary hamming distance or a numeric
euclidean distance for making calls.  After the segments are found, you can
then use the *calcEpitopePValuesMat* and *makeEpitopeCalls* functions as before
to find the epitope p-values, protein p-values, and the respective calls. 

### Find Epitope Segments using the hclust method

#### binary calls with hamming distance

```{r, example_epitope_finding_hclust}
epi_segments_hclust_res <- findEpitopeSegments(
    probe_calls_res,
    segment_method = "hclust",
    segment_score_type = "binary",
    segment_dist_method = "hamming",
    segment_cutoff = "silhouette"
)
```

#### z-scores with euclidean distance

```{r, example_epitope_finding_hclust2}
epi_segments_hclust_res2 <- findEpitopeSegments(
    probe_calls_res,
    segment_method = "hclust",
    segment_score_type = "zscore",
    segment_dist_method = "euclidean",
    segment_cutoff = "silhouette"
)
```

### Find Epitope Segments using the skater method

#### binary calls with hamming distance

```{r, example_epitope_finding_skater}
epi_segments_skater_res <- findEpitopeSegments(
    probe_calls_res,
    segment_method = "skater",
    segment_score_type = "binary",
    segment_dist_method = "hamming",
    segment_cutoff = "silhouette"
)
```

#### z-scores with euclidean distance

```{r, example_epitope_finding_skater2}
epi_segments_skater_res <- findEpitopeSegments(
    probe_calls_res,
    segment_method = "skater",
    segment_score_type = "zscore",
    segment_dist_method = "euclidean",
    segment_cutoff = "silhouette"
)
```

### Other meta p-value methods

HERON also allows a choice for a number of meta p-value methods for combining 
p-values for epitopes (*calcEpitopePValuesPDS*) and proteins 
(*calcProteinPValuesEDS*).  The methods supported by HERON are : 
*min_bonf*, 
*min*, 
*max*, 
*fischer*/*sumlog*, 
*hmp/harmonicmeanp*, 
*wilkinsons_min1/tippets*, 
*wilkinsons_min2/wmin2*, 
*wilkinsons_min3*, 
*wilkinsons_min4*, 
*wilkinsons_min5*, 
*wilkinsons_max1/wmax1*, 
*wilkinsons_max2/wmax2*, 
and *cct*.  

When choosing a p-value method, keep in mind that the epitope p-value should 
be one that requires most of the probe p-values to be small (e.g. *wmax1*) and 
the protein should be one that requires at least one of the epitope p-values to 
be small (e.g. *wmax1*).  Other p-value methods such as the *cct* and the *hmp*
have been shown to be more accurate with p-value that have dependencies.

### Making z-score cutoff calls

Some investigators would like to make z-score global level calls rather than
using adjusted p-values.  HERON is flexible to allow for such a setup.
For example, the user wanted to make probe-level calls using a 
z-score cutoff > 2.  

```{r, pvalue_zscore}

seq_pval_res_z <- calcCombPValues(
    obj = seq_ds_qn, 
    use = "z", 
    p_adjust_method = "none"
)


p_cutoff <- pnorm(2, lower.tail = FALSE)

probe_pval_res_z <- convertSequenceDSToProbeDS(seq_pval_res_z, probe_meta)

probe_calls_z2 <- makeProbeCalls(probe_pval_res_z, padj_cutoff = p_cutoff)
probe_k_of_n_z2 <- getKofN(probe_calls_z2)

knitr::kable(head(assay(probe_calls_z2,"calls")[,1:5]))
knitr::kable(head(probe_k_of_n_z2[probe_k_of_n_z2$K > 0,]))

```

The calls can also be used to find epitopes segments, p-values, and calls.
Also, can be used to make protein level calls.

```{r, pvalue_zscore_uniq}
epi_segments_uniq_z2_res <- findEpitopeSegments(
    probe_calls_z2,
    segment_method = "unique"
)
```

If we want to find regions where the z > 2 for all of the consecutive probes,
using the *max* meta p-value method will ensure that.

```{r, pvalue_zscore_epi_pval}

epi_pval_uniq_z2 <- calcEpitopePValues(
    probe_pds = probe_pval_res_z,
    epitope_ids = epi_segments_uniq_z2_res,
    metap_method = "max",
    p_adjust_method = "none"
)
```

Again, *makeEpitopeCalls* will be used in order to find significant regions.

```{r, pvalue_zscore_epi_calls}
epi_calls_uniq_z2 <- makeEpitopeCalls(
    epi_ds = epi_pval_uniq_z2, 
    padj_cutoff = p_cutoff
)
```

Other segmentation methods can be used with the called regions through the
binary score clustering methods.

```{r, pvalue_zscore_skater}
epi_segments_skater_z2_res <- findEpitopeSegments(
    probe_calls_z2,
    segment_method = "skater",
    segment_score_type = "binary",
    segment_dist_method = "hamming",
    segment_cutoff = "silhouette")
```

### Smoothing across probes

The pepStat paper (https://pubmed.ncbi.nlm.nih.gov/23770318/) mentions that
smoothing can sometimes help reduce the high-variability of the peptide array 
data.  HERON has implemented a running average smoothing algorithm across 
protein probes that can handle missing values. The function *smoothProbeMat* 
will take as input a probe matrix and will return a matrix that has been 
smoothed.  The smoothed matrix can then be processed through the workflow 
using the *calcProbePValuesProbeMat* call instead of the 
*calcProbePValuesSeqMat* call since now the probe signal is no longer a direct 
copy from the sequence.

```{r, smooth_probes}

probe_ds_qn <- convertSequenceDSToProbeDS(seq_ds_qn, probe_meta )

smooth_ds <- smoothProbeDS(probe_ds_qn)
```

After you smoothed the data using the probes, the probe p-value will be
calculated on the probe-level rather than the sequence-level.

```{r, smooth_probes_pval}
probe_sm_pval <- calcCombPValues(smooth_ds)
```

The probe calls can then be made as before.

```{r, smooth_probes_calls}
probe_sm_calls <- makeProbeCalls(probe_sm_pval)
probe_sm_k_of_n <- getKofN(probe_sm_calls)
knitr::kable(assay(probe_sm_calls,"calls")[1:3,1:3])
knitr::kable(head(probe_sm_k_of_n[probe_sm_k_of_n$K > 0,]))
```

### Calculate paired t-tests

Paired t-tests can be utilized and combined with the z-tests as well.  Here is
an example where we pair the five COVID- samples to the COVID+ samples and
run *calcProbePValuesSeqMat* with the *d_paired* parameter set.

```{r, paired_t_tests}
data(heffron2021_wuhan)
probe_meta <- attr(heffron2021_wuhan, "probe_meta")
colData_paired <- colData(heffron2021_wuhan)

## Make some samples paired by setting the sample ids
pre_idx <- which(colData_paired$visit == "pre")
colData_post <- colData_paired[colData_paired$visit == "post",]
new_ids <- colData_post$SampleName[seq_len(5)]
colData_paired$ptid[pre_idx[seq_len(5)]] = new_ids

paired_ds <- heffron2021_wuhan
colData(paired_ds) <- colData_paired

## calculate p-values
pval_res <- calcCombPValues(
    obj = paired_ds,
    d_paired = TRUE
)

knitr::kable(assay(pval_res[1:3,],"pvalue"))

```

### Use the wilcox test for probe-level p-values

To account for non-normality, HERON implements the two sample wilcox test in
placed of the t-test used for the differential p-values.  Below is an
example of how to use it.

```{r, example_wilcox_probe_pvalues}
seq_pval_res_w <- calcCombPValues(
    obj = seq_ds_qn, 
    use = "w", d_abs_shift = 1
)
probe_pval_res_w <- convertSequenceDSToProbeDS(seq_pval_res_w)
knitr::kable(assay(probe_pval_res_w[1:3, 1:5], "padj"))
```

### Use of the condition column

You can use the *condition* column in the *colData* dataframe.
It is used in the *getKofN* function. The data type of the 
column can be either a character or factor. When the condition
column exists in the *colData*, three columns (number of 
samples (K), frequency (F), and percent (P)) are provided for
each unique factor/character value.

```{r}
col_data <- colData(heffron2021_wuhan)
covid <- which(col_data$visit == "post")

col_data$condition[covid[1:10]] <- "COVID2"

seq_ds <- heffron2021_wuhan
colData(seq_ds) <- col_data

seq_ds_qn <- quantileNormalize(seq_ds)
seq_pval_res <- calcCombPValues(seq_ds_qn)
probe_pval_res <- convertSequenceDSToProbeDS(seq_pval_res)
probe_calls_res <- makeProbeCalls(probe_pval_res)
probe_calls_k_of_n <- getKofN(probe_calls_res)
probe_calls_k_of_n <- 
    probe_calls_k_of_n[
        order(probe_calls_k_of_n$K, decreasing = TRUE),]
knitr::kable(head(probe_calls_k_of_n))
```


## Funding

HERON and its developers have been partially supported by funding from the 
Clinical and Translational Science Award (CTSA) program (ncats.nih.gov/ctsa), 
through the National Institutes of Health National Center for Advancing 
Translational Sciences (NCATS), grants UL1TR002373 and KL2TR002374, NIH National
Institute of Allergy and Infectious Diseases, 2U19AI104317-06, NIH National
Cancer Institute (P30CA14520, P50CA278595, and P01CA250972), startup funds 
through the University of Wisconsin Department of Obstetrics and Gynecology and 
the University of Wisconsin-Madison Office of the Chancellor and the Vice 
Chancellor for Research and Graduate Education with funding from the Wisconsin 
Alumni Research Foundation through the Data Science Initiative award.

## Acknowledgments

We have benefited in the development of *HERON* from the help and
feedback of many individuals, including but not limited to: 

The Bioconductor Core Team,
Paul Sondel, 
Anna Hoefges, 
Amy Erbe Gurel, 
Jessica Vera,
Rene Welch,
Ken Lo (Nimble Therapeutics),
Brad Garcia (Nimble Therapeutics),
Jigar Patel (Nimble Therapeutics),
John Tan,
Nicholas Mathers,
Richard Pinapatti.

## Session Info

```{r, session_info}
sessionInfo()
```