---
title: "DegCre Introduction and Examples"
author: "Brian S. Roberts"
package: DegCre
output: BiocStyle::html_document
bibliography: degCreRefs.bib
vignette: >
  %\VignetteIndexEntry{DegCre Introduction and Examples}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Installation

DegCre can be installed from Bioconductor as follows:

``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DegCre")
```
Alternatively, DegCre can be installed from GitHub.
With the `devtools` pacakge installed, run:

```r
devtools::install_github("brianSroberts/DegCre")
```

## Introduction

DegCre associates differentially expressed genes (DEGs) with cis-regulatory
 elements (CREs) in a probabilistic, non-parametric approach. DegCre is 
 intended to be applied on differential expression and regulatory signal data 
 derived from responses to perturbations such as drug or natural ligand 
 treatmnents. As an example used here, we have obtained data from 
 McDowell et al. [@McDowell2018Glucocorticoid] which was generated by treating 
 A549 cells with dexamethasone and measuring RNA-seq and ChIP-seq data at 
 several time points. Data from RNA-seq and NR3C1 ChIP-seq at four hours 
 versus control is stored in the list `DexNR3C1`.

DegCre uses the 
[GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) 
framework for handling genomic regions and some calculations. As one input,
 `DegGR`, users generate differential expression statistics for genes with 
 methods such as 
 [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) or 
 [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html). These
values should then be associated with gene TSSs such as those available from 
EPDNew [@Dreos2015Eukaryotic; @Meylan2020EPD] in a `GRanges`. The second input,
 `CreGR`, is differential regulatory signal data (in the example, 
 NR3C1 ChIP-seq data) associated with genomic regions in a `GRanges` such as 
 those generated by 
 [csaw](https://bioconductor.org/packages/release/bioc/html/csaw.html). 

A complete description of the mathematical basis of the DegCre core algorithms 
is provided in [@roberts_probabilistic_2024]. DegCre generates a `Hits` object of all 
associations between `DegGR` and `CreGR` within a specified maximum distance.
Associations are then binned by TSS-to-CRE distance according to an algorithm 
that balances resolution (many bins with few members) versus minimization of 
the deviance of each bin's CRE p-value distribution from the global 
distribution, selecting an optimal bin size.

Next, DegCre applies a non-parametric algorithm to find concordance between 
DEG and CRE differential effects within bins and derives an association 
probability. For all association probabilities involving one given CRE, the 
probabilities are adjusted to favor associations across shorter distances. 
An FDR of the association probability is then estimated. Results are returned 
in list containing a `Hits` object and both input `GRanges`.

## Generating DegCre associations

To begin, we load the necessary packages:
```{r message = FALSE, warning = FALSE, setup}
library(GenomicRanges)
library(InteractionSet)
library(plotgardener)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
library(DegCre)
library(magick)
```

DegCre is run on the two input `GRanges`, `DegGR` and `CreGR`, and 
differential p-values associated with each. To require the effect directions 
be concordant between the changes in gene expression and CRE signal, users must
 also supply log fold-changes for each and set `reqEffectDirConcord=TRUE` 
 (default value). As example data, we use  `DexNR3C1` derived from McDowell et 
 al. [@McDowell2018Glucocorticoid] using [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) 
 and [csaw](https://bioconductor.org/packages/release/bioc/html/csaw.html):

```{r demo inputs}
data(DexNR3C1)
lapply(DexNR3C1,head)
```
We now run DegCre on this data with default values (subsetting input to "chr1" 
for expediency):

```{r demo DegCre default}
subDegGR <- DexNR3C1$DegGR[which(seqnames(DexNR3C1$DegGR)=="chr1")]
subCreGR <- DexNR3C1$CreGR[which(seqnames(DexNR3C1$CreGR)=="chr1")]

myDegCreResList <- runDegCre(DegGR=subDegGR,
                             DegP=subDegGR$pVal,
                             DegLfc=subDegGR$logFC,
                             CreGR=subCreGR,
                             CreP=subCreGR$pVal,
                             CreLfc=subCreGR$logFC,
                             reqEffectDirConcord=TRUE,
                             verbose=FALSE)
```
DegCre produces a list object with several items. These include the inputs 
`DegGR` and `CreGR`. It also contains a `Hits` object `degCreHits`. The 
`queryHits` of `degCreHits` index `DegGR` and the `subjectHits` index `CreGR`. 
`degCreHits` also has several metadata columns that include the key resuls of 
the algorithm. These include `assocProb` which is the probabilty of the 
association being true, and `assocProbFDR` which is the FDR of `assocProb` 
being greater than a null model based on TSS to CRE distance alone. Also, 
the results of the distance bin optimization heuristic, `binHeurOutputs`, 
and the input DEG significance threshold `alphaVal` are returned.

```{r demo DegCre result list}
names(myDegCreResList)
head(myDegCreResList$degCreHits)
```

To find associations driven by non-random changes in CRE signal, we subset 
the `degCreHits` by `assocProbFDR`. We can also further subset by `assocProb` 
to find those also more likely to confirm in a suitable validation experiment:

```{r demo subsetting}
passFDR <- which(mcols(myDegCreResList$degCreHits)$assocProbFDR <= 0.05)
passProb <- which(mcols(myDegCreResList$degCreHits)$assocProb >= 0.25)

keepDegCreHits <- myDegCreResList$degCreHits[intersect(passFDR,passProb)]

keepDegCreHits
```

We can then view the subsets of `DegGR` and `CreGR` that comprise the passing 
associations:

```{r demo GR subset}

myDegCreResList$DegGR[queryHits(keepDegCreHits)]

myDegCreResList$CreGR[subjectHits(keepDegCreHits)]
```

The choice of the threshold for DEG significance (DEG $\alpha$) is important 
in DegCre calculations. Accordingly, DegCre can be run over a set of 
thresholds to determine an optimal value. Here we select DEG $\alpha$ by 
finding the value that maximizes a normalized Precision-Recall Area Under 
the Curve (AUC):

```{r demo Alpha opt}
alphaOptList <- optimizeAlphaDegCre(DegGR=subDegGR,
                                    DegP=subDegGR$pVal,
                                    DegLfc=subDegGR$logFC,
                                    CreGR=subCreGR,
                                    CreP=subCreGR$pVal,
                                    CreLfc=subCreGR$logFC,
                                    reqEffectDirConcord=TRUE,
                                    verbose=FALSE)

alphaOptList$alphaPRMat

bestAlphaId <- which.max(alphaOptList$alphaPRMat[,4])

bestDegCreResList <- alphaOptList$degCreResListsByAlpha[[bestAlphaId]]
```

## Visualizing DegCre results

With an optimal DegCre result obtained, we can plot the association 
probabilities versus binned association distance to see the relationship 
between the two:

```{r demo assoc dist plot, fig.wide=TRUE}
par(mai=c(0.8,1.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=2)

binStats <- plotDegCreAssocProbVsDist(degCreResList=bestDegCreResList,
                                      assocProbFDRThresh=0.05)
```

The total number of associations passing a reasonable FDR (0.05 in the figure) 
drops fairly quickly as bin TSS to CRE distance increases (top panel). 
Similarly, the median association probability (black line in lower panel) 
drops with distance as well. The blue range is the interquartile range of the 
association probabilities. The red line in the above plot represents the null 
association probability, calculated as the probability if all CRE p-values in 
a given bin were identical.

The binning of the associations by distance is also important to the operation 
of the DegCre algorithm. The choice of the bin size (number of associations 
per bin) is done by an optimization heuristic that tries a wide range of bin 
sizes  and calculates the KS test statistic for each bin's CRE p-value 
distribution compared against the global (unbinned) CRE p-value distribution. 
The algorithm finds the smallest bin size for which the median KS-statistic 
is within a specified fraction, `fracMinKsMedianThresh`, of the range of that 
for all tested bin sizes. We visualize the results of this process:

```{r demo plot bin algorithm, fig.small=TRUE}
par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=1.5)

plotDegCreBinHeuristic(degCreResList=bestDegCreResList)
```
The red dot indicates the chosen bin size run with the default value of 
'fracMinKsMedianThresh = 0.2'.

Another way to evaluate the performance of the DegCre on a data set is to 
make a Precision-Recall (PR) curve. To calculate these values, we define as a 
perfect set of associations, i.e. one that would generate a PR Area Under the 
Curve (AUC) of 1, as containing associations to all significant DEGs with 
uniform association probabilities of 1. Such a set will not result from real 
data, but it serves as a useful benchmark. Comparison of relative values of PR 
AUCs for DegCre results will be most informative. We plot the PR curve for 
the example data:

```{r demo plot PR curve, fig.small=TRUE}
par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
par(lwd=1.5)

degCrePRAUC(degCreResList=bestDegCreResList)
```
The black line is the performance of the DegCre associations and the red 
region is values obtained from random shuffles.

## Obtaining the number of associations per DEG

Since DegCre calculates association probabilities, one can estimate the 
number of expected associations per significant DEG by summing the total 
associations that pass a chosen FDR. DEGs with many expected associations 
are more likely to have true associations. This can be done with the 
`getExpectAssocPerDEG()` function:

```{r demo expect associations}
expectAssocPerDegDf <- getExpectAssocPerDEG(degCreResList=bestDegCreResList,
                                            geneNameColName="GeneSymb",
                                            assocAlpha=0.05)
```

The result is a `DataFrame` with one entry per gene, sorted by the expected 
number of DEGs, `expectAssocs'.

```{r demo expect DataFrame}
head(expectAssocPerDegDf)
```
This result can be plotted as a histogram by the function 
`plotExpectedAssocsPerDeg()`:

```{r demo plot expect hist, fig.small=TRUE}
par(mai=c(0.8,0.6,0.4,0.1))
par(mgp=c(1.7,0.5,0))
par(cex.axis=1)
par(cex.lab=1.4)
par(bty="l")
plotExpectedAssocsPerDeg(expectAssocPerDegDf=expectAssocPerDegDf)
```
The dashed line indicates the median expected associations per DEG.

## Browser views of DegCre results for genes or regions

One of the genes with a high number of expected associations and also a known 
target of NR3C1 (Glucocorticoid receptor) is *ERRFI1*. Our test data 
`DexNR3C1` is derived from ChIP-seq of NR3C1 in cells treated with the 
powerful glucocorticoid, dexamethasone, so it is not surprising to find it 
with many expected associations. To visualize these associations, we can make 
a browser plot of the associations and underlying data with 
`plotBrowserDegCre()` which relies on the 
[plotgardener](https://bioconductor.org/packages/release/bioc/html/plotgardener.html) package.

```{r browser1, echo=TRUE, fig.wide=TRUE, message=FALSE,warning=FALSE}
browserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList,
                                 geneName="ERRFI1",
                                 geneNameColName="GeneSymb",
                                 CreSignalName="NR3C1",
                                 panelTitleFontSize=8,
                                 geneLabelFontsize=10,
                                 plotXbegin=0.9,
                                 plotWidth=5)
```


We now want to zoom in closer to the TSS of *ERRFI1*. To do this we specify 
the region we want to plot as a `GRanges` of length 1:

```{r demo zoom browser}
zoomGR <- GRanges(seqnames="chr1",
                  ranges=IRanges(start=7900e3,end=8400e3))
```

We then supply the `GRanges` to the `plotRegionGR` argument of 
`plotBrowserDegCre()`:

```{r plot browser2, echo=TRUE, fig.wide=TRUE, message=FALSE, warning=FALSE}
zoomBrowserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList,
                                                   plotRegionGR=zoomGR,
                                                   geneNameColName="GeneSymb",
                                                   CreSignalName="NR3C1",
                                                   panelTitleFontSize=8,
                                                   geneLabelFontsize=10,
                                                   plotXbegin=0.9,
                                                   plotWidth=5)
```


Since we did not specify `geneName` it is not highlighted. We can force 
highlights by supplying a `DataFrame` of gene names and colors to the 
`geneHighlightDf` argument. This `DataFrame` should conform to the 
requirements of `plotgardener::plotGenes()`.

```{r demo highlight browser}
geneHighDf <- data.frame(gene=bestDegCreResList$DegGR$GeneSymb,
                         col="gray")
geneHighDf[which(geneHighDf$gene=="ERRFI1"),2] <- "black"
```

Then `geneHighDf` is supplied to the `geneHighlightDf` argument to produce:

```{r plot browser3, echo=TRUE, fig.wide=TRUE, message=FALSE, warning=FALSE}
highLightBrowserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList,
                                          plotRegionGR=zoomGR,
                                          CreSignalName="NR3C1",
                                          geneNameColName="GeneSymb",
                                          geneHighlightDf=geneHighDf,
                                          panelTitleFontSize=8,
                                          geneLabelFontsize=10,
                                          plotXbegin=0.9,
                                          plotWidth=5)
```


## Conversion of DegCre results to other formats

The list of results that `DegCre()` produces has many useful features for 
downstream analysis. However, conversion to other formats will be useful in 
some situations. The function `convertdegCreResListToGInteraction()` will 
convert the DegCre list to a `GInteractions` object from the 
[InteractionSet](https://www.bioconductor.org/packages/release/bioc/html/InteractionSet.html) package:

```{r demo convert to GInter}
degCreGInter <-
    convertdegCreResListToGInteraction(degCreResList=bestDegCreResList,
                                       assocAlpha=0.05)

degCreGInter
```

The DegCre results list can can also be converted to a `DataFrame` in roughly 
a BEDPE format. This is useful for writing out DegCre results as text files:

```{r demo convert to DataFrame}
degCreDf <- convertDegCreDataFrame(degCreResList=bestDegCreResList,
                                   assocAlpha=0.05)

head(degCreDf)
```
## Session Info

```{r}
sessionInfo()
```

## References