--- title: "DegCre Introduction and Examples" author: "Brian S. Roberts" package: DegCre output: BiocStyle::html_document bibliography: degCreRefs.bib vignette: > %\VignetteIndexEntry{DegCre Introduction and Examples} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Installation DegCre can be installed from Bioconductor as follows: ``` r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DegCre") ``` Alternatively, DegCre can be installed from GitHub. With the `devtools` pacakge installed, run: ```r devtools::install_github("brianSroberts/DegCre") ``` ## Introduction DegCre associates differentially expressed genes (DEGs) with cis-regulatory elements (CREs) in a probabilistic, non-parametric approach. DegCre is intended to be applied on differential expression and regulatory signal data derived from responses to perturbations such as drug or natural ligand treatmnents. As an example used here, we have obtained data from McDowell et al. [@McDowell2018Glucocorticoid] which was generated by treating A549 cells with dexamethasone and measuring RNA-seq and ChIP-seq data at several time points. Data from RNA-seq and NR3C1 ChIP-seq at four hours versus control is stored in the list `DexNR3C1`. DegCre uses the [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) framework for handling genomic regions and some calculations. As one input, `DegGR`, users generate differential expression statistics for genes with methods such as [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) or [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html). These values should then be associated with gene TSSs such as those available from EPDNew [@Dreos2015Eukaryotic; @Meylan2020EPD] in a `GRanges`. The second input, `CreGR`, is differential regulatory signal data (in the example, NR3C1 ChIP-seq data) associated with genomic regions in a `GRanges` such as those generated by [csaw](https://bioconductor.org/packages/release/bioc/html/csaw.html). A complete description of the mathematical basis of the DegCre core algorithms is provided in [@roberts_probabilistic_2024]. DegCre generates a `Hits` object of all associations between `DegGR` and `CreGR` within a specified maximum distance. Associations are then binned by TSS-to-CRE distance according to an algorithm that balances resolution (many bins with few members) versus minimization of the deviance of each bin's CRE p-value distribution from the global distribution, selecting an optimal bin size. Next, DegCre applies a non-parametric algorithm to find concordance between DEG and CRE differential effects within bins and derives an association probability. For all association probabilities involving one given CRE, the probabilities are adjusted to favor associations across shorter distances. An FDR of the association probability is then estimated. Results are returned in list containing a `Hits` object and both input `GRanges`. ## Generating DegCre associations To begin, we load the necessary packages: ```{r message = FALSE, warning = FALSE, setup} library(GenomicRanges) library(InteractionSet) library(plotgardener) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) library(DegCre) library(magick) ``` DegCre is run on the two input `GRanges`, `DegGR` and `CreGR`, and differential p-values associated with each. To require the effect directions be concordant between the changes in gene expression and CRE signal, users must also supply log fold-changes for each and set `reqEffectDirConcord=TRUE` (default value). As example data, we use `DexNR3C1` derived from McDowell et al. [@McDowell2018Glucocorticoid] using [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) and [csaw](https://bioconductor.org/packages/release/bioc/html/csaw.html): ```{r demo inputs} data(DexNR3C1) lapply(DexNR3C1,head) ``` We now run DegCre on this data with default values (subsetting input to "chr1" for expediency): ```{r demo DegCre default} subDegGR <- DexNR3C1$DegGR[which(seqnames(DexNR3C1$DegGR)=="chr1")] subCreGR <- DexNR3C1$CreGR[which(seqnames(DexNR3C1$CreGR)=="chr1")] myDegCreResList <- runDegCre(DegGR=subDegGR, DegP=subDegGR$pVal, DegLfc=subDegGR$logFC, CreGR=subCreGR, CreP=subCreGR$pVal, CreLfc=subCreGR$logFC, reqEffectDirConcord=TRUE, verbose=FALSE) ``` DegCre produces a list object with several items. These include the inputs `DegGR` and `CreGR`. It also contains a `Hits` object `degCreHits`. The `queryHits` of `degCreHits` index `DegGR` and the `subjectHits` index `CreGR`. `degCreHits` also has several metadata columns that include the key resuls of the algorithm. These include `assocProb` which is the probabilty of the association being true, and `assocProbFDR` which is the FDR of `assocProb` being greater than a null model based on TSS to CRE distance alone. Also, the results of the distance bin optimization heuristic, `binHeurOutputs`, and the input DEG significance threshold `alphaVal` are returned. ```{r demo DegCre result list} names(myDegCreResList) head(myDegCreResList$degCreHits) ``` To find associations driven by non-random changes in CRE signal, we subset the `degCreHits` by `assocProbFDR`. We can also further subset by `assocProb` to find those also more likely to confirm in a suitable validation experiment: ```{r demo subsetting} passFDR <- which(mcols(myDegCreResList$degCreHits)$assocProbFDR <= 0.05) passProb <- which(mcols(myDegCreResList$degCreHits)$assocProb >= 0.25) keepDegCreHits <- myDegCreResList$degCreHits[intersect(passFDR,passProb)] keepDegCreHits ``` We can then view the subsets of `DegGR` and `CreGR` that comprise the passing associations: ```{r demo GR subset} myDegCreResList$DegGR[queryHits(keepDegCreHits)] myDegCreResList$CreGR[subjectHits(keepDegCreHits)] ``` The choice of the threshold for DEG significance (DEG $\alpha$) is important in DegCre calculations. Accordingly, DegCre can be run over a set of thresholds to determine an optimal value. Here we select DEG $\alpha$ by finding the value that maximizes a normalized Precision-Recall Area Under the Curve (AUC): ```{r demo Alpha opt} alphaOptList <- optimizeAlphaDegCre(DegGR=subDegGR, DegP=subDegGR$pVal, DegLfc=subDegGR$logFC, CreGR=subCreGR, CreP=subCreGR$pVal, CreLfc=subCreGR$logFC, reqEffectDirConcord=TRUE, verbose=FALSE) alphaOptList$alphaPRMat bestAlphaId <- which.max(alphaOptList$alphaPRMat[,4]) bestDegCreResList <- alphaOptList$degCreResListsByAlpha[[bestAlphaId]] ``` ## Visualizing DegCre results With an optimal DegCre result obtained, we can plot the association probabilities versus binned association distance to see the relationship between the two: ```{r demo assoc dist plot, fig.wide=TRUE} par(mai=c(0.8,1.6,0.4,0.1)) par(mgp=c(1.7,0.5,0)) par(cex.axis=1) par(cex.lab=1.4) par(bty="l") par(lwd=2) binStats <- plotDegCreAssocProbVsDist(degCreResList=bestDegCreResList, assocProbFDRThresh=0.05) ``` The total number of associations passing a reasonable FDR (0.05 in the figure) drops fairly quickly as bin TSS to CRE distance increases (top panel). Similarly, the median association probability (black line in lower panel) drops with distance as well. The blue range is the interquartile range of the association probabilities. The red line in the above plot represents the null association probability, calculated as the probability if all CRE p-values in a given bin were identical. The binning of the associations by distance is also important to the operation of the DegCre algorithm. The choice of the bin size (number of associations per bin) is done by an optimization heuristic that tries a wide range of bin sizes and calculates the KS test statistic for each bin's CRE p-value distribution compared against the global (unbinned) CRE p-value distribution. The algorithm finds the smallest bin size for which the median KS-statistic is within a specified fraction, `fracMinKsMedianThresh`, of the range of that for all tested bin sizes. We visualize the results of this process: ```{r demo plot bin algorithm, fig.small=TRUE} par(mai=c(0.8,0.6,0.4,0.1)) par(mgp=c(1.7,0.5,0)) par(cex.axis=1) par(cex.lab=1.4) par(bty="l") par(lwd=1.5) plotDegCreBinHeuristic(degCreResList=bestDegCreResList) ``` The red dot indicates the chosen bin size run with the default value of 'fracMinKsMedianThresh = 0.2'. Another way to evaluate the performance of the DegCre on a data set is to make a Precision-Recall (PR) curve. To calculate these values, we define as a perfect set of associations, i.e. one that would generate a PR Area Under the Curve (AUC) of 1, as containing associations to all significant DEGs with uniform association probabilities of 1. Such a set will not result from real data, but it serves as a useful benchmark. Comparison of relative values of PR AUCs for DegCre results will be most informative. We plot the PR curve for the example data: ```{r demo plot PR curve, fig.small=TRUE} par(mai=c(0.8,0.6,0.4,0.1)) par(mgp=c(1.7,0.5,0)) par(cex.axis=1) par(cex.lab=1.4) par(bty="l") par(lwd=1.5) degCrePRAUC(degCreResList=bestDegCreResList) ``` The black line is the performance of the DegCre associations and the red region is values obtained from random shuffles. ## Obtaining the number of associations per DEG Since DegCre calculates association probabilities, one can estimate the number of expected associations per significant DEG by summing the total associations that pass a chosen FDR. DEGs with many expected associations are more likely to have true associations. This can be done with the `getExpectAssocPerDEG()` function: ```{r demo expect associations} expectAssocPerDegDf <- getExpectAssocPerDEG(degCreResList=bestDegCreResList, geneNameColName="GeneSymb", assocAlpha=0.05) ``` The result is a `DataFrame` with one entry per gene, sorted by the expected number of DEGs, `expectAssocs'. ```{r demo expect DataFrame} head(expectAssocPerDegDf) ``` This result can be plotted as a histogram by the function `plotExpectedAssocsPerDeg()`: ```{r demo plot expect hist, fig.small=TRUE} par(mai=c(0.8,0.6,0.4,0.1)) par(mgp=c(1.7,0.5,0)) par(cex.axis=1) par(cex.lab=1.4) par(bty="l") plotExpectedAssocsPerDeg(expectAssocPerDegDf=expectAssocPerDegDf) ``` The dashed line indicates the median expected associations per DEG. ## Browser views of DegCre results for genes or regions One of the genes with a high number of expected associations and also a known target of NR3C1 (Glucocorticoid receptor) is *ERRFI1*. Our test data `DexNR3C1` is derived from ChIP-seq of NR3C1 in cells treated with the powerful glucocorticoid, dexamethasone, so it is not surprising to find it with many expected associations. To visualize these associations, we can make a browser plot of the associations and underlying data with `plotBrowserDegCre()` which relies on the [plotgardener](https://bioconductor.org/packages/release/bioc/html/plotgardener.html) package. ```{r browser1, echo=TRUE, fig.wide=TRUE, message=FALSE,warning=FALSE} browserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList, geneName="ERRFI1", geneNameColName="GeneSymb", CreSignalName="NR3C1", panelTitleFontSize=8, geneLabelFontsize=10, plotXbegin=0.9, plotWidth=5) ``` We now want to zoom in closer to the TSS of *ERRFI1*. To do this we specify the region we want to plot as a `GRanges` of length 1: ```{r demo zoom browser} zoomGR <- GRanges(seqnames="chr1", ranges=IRanges(start=7900e3,end=8400e3)) ``` We then supply the `GRanges` to the `plotRegionGR` argument of `plotBrowserDegCre()`: ```{r plot browser2, echo=TRUE, fig.wide=TRUE, message=FALSE, warning=FALSE} zoomBrowserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList, plotRegionGR=zoomGR, geneNameColName="GeneSymb", CreSignalName="NR3C1", panelTitleFontSize=8, geneLabelFontsize=10, plotXbegin=0.9, plotWidth=5) ``` Since we did not specify `geneName` it is not highlighted. We can force highlights by supplying a `DataFrame` of gene names and colors to the `geneHighlightDf` argument. This `DataFrame` should conform to the requirements of `plotgardener::plotGenes()`. ```{r demo highlight browser} geneHighDf <- data.frame(gene=bestDegCreResList$DegGR$GeneSymb, col="gray") geneHighDf[which(geneHighDf$gene=="ERRFI1"),2] <- "black" ``` Then `geneHighDf` is supplied to the `geneHighlightDf` argument to produce: ```{r plot browser3, echo=TRUE, fig.wide=TRUE, message=FALSE, warning=FALSE} highLightBrowserOuts <- plotBrowserDegCre(degCreResList=bestDegCreResList, plotRegionGR=zoomGR, CreSignalName="NR3C1", geneNameColName="GeneSymb", geneHighlightDf=geneHighDf, panelTitleFontSize=8, geneLabelFontsize=10, plotXbegin=0.9, plotWidth=5) ``` ## Conversion of DegCre results to other formats The list of results that `DegCre()` produces has many useful features for downstream analysis. However, conversion to other formats will be useful in some situations. The function `convertdegCreResListToGInteraction()` will convert the DegCre list to a `GInteractions` object from the [InteractionSet](https://www.bioconductor.org/packages/release/bioc/html/InteractionSet.html) package: ```{r demo convert to GInter} degCreGInter <- convertdegCreResListToGInteraction(degCreResList=bestDegCreResList, assocAlpha=0.05) degCreGInter ``` The DegCre results list can can also be converted to a `DataFrame` in roughly a BEDPE format. This is useful for writing out DegCre results as text files: ```{r demo convert to DataFrame} degCreDf <- convertDegCreDataFrame(degCreResList=bestDegCreResList, assocAlpha=0.05) head(degCreDf) ``` ## Session Info ```{r} sessionInfo() ``` ## References