---
title: "Continuous Data Analysis"
shorttitle: "Continuous Data"
package: knowYourCG
output: rmarkdown::html_vignette
fig_width: 6
fig_height: 5
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{"3. Continuous Variable Enrichment Analysis"}
  %\VignetteEncoding{UTF-8}
---

There are four testing scenarios, depending on the format of the query set and
the database sets. The table below shows which test applies to each
combination. `testEnrichment` performs Fisher's exact test and
`testEnrichmentSEA` performs Set Enrichment Analysis.

```{r ky9, echo = FALSE, results="asis"}
library(knitr)
df <- data.frame(
    c("Correlation-based","Set Enrichment Analysis"),
    c("Set Enrichment Analysis","Fisher's Exact Test")
)
colnames(df) <- c("Continuous Database Set", "Discrete Database Set")
rownames(df) <- c("Continuous Query", "Discrete Query")
kable(df, caption="Four knowYourCG Testing Scenarios")
```

# CONTINUOUS VARIABLE ENRICHMENT

The query may be a named continuous vector. In that case, either an enrichment
score is calculated (if the database is discrete) or a Spearman correlation is
calculated (if the database is also continuous). The two Set Enrichment
Analysis cases are shown below using biologically relevant examples.
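
For the case where both the query and the database are continuous, the idea of a rank-based correlation on the shared probes can be sketched in base R. This is a minimal illustration on simulated data, not the knowYourCG implementation: the probe IDs, the vectors `query_cont` and `db_cont`, and their values are all made up for this example.

```r
# Toy data: probe IDs and values are simulated for illustration only.
set.seed(1)
probes <- paste0("cg", sprintf("%04d", 1:200))

# Continuous query, e.g. per-probe methylation levels (simulated here).
query_cont <- setNames(runif(150), sample(probes, 150))

# Continuous database, e.g. per-probe CpG density (simulated here).
db_cont <- setNames(rnorm(180), sample(probes, 180))

# Restrict both vectors to the probes they share, in the same order.
shared <- intersect(names(query_cont), names(db_cont))

# Spearman (rank-based) correlation; cor.test also returns a p-value.
ct <- cor.test(query_cont[shared], db_cont[shared], method = "spearman")
c(n_shared = length(shared), rho = unname(ct$estimate), p.value = ct$p.value)
```

Matching by name before correlating matters because the query and the database generally cover different, differently ordered probe sets.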

To demonstrate this functionality, let's test a discrete query, the probes
designed to target TSSs, against two numeric database sets. One is a database
set for CpG density and the other is a database set for the distance from each
probe to the nearest transcription start site (TSS).

```{r ky21, run-test-data, echo=TRUE, eval=TRUE, message=FALSE}
library(knowYourCG)
query <- getDBs("KYCG.MM285.designGroup")[["TSS"]]
```

```{r ky22, echo=TRUE, eval=TRUE, message=FALSE}
library(sesameData)
# cache the numeric database sets before testing
sesameDataCache(data_titles = c("KYCG.MM285.seqContextN.20210630"))
res <- testEnrichmentSEA(query, "MM285.seqContextN")
main_stats <- c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")
res[, main_stats]
```

The `estimate` column here is the enrichment score.

> **NOTE:** A negative enrichment score suggests enrichment of the categorical
database with the higher values (in the numerical database). A positive
enrichment score represents enrichment with the smaller values. As expected, the
designed TSS CpGs are significantly enriched for smaller TSS distance and higher
CpG density.

Alternatively, one can test the enrichment of a continuous query against
discrete databases. Here we use the methylation levels from a sample as the
query and test them against the chromHMM chromatin states.

```{r ky23, warning=FALSE, eval=TRUE, message=FALSE}
library(sesame)
sesameDataCache(data_titles = c("MM285.1.SigDF"))
# named vector of methylation levels (beta values), one per probe
beta_values <- getBetas(sesameDataGet("MM285.1.SigDF"))
res <- testEnrichmentSEA(beta_values, "MM285.chromHMM")
main_stats <- c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")
res[, main_stats]
```

As expected, the chromatin states `Tss` and `Enh` have negative enrichment
scores, meaning these databases are associated with small values of the query
(DNA methylation level). In contrast, the `Het` and `Quies` states are
associated with high methylation levels.
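
To make the notion of an enrichment score more concrete, here is a toy Kolmogorov-Smirnov-style running sum in base R. It is only a sketch under stated assumptions: `running_es` is a hypothetical helper written for this illustration, the data are simulated, and neither the weighting nor the sign convention is claimed to match what `testEnrichmentSEA` computes internally.

```r
# Toy KS-style enrichment score (hypothetical helper, not part of knowYourCG).
# Probes are ranked from the highest to the lowest value; the running sum
# steps up at set members and down at non-members. The score is the extreme
# of the running sum with the largest absolute value.
running_es <- function(values, set) {
    values <- values[!is.na(values)]
    ranked <- names(sort(values, decreasing = TRUE))
    in_set <- ranked %in% set
    n_hit <- sum(in_set)
    n_miss <- length(ranked) - n_hit
    stopifnot(n_hit > 0, n_miss > 0)
    walk <- cumsum(ifelse(in_set, 1 / n_hit, -1 / n_miss))
    walk[which.max(abs(walk))]
}

# Simulated continuous query with made-up probe names.
set.seed(42)
values <- setNames(runif(1000), paste0("cg", 1:1000))

# Two artificial discrete sets: one holding low values, one holding high values.
low_set <- names(values)[values < 0.2]
high_set <- names(values)[values > 0.8]

es_low <- running_es(values, low_set)    # negative under this sketch's convention
es_high <- running_es(values, high_set)  # positive under this sketch's convention
c(low = es_low, high = es_high)
```

With this particular ranking direction, a set concentrated at small values yields a negative score and a set concentrated at large values yields a positive one. Reversing the sort order flips the sign, so the sign of a score is only interpretable together with the convention used to rank the values.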

# SESSION INFO

```{r}
sessionInfo()
```