---
title: "An Introduction to goSorensen R-Package"
author:
- name: Jordi Ocaña
  email: jocana@ub.edu
  affiliation: Department of Genetics, Microbiology and Statistics, Statistics Section, University of Barcelona
- name: Pablo Flores
  email: p_flores@espoch.edu.ec
  affiliation: Escuela Superior Politécnica de Chimborazo (ESPOCH), Facultad de Ciencias, Carrera de Estadística.
package: goSorensen
abstract: >
  This vignette provides an introduction to goSorensen  package, which was built to determine equivalence between features lists. The method is based on the Sorensen–Dice index and the joint frequencies of GO term enrichment. 
  
  
  Starting from an introduction, which includes a brief explanation of the method to detect equivalence between feature lists, a description of the data to be used, and previous considerations about the necessary information of the species to be analyzed, this vignette explains how to: i) perform an equivalence test, ii) access to specific results of the equivalence test, iii) upgrade the obtained results and iv) perform results for more than two lists.
  
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{An introduction to equivalence test between feature lists using goSorensen.}
  %\VignetteEngine{knitr::rmarkdown}
  %%\VignetteKeywords{Annotation, GO, GeneSetEnrichment, Software, Microarray, Pathways, GeneExpression, MultipleComparison}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(dpi=25,fig.width=7)
```

```{r env, message = FALSE, warning = FALSE, echo = TRUE}
library(goSorensen)
```

# Introduction. {.unnumbered}

The goal of **goSorensen** is to implement the equivalence test introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J.(2022) "An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO node enrichment", BMC Bioinformatics, 2022 23:207.

Given two gene lists, $L_1$ and $L_2$, (the data) and a given set of $n$ Gene Ontology (GO) terms (the frame of reference for biological significance in these lists), the test is devoted to answer the following question (quite informally stated for the moment): The dissimilarity between the biological information in both lists, is it negligible? To measure the dissimilarity we use the Sorensen-Dice index:

$$
\hat d_S = \hat d(L_1,L_2) = \frac{2n_{11}}{2n_{11} + n_{10} + n_{01}}
$$

where $n_{11}$ corresponds to the number of GO terms (among the $n$ GO terms under consideration) which are enriched in both gene lists, $n_{10}$ corresponds to the GO terms enriched in $L_1$ but not in $L_2$ and $n_{01}$ the reverse, those enriched in $L_2$ but not in $L_1$. For notation completeness, $n_{00}$ would correspond to those GO terms not enriched in both lists; it is not considered by the Sorensen-Dice index but would be necessary in some computations. Obviously, $n = n_{11} + n_{10} + n_{01} + n_{00}$.

More precisely, the above problem can be restated as follows: Given a negligibility threshold $d_0$ for the Sorensen-Dice values, to decide negligibility corresponds to rejecting the null hypothesis $H_0:d_S \ge d_0$ in favour of the alternative $H_1: d_S < d_0$, where $d_S$ stands for the "true" value of the Sorensen-Dice dissimilarity ($L_1$ and $L_2$ are samples, and the own process of declaring enrichment of a GO term is random, so $\hat d_S = \hat d(L_1,L_2)$ is an estimate of $d_S$). Then, a bit more precise statement of the problem is "The dissimilarity between the biological information in two gene lists, is it negligible up to a degree $d_0$?" Where this information is expressed by means of the Sorensen-Dice dissimilarity measured on the degree of coincidence and non-coincidence in GO terms enrichment among a given set of GO terms.

For the moment, the reference set of GO terms can be only all those GO terms in a given level of one GO ontology, either BP, CC or MF.

## Installation. {.unnumbered}

goSorensen package must be installed with a working R version (\>=4.3.0). Installation could take a few minutes on a regular desktop or laptop. Package can be installed from Bioconductor, then it needs to be loaded using `library(goSorensen)`:

```{r, eval=FALSE}
if (!requireNamespace("goSorensen", quietly = TRUE)) {
    BiocManager::install("goSorensen")
}
library(goSorensen)
```

## Data. {.unnumbered}

The dataset used in this vignette, `allOncoGeneLists`, is based on the gene lists compiled at <http://www.bushmanlab.org/links/genelists>, a comprehensive set of gene lists related to cancer. The package `goSorensen` loads this dataset using `data(allOncoGeneLists)`:

```{r}
data("allOncoGeneLists")
```

`allOncoGeneLists`  is an object of class list, containing seven character vectors with the ENTREZ gene identifiers of a gene list related to cancer.

```{r}
length(allOncoGeneLists)
sapply(allOncoGeneLists, length)

# First 20 gene identifiers of gene lists Vogelstein and sanger:
allOncoGeneLists[["Vogelstein"]][1:20]
allOncoGeneLists[["sanger"]][1:20]
```

## Previous Information About the Species to Analyze.  {.unnumbered}
Before using `goSorensen,` the users must have adequate knowledge of the species they intend to focus their analysis on. The genomic annotation packages available in Bioconductor provide all the essential information about many species. 


For the specific case of this vignette, given that the analysis will be done in the human species, the `org.Hs.eg.db` package must be previously installed and activated as follows:

```{r, message = FALSE, warning = FALSE, eval = FALSE}
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    BiocManager::install("org.Hs.eg.db")
}
library(org.Hs.eg.db)
```

```{r, message = FALSE, warning = FALSE, eval = TRUE}
library(org.Hs.eg.db)
```

Actually, the `org.Hs.eg.db` package is automatically installed as a dependency on `goSorensen`, making its installation unnecessary. However, for any other species, the user must install the correspondence genome annotation for the species to analyze, as indicated in the above code. 


In addition, it is necessary to have a vector containing the IDs of the universe of genes associated with the species under study. The genomic annotation package provides an easy way to obtain this universe. The ENTREZ identifiers of the gene universe for humans, necessary for this vignette, is obtained as follows:

```{r, message = FALSE, warning = FALSE, echo = TRUE}
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")
```

In this same way, the identifiers of the gene universe can be obtained for any other species.


# Performing an Equivalence Test.

## Equivalence Test From a Contingency Table of Joint Enrichment.

Function `equivTestSorensen` performs the equivalence test. One possibility is to build first the joint enrichment contingency table using the function `buildEnrichTable` and then to perform the equivalence test:

```{r, warning=FALSE, message=FALSE}
# Build the enrichment contingency table between gene lists Vogelstein and 
# sanger for the MF ontology at GO level 5:
enrichTab <- buildEnrichTable(allOncoGeneLists[["Vogelstein"]],
                              allOncoGeneLists[["sanger"]],
                              geneUniverse = humanEntrezIDs, 
                              orgPackg = "org.Hs.eg.db",
                              onto = "MF", GOLevel = 5, 
                              listNames = c("Vogelstein", "sanger"))
enrichTab

# Equivalence test for an equivalence (or negligibility) limit 0.2857
testResult <- equivTestSorensen(enrichTab, d0 = 0.2857)
testResult
```

## Equivalence Test Directly From the Gene Lists.

Performing the test directly from the gene lists is also possible:

```{r}
equivTestSorensen(allOncoGeneLists[["Vogelstein"]],
                  allOncoGeneLists[["sanger"]], d0 = 0.2857,
                  geneUniverse = humanEntrezIDs, 
                  orgPackg = "org.Hs.eg.db",
                  onto = "MF", GOLevel = 5, 
                  listNames = c("Vogelstein", "sanger"))
```

The first option (building previously the contingency table) may be suitable to save computing time.

The above tests are based on the normal distribution. The next section shows how to use the bootstrap distribution, which is suitable for low enrichment levels.

## Using a Bootstrap Aproximation.

The following code computes the results of an equivalence test based on the bootstrap distribution:

```{r}
boot.testResult <- equivTestSorensen(enrichTab, d0 = 0.2857, boot = TRUE)
boot.testResult
```

As one can see, to obtain the bootstrap results, we only have used the argument `boot=TRUE`.

For low frequencies in the contingency table, bootstrap is a more conservative but preferable approach, with better type I error control.

## Isolated Computes. 

The outputs of the equivalence test, such as the Sorensen dissimilarity, the standard error of the test, and the confidence limit, can be computed individually. 

```{r}
# The Sorensen dissimilarity from the contingency table:
dSorensen(enrichTab)
# The Sorensen dissimilarity from the gene lists:
dSorensen(allOncoGeneLists[["Vogelstein"]], 
          allOncoGeneLists[["sanger"]],
          geneUniverse = humanEntrezIDs, 
          orgPackg = "org.Hs.eg.db",
          onto = "MF", GOLevel = 5, 
          listNames = c("Vogelstein", "sanger"))

# The standard error from the contingency table::
seSorensen(enrichTab)
# or from the gene lists:
seSorensen(allOncoGeneLists[["Vogelstein"]], 
           allOncoGeneLists[["sanger"]],
           geneUniverse = humanEntrezIDs, 
           orgPackg = "org.Hs.eg.db",
           onto = "MF", GOLevel = 5, 
           listNames = c("Vogelstein", "sanger"))

# Upper limit of the confidence interval from the contingency table:
duppSorensen(enrichTab)
duppSorensen(enrichTab, conf.level = 0.90)
duppSorensen(enrichTab, conf.level = 0.90, boot = TRUE)

# Upper limit of the confidence interval from the gene lists:
duppSorensen(allOncoGeneLists[["Vogelstein"]], 
             allOncoGeneLists[["sanger"]],    
             geneUniverse = humanEntrezIDs, 
             orgPackg = "org.Hs.eg.db", 
             onto = "MF", GOLevel = 5, 
             listNames = c("Vogelstein", "sanger"))
```


# Accessing to Specific Results.

To access specific results from the equivalence test above computed, one can use the functions of the type `get...`, as follows:

```{r}
getDissimilarity(testResult)
getSE(testResult)
getPvalue(testResult)
getTable(testResult)
getUpper(testResult)

# In the bootstrap approach, only these differ:
getPvalue(boot.testResult)
getUpper(boot.testResult)
# (Only available for bootstrap tests) efective number of bootstrap resamples:
getNboot(boot.testResult)
```

# Upgrading the Outputs.
After performing the equivalence test calculations, the results can be updated without redoing the calculations. This can be done by using different values for the irrelevance limit, level of significance, distribution, and number of resamples (in the case of Bootstrap) than the ones initially provided to the function. 

For example, the results saved in the `testResult` object were calculated using the arguments: `d0 = 0.2857`, `conf.level = 0.95` and `boot = FALSE` (using normal distribution). Now, we are going to upgrade the results with other arguments, as follows:

```{r}
upgrade(testResult, d0 = 0.4444, conf.level = 0.99, boot = TRUE)
```


# All Pairwise Computes

Since `goSorensen` has been built under the S3 object-oriented programming paradigm, it is not only possible to perform calculations for a couple of gene lists. If instead of entering two vectors with the lists $L_1, L_2$ of genes to be compared as input to the functions, an object of the class "list" is entered, which contains several vectors with $L_1, L_2, \ldots, L_s$ lists, then some specific `goSorensen` functions perform the calculations for all possible pairs of gene lists that are formed.


For instance, the object `allOncoGeneLists` is a list object that consists of seven vectors, each corresponding to a list of genes. For a given ontology and GO level, we can compute the dissimilarity between all possible pairs of comparisons among the lists. In this case, there are a total of 21 pairs. The computation is as follows:

```{r}
totalDiss <- dSorensen(allOncoGeneLists, onto = "MF", GOLevel = 5, 
          geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db")
round(totalDiss, 2)
```

Similarly, the following code performs all pairwise tests:

```{r}
allTests <- equivTestSorensen(allOncoGeneLists, d0 = 0.2857, 
                              onto = "MF", GOLevel = 5, 
                              geneUniverse = humanEntrezIDs, 
                              orgPackg = "org.Hs.eg.db")
getPvalue(allTests)
getDissimilarity(allTests, simplify = FALSE)
```


# Session information. {.unnumbered}

All software and respective versions used to produce this document are listed below.

```{r sessionInfo}
sessionInfo()
```

# Bibliography. {.unnumbered}

Flores, P., Salicrú, M., Sánchez-Pla, A. et al. An equivalence test between features lists, based on the Sorensen--Dice index and the joint frequencies of GO term enrichment. BMC Bioinformatics 23, 207 (2022). <https://doi.org/10.1186/s12859-022-04739-2>