--- title: "Working with the Irrelevance-threshold Matrix of Dissimilarities." author: - name: Pablo Flores email: p_flores@espoch.edu.ec affiliation: Escuela Superior Politécnica de Chimborazo (ESPOCH), Facultad de Ciencias, Carrera de Estadística. - name: Jordi Ocaña email: jocana@ateneu.ub.edu affiliation: Department of Genetics, Microbiology and Statistics, Statistics Section, University of Barcelona header-includes: -\usepackage{amsmath} -\usepackage{mathtools} package: goSorensen abstract: > This vignette illustrates the calculation, visualisation, and interpretation of a dissimilarity matrix between gene lists. It is based on the concept of "irrelevance threshold", the equivalence limit that would declare them equivalent. This matrix is derived from the statistical method implemented in `goSorensen` to test for equivalence between two or more gene lists, based on the Sorensen--Dice dissimilarity and joint GO term enrichment frequencies. We begin with a brief introduction to the meaning of the matrix, followed by an explanation of the process for calculating it. Then, we visualize the dissimilarity matrix using interpretable statistical graphics, such as a dendrogram and the biplot generated from a multidimensional scaling (MDS) analysis. Characterizing the dimensions of the MDS biplot allows us to identify the biological functions (GO terms) associated with the observed dissimilarities. output: BiocStyle::html_document: toc: false bibliography: references.bib vignette: > %\VignetteIndexEntry{Working with te Irrelevance-threshold Matrix of Dissimilarities.} %\VignetteEngine{knitr::rmarkdown} %%\VignetteKeywords{GO, dissimilarities, lists, genes} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r css, echo=FALSE, results='asis'} cat(" ") ``` ```{r js, echo=FALSE, results='asis'} cat(" ") ``` ```{r setup, include=FALSE} knitr::opts_chunk$set( comment = "" ) ``` ```{r env, message = FALSE, warning = FALSE, echo = TRUE} library(goSorensen) ``` # PRELIMINARIES. ## Theoretical Framework of Reference \begin{equation} \begin{aligned} & \begin{matrix} \hspace{0.6cm} L_1 & L_2 & \hspace{0.1cm} L_3 & \hspace{0.1cm} \ldots \hspace{0.1cm} & L_{s-1} \end{matrix} \\ \mathfrak{D} = \begin{matrix} L_2 \\ L_3 \\ L_4 \\ \vdots \\ L_s \end{matrix} & \begin{pmatrix} \mathfrak{d}_{21} & & & & \\ \mathfrak{d}_{31} & \mathfrak{d}_{32} & & & \\ \mathfrak{d}_{41} & \mathfrak{d}_{42} & \mathfrak{d}_{43} & & \\ \vdots & \vdots & \vdots & & \\ \mathfrak{d}_{s1} & \mathfrak{d}_{s2} & \mathfrak{d}_{s3} & \ldots & \mathfrak{d}_{s(s-1)} \\ \end{pmatrix} \end{aligned} \label{it-diss} \end{equation} Given $s$ gene lists, $L_1, L_2, \dots, L_s$, each element $\mathfrak{d}_{ij}$ of the matrix $\mathfrak{D}$ tries to quantify how different or distant are the lists $L_i$ and $L_j$ in the following terms: "A higher equivalence threshold than this would cause these lists to be declared equivalent". These dissimilarities are based on the irrelevance (equivalence) threshold that would make them statistically significant, in the sense of the equivalence tests introduced in @flores2022equivalence: "[**_An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO term enrichment_**](https://rdcu.be/cOISz)." This concept was introduced in an attempt to give this dissimilarity a more inferential meaning, not merely descriptive as could be the Sorensen--Dice dissimilarity used directly. But analyzing simultaneously $s$ gene lists poses a multiple testing problem, so the algorithm to obtain the matrix should be expressed as: - Step 1. Establish $h = s(s-1) / 2$ equivalence hypothesis tests. Each test $ET_I$ (with $I = 1, \ldots, h$) compares the pair of lists $L_{i}, L_{j}$ (with $i, j = 1, 2, \ldots, s, i \neq j$). Equivalence is declared when the null hypothesis of non-equivalence is rejected. - Step 2. According to a criterion of type I error control in multiple testing, let $\mathfrak{d}_h$ be the smallest irrelevance limit for which all $h$ null hypothesis are rejected. The default adjusting criterion in `goSorensen` is the Holm's method ("holm" option in the R function `p.adjust`). So, $\mathfrak{d}_h$ is computed as: $$\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P_{(I)}(\mathfrak{d}) \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).$$ - Step 3. Use $\mathfrak{d}_h$ as the irrelevance limit dissimilarity between the lists $L_i, L_j$ corresponding to the last position in the vector of ordered p-values $P_{(1)}(\mathfrak{d}_{h}) \le \ldots \le P_{(h)}(\mathfrak{d}_{h})$, i.e. $\mathfrak{d}_{ij} = \mathfrak{d}_{h}.$ - Step 4. Set $h = h - 1$, excluding the comparison between the lists $L_i, L_j$ from Step $3$, and go back to Step $2$ until $h=0$. ## Data Used in this Vignette. The dataset used in this vignette, `allOncoGeneLists`, is derived from gene lists compiled at , which contains an extensive collection of gene lists associated with cancer. The `goSorensen` package loads this dataset using the command `data(allOncoGeneLists)`: ```{r} data("allOncoGeneLists") ``` `allOncoGeneLists` is an object of class list containing 7 elements. Each element is a character vector that holds the ENTREZ identifiers for the respective gene lists: ```{r} # name and length of the gene lists: sapply(allOncoGeneLists, length) ``` # Obtaining the Irrelevance-threshold Matrix of Dissimilarities. ## For a Specific Ontology and one GO level. One method to compute the dissimilarity matrix $\mathfrak{D}$ for a given ontology and level (e.g., ontology BP, level 4) is to directly input the gene lists into the `sorenThreshold` function: ```{r, warning=FALSE, message=FALSE, eval=FALSE} # Previously load the genomic annotation package for the studied specie: library(org.Hs.eg.db) humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID") # Computing the irrelevance-threshold matrix of dissimilarities dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", onto = "BP", GOLevel = 4, trace = FALSE) dissMatrx_BP4 ``` ```{r, echo=FALSE} options(digits = 4) data("dissMatrx_BP4") data("cont_all_BP4") dissMatrx_BP4 ``` Alternatively, we can previously compute all the enrichment contingency tables for all 21 possible pairs of lists from `allOncoGeneLists` using the function `buildEnrichTable.` The result is an object of class `tableList`, from which we can obtain the dissimilarity matrix as follows: ```{r, warning=FALSE, message=FALSE, comment=NA, eval=FALSE} # Previously compute the enrichment contingency tables: cont_all_BP4 <- buildEnrichTable(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", onto = "BP", GOLevel = 4) ``` ```{r} # Computing the irrelevance-threshold matrix of dissimilarities from the # enrichment contingency table "cont_all_BP4": dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, trace = FALSE) dissMatrx_BP4 ``` Both alternatives yield the same results; however, the latter approach (whenever possible, if all contingency tables are already available) is much faster, as it eliminates the internal time required to generate the contingency tables since they are provided as arguments to the function `sorenThreshold`. The dissimilarity matrix `dissMatrx_BP4` is obtained using equivalence tests based on the normal asymptotic distribution. If the user prefers the tests to be based on the bootstrap distribution, they should set the argument `boot = TRUE` (the default is `FALSE`), as shown below: ```{r, eval=FALSE} boot_dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", onto = "BP", GOLevel = 4, boot = TRUE, # use bootstrap distribution trace = FALSE) boot_dissMatrx_BP4 ``` ```{r, echo=FALSE} sorenThreshold(cont_all_BP4, boot = TRUE, trace = FALSE) ``` Or, it is faster if we already have the enrichment contingency tables: ```{r} boot_dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, boot = TRUE, trace = FALSE) boot_dissMatrx_BP4 ``` An important and useful attribute of this dissimilarity matrix (whether calculated using the normal or bootstrap distribution) is `attr(, "all2x2Tables")`, which contains the enrichment contingency tables for all possible pairs of lists among the set of gene lists being compared. These contingency tables, in turn, include the matrix of GO terms enrichment in its attribute `attr(, "enriched")`. For more details about the matrix of GO terms enrichment, the enrichment contingency tables and the equivalence tests (including computation, outputs, attributes, etc.), refer to the vignette [An Introduction to the goSorensen R Package](goSorensen_Introduction.html). #### NOTE: {.unnumbered} To provide users with a quick visualization, the `goSorensen` package includes the objects `cont_all_BP4` and `dismatBP4`, which can be accessed using `data(cont_all_BP4)` and `data(dismatBP4)`. Note that gene lists, GO terms, and Bioconductor may change over time. So, consider these objects only as illustrative examples, valid exclusively for the dataset `allOncoGeneLists` at a specific time. The current version of these results was generated with Bioconductor version 3.20. The same comment is applicable to other objects included in `goSorensen` for quick visualization, some of which are also described in this vignette. ## For More than One Ontology and GO Level. To calculate dissimilarity matrices for multiple ontologies and GO levels, we use the `allSorenThreshold` function. The first option is to provide the lists of genes to compare as the main input, as shown below: ```{r, warning=FALSE, message=FALSE, comment=NA, eval=FALSE} # For example, for GO levels 3 to 10 and for the ontologies BP, CC and MF: allDissMatrx <- allSorenThreshold(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC", "MF"), GOLevels = 3:10, trace = FALSE) ``` Another option is previously computing all the enrichment contingency tables from `allOncoGeneLists` using the function `allBuildEnrichTable.` The result is an object of class `allTableList`, from which we can obtain the dissimilarity matrix as follows: ```{r, warning=FALSE, message=FALSE, comment=NA, eval=FALSE} # Previously compute the enrichment contingency tables: allContTabs <- allBuildEnrichTable(allOncoGeneLists, geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC", "MF"), GOLevels = 3:10) # Computing the irrelevance-threshold matrix of dissimilarities from the # enrichment contingency tables "allContTabs": allDissMatrx <- allSorenThreshold(allContTabs, trace = FALSE) ``` Remember that, by default, these matrices are obtained from equivalence tests based on the normal distribution. If the user prefers to use the bootstrap distribution instead, they must set the argument `boot = TRUE` inside the `allSorenThreshold` function. #### NOTE: {.unnumbered} Given the large number of outcomes produced in `allDissMatrx`, displaying all of them is not feasible. However, for a quick visualization, the `goSorensen` package includes the object `allDissMatrx`, which can be accessed using `data(allDissMatrx)`. # Representing a Dissimilarity Matrix in Statistical Graphs. ## Dendrogram. Below, we present a dendrogram of the dissimilarity matrix for the specific case of the `allOncoGeneLists` dataset, for the BP ontology and GO Level 4, which was computed in Section 2.1 of this vignette (object `dissMatrx_BP4`): ```{r, warning=FALSE, message=FALSE} clust.threshold <- hclustThreshold(dissMatrx_BP4) plot(clust.threshold) ``` Since the dissimilarities used to form the biplot are based on an inferential measure of significant equivalence, we can deduce that the closer two or more lists are (forming groups), the stronger the evidence suggesting that these lists are functionally similar, and the opposite for distant lists. For example, in this particular dendrogram, two groups appear to be formed, as two pairs of lists are very close to each other. The first group consists of sanger and vogelstein, while the second group includes miscellaneous and waldman. So, it can then be inferred that the pairs of lists within each group provide stronger evidence of equivalence. ## MDS-Biplot. Multidimensional Scaling (MDS) transforms the initial dissimilarity matrix into Euclidean distances, preserving as much of the original variability as possible. In this case, we compute the first two dimensions, which capture the highest proportion of the original variability. Like in the dendrogram example, we use the dissimilarity matrix `dissMatrx_BP4` computed in Section 2.1 of this vignette: ```{r, warning=FALSE, message=FALSE} # multidimensional scaling analysis: mds <- as.data.frame(cmdscale(dissMatrx_BP4, k = 2)) ``` Below we plot the MDS-Biplot: ```{r, warning=FALSE, message=FALSE, fig.align='left', fig.height=3, fig.width=5} library(ggplot2) library(ggrepel) graph <- ggplot() + geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) + geom_text_repel(aes(mds[,1], mds[,2], label = attr(dissMatrx_BP4, "Labels")), color = "black", size = 3) + xlab("Dim 1") + ylab("Dim 2") + theme_light() graph ``` The biplot suggests that the closest lists (clusters) are Waldman-Vogelstein and Sanger-Miscellaneous; therefore, there is stronger evidence for equivalence among these lists. Characterizing the axes of the biplot on which these groups are formed is of great interest, as it can help determine the biological functions associated with the formation of these groups. Specifically, it allows us to identify the GO terms involved in the evidence supporting the biological similarity between the compared gene lists. In the following section, we propose a 5-step procedure to characterize the axes of the biplot and identify the GO terms associated with the formation of these groups. This is a very preliminary version of a work in progress. # Characterizing MDS-Biplot Dimensions. In this vignette, we will apply the 5-step procedure to Dimension 1. However, the technique would be the same if we were to apply it to Dimension 2 or any other dimension. ## Step 1. Split the biplot axis: Split the axis (in this case, Dimension 1) in three parts to identify the lists located at the extremes. For example, we could allocate 20% to each extreme, leaving 60% in the middle, as follows: ```{r} # Split the axis 20% to the left, 60% to the middle and 20% to the right: prop <- c(0.2, 0.6, 0.2) # Sort according dimension 1: sorted <- mds[order(mds[, 1]), ] # Determine the range for dimension 1. range <- sorted[, 1][c(1, nrow(mds))] # Find the cut-points to split the axis: cutpoints <- (cumsum(prop)[1:2] * diff(range)) + range[1] # Identify lists to the left: lleft <- rownames(sorted[sorted[, 1] < cutpoints[1], ]) lleft # Identify lists to the right lright <- rownames(sorted[sorted[, 1] > cutpoints[2], ]) lright ``` Visualizing in the Biplot: ```{r, warning=FALSE, message=FALSE, fig.align='left', fig.height=3, fig.width=5} graph + geom_vline(xintercept = cutpoints, color = "red", linetype = "dashed", linewidth = 0.75) ``` The basic idea is that if the enrichment of a GO term differs between these two groups of lists at the extremes, we can deduce that this GO term discriminates the groups, thus explaining their formation. ## Step 2. Obtain the matrix of GO terms enrichment: Obtain the matrix of GO terms enrichment for the lists located at the extremes detected in Step 1. Remember, this matrix is stored as an attribute of the enrichment contingency tables. Therefore: ```{r, echo=FALSE} options(max.print = 16) ``` ```{r} # enrichment contingency tables: contTablesBP4 <- attr(dissMatrx_BP4, "all2x2Tables") # matrix of GO term enrichment for the lists located at the extreme left tableleft <- attr(contTablesBP4, "enriched")[, lleft, drop = FALSE] tableleft ``` ```{r, echo=FALSE} options(max.print = 50) ``` ```{r} # matrix of GO term enrichment for the lists located at the extreme right tableright <- attr(contTablesBP4, "enriched")[, lright, drop = FALSE] tableright ``` Actually, the contingency tables saved in the objects `contTablesBP4` and `cont_all_BP4` (from Section 2.1) are exactly the same. ## Step 3. Compute means and variances: Compute the means and variances for each of the GO terms detected in each extreme group. In other words, calculate the means and variances by rows of the GO term enrichment matrix for the lists at the extremes: ```{r, warning=FALSE, message=FALSE} # function to compute mean and standard deviation: mean_sd <- function(x){ c("mean" = mean(x), "sd" = ifelse(length(x) == 1, 0, sd(x))) } # mean and sd of the lists to the extreme left: lmnsd <- apply(tableleft, 1, mean_sd) # mean and sd of the lists to the extreme right: rmnsd <- apply(tableright, 1, mean_sd) ``` ## Step 4. Establish a statistic: Establish a statistic to assess the degree of enrichment disparity between extreme groups for each GO term. This measure should consider that a GO term discriminates a dimension if: i) the lists show opposing patterns of enrichment in the GO term, indicated by a large difference between $| _L\overline{X}_{GO_j} - _R \overline{X}_{GO_j}|$ ($L$ represents Left and $R$ Rigth), and ii) the behavior of the lists remains stable at each extreme, indicated by a small amount of $_LS^2_{GO_j} + _RS^2_{GO_j}$. Considering these specific details, the following statistic, is suggested: $$st_j = \dfrac{| _L\overline{X}_{GO_j} - _R \overline{X}_{GO_j}|}{\sqrt{\dfrac{_LS^2_{GO_j}}{l} + \dfrac{_RS^2_{GO_j}}{r} + \epsilon }}$$ with $l$ the number of lists to the extreme left and $r$ the number of lists to the extreme right. Using $\epsilon= 0.00000001$ the statistic is computed as follows: ```{r, warning=FALSE, message=FALSE} nl <- ncol(tableleft) nr <- ncol(tableright) t_stat <- abs(lmnsd[1, ] - rmnsd[1, ]) / sqrt((((lmnsd[2, ] / nl) + (rmnsd[2, ] / nr))) + 0.00000001) ``` ## Step 5. Select the GO terms with the highest statistic: GO terms with the highest statistic are those with a value closest to $\frac{1}{\sqrt{\epsilon}}$. These GO terms are the main contributors to the formation of clusters in the analyzed dimension. Therefore, they are strongly related to the equivalence between the compared gene lists. They are detected as follows: ```{r, warning=FALSE, message=FALSE} result <- t_stat[t_stat == max(t_stat)] result ``` In addition, we can identify the biological functions associated with the detected GO terms: ```{r, warning=FALSE, message=FALSE} library(GO.db) library(DT) # Previous function to get the description of the identified GO terms: get_go_description <- function(go_id) { go_term <- Term(GOTERM[[go_id]]) return(go_term) } # GO terms description: datatable(data.frame(GO_term = names(result), Description = sapply(names(result), get_go_description, USE.NAMES = TRUE)), filter = "top", rownames = FALSE) ``` # References {.unnumbered}