Working with the Irrelevance-threshold Matrix of Dissimilarities.

library(goSorensen)

PRELIMINARIES.

Theoretical Framework of Reference

Given s gene lists, L1, L2, …, Ls, each element 𝔡ij of the matrix 𝔇 tries to quantify how different or distant are the lists Li and Lj in the following terms: “A higher equivalence threshold than this would cause these lists to be declared equivalent”. These dissimilarities are based on the irrelevance (equivalence) threshold that would make them statistically significant, in the sense of the equivalence tests introduced in Flores et al. (2022): “An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO term enrichment.” This concept was introduced in an attempt to give this dissimilarity a more inferential meaning, not merely descriptive as could be the Sorensen–Dice dissimilarity used directly.

But analyzing simultaneously s gene lists poses a multiple testing problem, so the algorithm to obtain the matrix should be expressed as:

  • Step 1. Establish h = s(s − 1)/2 equivalence hypothesis tests. Each test ETI (with I = 1, …, h) compares the pair of lists Li, Lj (with i, j = 1, 2, …, s, i ≠ j). Equivalence is declared when the null hypothesis of non-equivalence is rejected.

  • Step 2. According to a criterion of type I error control in multiple testing, let 𝔡h be the smallest irrelevance limit for which all h null hypothesis are rejected. The default adjusting criterion in goSorensen is the Holm’s method (“holm” option in the R function p.adjust). So, 𝔡h is computed as: $$\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P_{(I)}(\mathfrak{d}) \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).$$

  • Step 3. Use 𝔡h as the irrelevance limit dissimilarity between the lists Li, Lj corresponding to the last position in the vector of ordered p-values P(1)(𝔡h) ≤ … ≤ P(h)(𝔡h), i.e. 𝔡ij = 𝔡h.

  • Step 4. Set h = h − 1, excluding the comparison between the lists Li, Lj from Step 3, and go back to Step 2 until h = 0.

Data Used in this Vignette.

The dataset used in this vignette, allOncoGeneLists, is derived from gene lists compiled at http://www.bushmanlab.org/links/genelists, which contains an extensive collection of gene lists associated with cancer. The goSorensen package loads this dataset using the command data(allOncoGeneLists):

data("allOncoGeneLists")

allOncoGeneLists is an object of class list containing 7 elements. Each element is a character vector that holds the ENTREZ identifiers for the respective gene lists:

# name and length of the gene lists:
sapply(allOncoGeneLists, length)
        atlas      cangenes           cis miscellaneous        sanger 
          991           189           613           187           450 
   Vogelstein       waldman 
          419           426 

Obtaining the Irrelevance-threshold Matrix of Dissimilarities.

For a Specific Ontology and one GO level.

One method to compute the dissimilarity matrix 𝔇 for a given ontology and level (e.g., ontology BP, level 4) is to directly input the gene lists into the sorenThreshold function:

# Previously load the genomic annotation package for the studied specie:
library(org.Hs.eg.db)
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")

# Computing the irrelevance-threshold matrix of dissimilarities
dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, 
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db",
                            onto = "BP", 
                            GOLevel = 4,
                            trace = FALSE)
dissMatrx_BP4
               atlas    cis miscellaneous sanger Vogelstein
cis           0.7413                                       
miscellaneous 0.5496 0.6096                                
sanger        0.4955 0.6464        0.3476                  
Vogelstein    0.4027 0.6808        0.4467 0.2547           
waldman       0.3123 0.6707        0.2820 0.3659     0.3322

Alternatively, we can previously compute all the enrichment contingency tables for all 21 possible pairs of lists from allOncoGeneLists using the function buildEnrichTable. The result is an object of class tableList, from which we can obtain the dissimilarity matrix as follows:

# Previously compute the enrichment contingency tables:
cont_all_BP4 <- buildEnrichTable(allOncoGeneLists,
                                 geneUniverse = humanEntrezIDs, 
                                 orgPackg = "org.Hs.eg.db", 
                                 onto = "BP", 
                                 GOLevel = 4)
# Computing the irrelevance-threshold matrix of dissimilarities from the 
# enrichment contingency table "cont_all_BP4":
dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, 
                                trace = FALSE)
dissMatrx_BP4
               atlas    cis miscellaneous sanger Vogelstein
cis           0.7413                                       
miscellaneous 0.5496 0.6096                                
sanger        0.4955 0.6464        0.3476                  
Vogelstein    0.4027 0.6808        0.4467 0.2547           
waldman       0.3123 0.6707        0.2820 0.3659     0.3322

Both alternatives yield the same results; however, the latter approach (whenever possible, if all contingency tables are already available) is much faster, as it eliminates the internal time required to generate the contingency tables since they are provided as arguments to the function sorenThreshold.

The dissimilarity matrix dissMatrx_BP4 is obtained using equivalence tests based on the normal asymptotic distribution. If the user prefers the tests to be based on the bootstrap distribution, they should set the argument boot = TRUE (the default is FALSE), as shown below:

boot_dissMatrx_BP4 <- sorenThreshold(allOncoGeneLists, 
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db",
                            onto = "BP", 
                            GOLevel = 4, 
                            boot = TRUE, # use bootstrap distribution
                            trace = FALSE)
boot_dissMatrx_BP4
               atlas    cis miscellaneous sanger Vogelstein
cis           0.7401                                       
miscellaneous 0.4463 0.6091                                
sanger        0.4136 0.6465        0.3498                  
Vogelstein    0.3836 0.6799        0.3913 0.2572           
waldman       0.3153 0.6709        0.2848 0.4951     0.5491

Or, it is faster if we already have the enrichment contingency tables:

boot_dissMatrx_BP4 <- sorenThreshold(cont_all_BP4, 
                                     boot = TRUE, 
                                     trace = FALSE)
boot_dissMatrx_BP4
               atlas    cis miscellaneous sanger Vogelstein
cis           0.7401                                       
miscellaneous 0.4463 0.6091                                
sanger        0.4136 0.6465        0.3498                  
Vogelstein    0.3836 0.6799        0.3913 0.2572           
waldman       0.3153 0.6709        0.2848 0.4951     0.5491

An important and useful attribute of this dissimilarity matrix (whether calculated using the normal or bootstrap distribution) is attr(, "all2x2Tables"), which contains the enrichment contingency tables for all possible pairs of lists among the set of gene lists being compared. These contingency tables, in turn, include the matrix of GO terms enrichment in its attribute attr(, "enriched").

For more details about the matrix of GO terms enrichment, the enrichment contingency tables and the equivalence tests (including computation, outputs, attributes, etc.), refer to the vignette An Introduction to the goSorensen R Package.

NOTE:

To provide users with a quick visualization, the goSorensen package includes the objects cont_all_BP4 and dismatBP4, which can be accessed using data(cont_all_BP4) and data(dismatBP4).

Note that gene lists, GO terms, and Bioconductor may change over time. So, consider these objects only as illustrative examples, valid exclusively for the dataset allOncoGeneLists at a specific time. The current version of these results was generated with Bioconductor version 3.20. The same comment is applicable to other objects included in goSorensen for quick visualization, some of which are also described in this vignette.

For More than One Ontology and GO Level.

To calculate dissimilarity matrices for multiple ontologies and GO levels, we use the allSorenThreshold function. The first option is to provide the lists of genes to compare as the main input, as shown below:

# For example, for GO levels 3 to 10 and for the ontologies BP, CC and MF:
allDissMatrx <- allSorenThreshold(allOncoGeneLists,
                                  geneUniverse = humanEntrezIDs, 
                                  orgPackg = "org.Hs.eg.db", 
                                  ontos = c("BP", "CC", "MF"), 
                                  GOLevels = 3:10, 
                                  trace = FALSE)

Another option is previously computing all the enrichment contingency tables from allOncoGeneLists using the function allBuildEnrichTable. The result is an object of class allTableList, from which we can obtain the dissimilarity matrix as follows:

# Previously compute the enrichment contingency tables:
allContTabs <- allBuildEnrichTable(allOncoGeneLists,
                                   geneUniverse = humanEntrezIDs, 
                                   orgPackg = "org.Hs.eg.db", 
                                   ontos = c("BP", "CC", "MF"), 
                                   GOLevels = 3:10)

# Computing the irrelevance-threshold matrix of dissimilarities from the 
# enrichment contingency tables "allContTabs":
allDissMatrx <- allSorenThreshold(allContTabs,
                                  trace = FALSE)

Remember that, by default, these matrices are obtained from equivalence tests based on the normal distribution. If the user prefers to use the bootstrap distribution instead, they must set the argument boot = TRUE inside the allSorenThreshold function.

NOTE:

Given the large number of outcomes produced in allDissMatrx, displaying all of them is not feasible. However, for a quick visualization, the goSorensen package includes the object allDissMatrx, which can be accessed using data(allDissMatrx).

Representing a Dissimilarity Matrix in Statistical Graphs.

Dendrogram.

Below, we present a dendrogram of the dissimilarity matrix for the specific case of the allOncoGeneLists dataset, for the BP ontology and GO Level 4, which was computed in Section 2.1 of this vignette (object dissMatrx_BP4):

clust.threshold <- hclustThreshold(dissMatrx_BP4)
plot(clust.threshold)

Since the dissimilarities used to form the biplot are based on an inferential measure of significant equivalence, we can deduce that the closer two or more lists are (forming groups), the stronger the evidence suggesting that these lists are functionally similar, and the opposite for distant lists.

For example, in this particular dendrogram, two groups appear to be formed, as two pairs of lists are very close to each other. The first group consists of sanger and vogelstein, while the second group includes miscellaneous and waldman. So, it can then be inferred that the pairs of lists within each group provide stronger evidence of equivalence.

MDS-Biplot.

Multidimensional Scaling (MDS) transforms the initial dissimilarity matrix into Euclidean distances, preserving as much of the original variability as possible. In this case, we compute the first two dimensions, which capture the highest proportion of the original variability. Like in the dendrogram example, we use the dissimilarity matrix dissMatrx_BP4 computed in Section 2.1 of this vignette:

# multidimensional scaling analysis:
mds <- as.data.frame(cmdscale(dissMatrx_BP4, k = 2))

Below we plot the MDS-Biplot:

library(ggplot2)
library(ggrepel)
graph <- ggplot() +
 geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) +
 geom_text_repel(aes(mds[,1], mds[,2], label = attr(dissMatrx_BP4, "Labels")),
                 color = "black", size = 3) +
 xlab("Dim 1") +
 ylab("Dim 2") +
 theme_light()
graph

The biplot suggests that the closest lists (clusters) are Waldman-Vogelstein and Sanger-Miscellaneous; therefore, there is stronger evidence for equivalence among these lists.

Characterizing the axes of the biplot on which these groups are formed is of great interest, as it can help determine the biological functions associated with the formation of these groups. Specifically, it allows us to identify the GO terms involved in the evidence supporting the biological similarity between the compared gene lists. In the following section, we propose a 5-step procedure to characterize the axes of the biplot and identify the GO terms associated with the formation of these groups. This is a very preliminary version of a work in progress.

Characterizing MDS-Biplot Dimensions.

In this vignette, we will apply the 5-step procedure to Dimension 1. However, the technique would be the same if we were to apply it to Dimension 2 or any other dimension.

Step 1. Split the biplot axis:

Split the axis (in this case, Dimension 1) in three parts to identify the lists located at the extremes. For example, we could allocate 20% to each extreme, leaving 60% in the middle, as follows:

# Split the axis 20% to the left, 60% to the middle and 20% to the right:
prop <- c(0.2, 0.6, 0.2) 

# Sort according  dimension 1:
sorted <- mds[order(mds[, 1]), ] 

# Determine the range for dimension 1.
range <- sorted[, 1][c(1, nrow(mds))] 

# Find the cut-points to split the axis:
cutpoints <- (cumsum(prop)[1:2] * diff(range)) + range[1]

# Identify lists to the left:  
lleft <- rownames(sorted[sorted[, 1] < cutpoints[1], ])
lleft
[1] "cis"
# Identify lists to the right
lright <- rownames(sorted[sorted[, 1] > cutpoints[2], ])
lright
[1] "waldman"    "Vogelstein" "atlas"     

Visualizing in the Biplot:

graph +
  geom_vline(xintercept = cutpoints, color = "red", 
             linetype = "dashed", linewidth = 0.75)

The basic idea is that if the enrichment of a GO term differs between these two groups of lists at the extremes, we can deduce that this GO term discriminates the groups, thus explaining their formation.

Step 2. Obtain the matrix of GO terms enrichment:

Obtain the matrix of GO terms enrichment for the lists located at the extremes detected in Step 1. Remember, this matrix is stored as an attribute of the enrichment contingency tables. Therefore:

# enrichment contingency tables:
contTablesBP4 <- attr(dissMatrx_BP4, "all2x2Tables")

# matrix of GO term enrichment for the lists located at the extreme left
tableleft <- attr(contTablesBP4, "enriched")[, lleft, drop = FALSE]
tableleft
             cis
GO:0001649  TRUE
GO:0030278 FALSE
GO:0030279 FALSE
GO:0030282 FALSE
GO:0036075 FALSE
GO:0045778 FALSE
GO:0048755 FALSE
GO:0060688 FALSE
GO:0061138  TRUE
GO:0002263  TRUE
GO:0030168 FALSE
GO:0042118 FALSE
GO:0050866  TRUE
GO:0050867  TRUE
GO:0061900 FALSE
GO:0072537 FALSE
 [ reached getOption("max.print") -- omitted 473 rows ]
# matrix of GO term enrichment for the lists located at the extreme right
tableright <- attr(contTablesBP4, "enriched")[, lright, drop = FALSE]
tableright
           waldman Vogelstein atlas
GO:0001649    TRUE       TRUE  TRUE
GO:0030278    TRUE       TRUE  TRUE
GO:0030279    TRUE       TRUE FALSE
GO:0030282    TRUE       TRUE  TRUE
GO:0036075   FALSE      FALSE  TRUE
GO:0045778    TRUE      FALSE FALSE
GO:0048755   FALSE      FALSE  TRUE
GO:0060688    TRUE       TRUE  TRUE
GO:0061138    TRUE       TRUE  TRUE
GO:0002263    TRUE       TRUE  TRUE
GO:0030168    TRUE      FALSE  TRUE
GO:0042118    TRUE       TRUE FALSE
GO:0050866    TRUE       TRUE  TRUE
GO:0050867    TRUE       TRUE  TRUE
GO:0061900   FALSE      FALSE  TRUE
GO:0072537   FALSE      FALSE  TRUE
 [ reached getOption("max.print") -- omitted 473 rows ]

Actually, the contingency tables saved in the objects contTablesBP4 and cont_all_BP4 (from Section 2.1) are exactly the same.

Step 3. Compute means and variances:

Compute the means and variances for each of the GO terms detected in each extreme group. In other words, calculate the means and variances by rows of the GO term enrichment matrix for the lists at the extremes:

# function to compute mean and standard deviation:
mean_sd <- function(x){
  c("mean" = mean(x), "sd" = ifelse(length(x) == 1, 0, sd(x)))
}

# mean and sd of the lists to the extreme left:
lmnsd <- apply(tableleft, 1, mean_sd)

# mean and sd of the lists to the extreme right:
rmnsd <- apply(tableright, 1, mean_sd)

Step 4. Establish a statistic:

Establish a statistic to assess the degree of enrichment disparity between extreme groups for each GO term. This measure should consider that a GO term discriminates a dimension if:

  1. the lists show opposing patterns of enrichment in the GO term, indicated by a large difference between $| _L\overline{X}_{GO_j} - _R \overline{X}_{GO_j}|$ (L represents Left and R Rigth), and
  2. the behavior of the lists remains stable at each extreme, indicated by a small amount of LSGOj2+RSGOj2.

Considering these specific details, the following statistic, is suggested:

$$st_j = \dfrac{| _L\overline{X}_{GO_j} - _R \overline{X}_{GO_j}|}{\sqrt{\dfrac{_LS^2_{GO_j}}{l} + \dfrac{_RS^2_{GO_j}}{r} + \epsilon }}$$ with l the number of lists to the extreme left and r the number of lists to the extreme right. Using ϵ = 0.00000001 the statistic is computed as follows:

nl <- ncol(tableleft) 
nr <- ncol(tableright) 
t_stat <- abs(lmnsd[1, ] - rmnsd[1, ]) / 
  sqrt((((lmnsd[2, ] / nl) + (rmnsd[2, ] / nr))) + 0.00000001)

Step 5. Select the GO terms with the highest statistic:

GO terms with the highest statistic are those with a value closest to $\frac{1}{\sqrt{\epsilon}}$. These GO terms are the main contributors to the formation of clusters in the analyzed dimension. Therefore, they are strongly related to the equivalence between the compared gene lists. They are detected as follows:

result <- t_stat[t_stat == max(t_stat)]
result
GO:0030278 GO:0030282 GO:0060688 GO:0002367 GO:0002562 GO:0016445 GO:0002443 
     10000      10000      10000      10000      10000      10000      10000 
GO:0034101 GO:0002377 GO:0002700 GO:0002200 GO:0007281 GO:0007548 GO:0045137 
     10000      10000      10000      10000      10000      10000      10000 
GO:0048608 GO:0060512 GO:0060736 GO:0060740 GO:0060742 GO:0001666 GO:0006979 
     10000      10000      10000      10000      10000      10000      10000 
GO:0009408 GO:0045786 GO:0051321 GO:0009755 GO:0008366 GO:0009791 GO:0046660 
     10000      10000      10000      10000      10000      10000      10000 
GO:0046661 GO:0009880 GO:0007611 GO:0010463 GO:0033002 GO:0072111 GO:0009895 
     10000      10000      10000      10000      10000      10000      10000 
GO:0009612 GO:0009266 GO:0009314 GO:0070482 GO:0071214 GO:0003151 GO:0003179 
     10000      10000      10000      10000      10000      10000      10000 
GO:0003206 GO:0031069 GO:0060323 GO:0060411 GO:0060560 GO:0072028 GO:0045913 
     10000      10000      10000      10000      10000      10000      10000 
GO:0001558 
     10000 
 [ reached getOption("max.print") -- omitted 75 entries ]

In addition, we can identify the biological functions associated with the detected GO terms:

library(GO.db)
library(DT)
# Previous function to get the description of the identified GO terms:
get_go_description <- function(go_id) {
  go_term <- Term(GOTERM[[go_id]])
  return(go_term)
}

# GO terms description:
datatable(data.frame(GO_term = names(result),
                     Description = sapply(names(result), get_go_description, 
                                           USE.NAMES = TRUE)), 
          filter = "top", rownames = FALSE)

References

Flores, Pablo, Miquel Salicrú, Alex Sánchez-Pla, and Jordi Ocaña. 2022. “An Equivalence Test Between Features Lists, Based on the Sorensen–Dice Index and the Joint Frequencies of GO Term Enrichment.” BMC Bioinformatics 23 (1): 207.