The majority of functions of goSorensen are devoted to implementing a statistical hypothesis test to detect equivalence between two or more gene lists. This method was introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J.(2022) “An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO node enrichment”, BMC Bioinformatics, 2022 23:207.
Following this method, we develop the irrelevance-threshold matrix of dissimilarities 𝔇:
The goal is to determine a dissimilarity measure 𝔡ij for comparing two lists (Li, Lj). This dissimilarity is not only a descriptive measure, but it is based on the irrelevance threshold determining whether two lists (Li, Lj) are equivalent, which ensures that the dissimilarity 𝔡ij is directly associated with the statistically significant equivalence declaration for a specific ontology (BP, MF, or CC) and GO level.
The dissimilarity measures between all the compared s feature lists can be collectively obtained in the symmetric matrix 𝔇 using the following algorithm:
Step 1. Establish h = s(s − 1) equivalence hypothesis tests. Each test ETI (with I = 1, …, h) compares some specific pair of lists Li, Lj (with i, j = 1, 2, …, s).
Step 2. Let 𝔡h be the smallest irrelevance limit that makes all the null hypotheses corresponding to the h equivalence tests be rejected (All the s lists are equivalent): $$\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P_v(\mathfrak{d})_{(I)} \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).$$
Step 3. Obtain 𝔡h and use it as the irrelevance limit between the lists Li, Lj corresponding to the last position in a vector of ordered p-values Pv(𝔡h)(I), i.e. 𝔡ij = 𝔡h.
Step 4. Set h = h − 1, excluding the comparison between the lists Li, Lj from Step 3, and go back to Step 2 until h = 0.
We use a specific database to provide a clear and precise explanation
of the computation, visualization, and interpretation of the 𝔇 matrix. Specifically, we will employ the
set of PBTs
lists, which is available as a resource in the
goSorensen package. Access to this database can be
obtained as follows:
PBTs
is an object of class list, containing 14 character
vectors with the ENTREZ gene identifiers of a gene list related to cell
graft rejection:
ABMR_RATs BAT CT1 ENDAT GRIT2 GRIT3 IRITD1 IRITD3
108 105 948 118 131 205 181 321
IRITD5 KT1 LT1 LT3 Rej_RATs TCMR_RATs
215 562 913 693 107 121
One method to compute the dissimilarity matrix 𝔇 for a given ontology and level (e.g.,
ontology BP, level 5) is to directly input the gene lists into the
sorenThreshold
function:
# Previously load the genomic annotation package for the studied specie:
library(org.Hs.eg.db)
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")
# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(pbtGeneLists, onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
Wed Dec 18 04:45:29 2024 Building all enrichment contingency tables...
Wed Dec 18 04:47:33 2024 Computing all threshold equivalence distances...
ABMR_RATs BAT CT1 ENDAT GRIT2 GRIT3 IRITD3 IRITD5 KT1 LT1 LT3
BAT 0.98
CT1 1.00 1.00
ENDAT 0.90 1.00 1.00
GRIT2 0.69 0.98 1.00 0.98
GRIT3 0.45 0.98 1.00 0.96 0.37
IRITD3 0.91 1.00 1.00 0.76 0.94 0.94
IRITD5 0.87 1.00 1.00 0.88 0.89 0.88 0.91
KT1 1.00 1.00 0.81 0.95 1.00 1.00 1.00 1.00
LT1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
LT3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.30
Rej_RATs 0.34 0.99 1.00 0.95 0.56 0.40 0.93 0.87 1.00 1.00 1.00
TCMR_RATs 0.50 0.96 1.00 0.97 0.62 0.44 0.91 0.86 1.00 1.00 1.00
Rej_RATs
BAT
CT1
ENDAT
GRIT2
GRIT3
IRITD3
IRITD5
KT1
LT1
LT3
Rej_RATs
TCMR_RATs 0.42
Each value in this matrix represents dissimilarities determined by the equivalence threshold which determines whether the compared lists are equivalent. In addition, it is essential to consider that when the Sorensen dissimilarity value is irrelevantly equal to zero (equivalent to zero), it suggests biological similarity. Based on this, we can deduce that the smallest values (near zero) in the matrix 𝔇 indicate equivalence between lists, whereas the largest values (close to 1) indicate that the lists cannot be considered equivalent.
Another way is to compute all the pairwise 2 × 2 enrichment contingency tables previously. From this table, we can obtain the dissimilarity matrix as follows:
# Previously compute the 2x2 contingency tables:
ctableBP5 <- buildEnrichTable(pbtGeneLists, onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(ctableBP5)
The result is exactly the same matrix od dissimilarities above showed.
Similarly, we can directly perform the computation using the gene lists dataset:
allDismat <- allSorenThreshold (pbtGeneLists, geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"),
GOLevels = 4:7)
Or, previously computing all the pairwise contingency tables for the ontologies (BP, CC, MF) and GO levels defined by the user:
allTabs <- allBuildEnrichTable(pbtGeneLists, geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"),
GOLevels = 4:7)
allDismat <- allSorenThreshold(allTabs)
Given the substantial quantity of outcomes obtained, exhibiting all
of these results is not feasible. The result is a collection of
dissimilarity matrices for the different ontologies and GO levels
defined in the ontos
and GOLevels
arguments.
Suppose the ontos
and GOLevels
arguments
are not provided. In that case, the allSorenThreshold
function automatically computes the dissimilarity matrices for all three
ontologies (BP, CC and MF) and GO levels ranging from 3 to 10.
Below, we graph a dendrogram based on the irrelevance-threshold
matrix of dissimilarities for the particular case of the MF ontology, GO
level 5, in our PBTs dataset. The dissimilarity matrix for this
particular case was computed above and it is saved in the object
dismatBP5
.
The dendrogram displays the lists of genes that are most similar and most distant. For instance, when considering a dissimilarity of 0.6, chosen arbitrarily, we can observe the formation of three groups. Within each group, the lists can be regarded as similar but different from those in other groups. There are three groups of lists that can be considered equivalent. The first group includes LT3 and LT1. The second group contains GRIT2 and GRIT3. The third group comprises the lists TCMR_RATs, ABMR_RATs, and Rej_RATs.
We initiate the process by performing a multidimensional scaling MDS analysis. MDS transforms the initial dissimilarity matrix into Euclidean distances while maintaining the maximum possible variation of the original dissimilarities. In this instance, we compute the first two dimensions, which account for the most relevant proportion of the original variability.
V1 V2
ABMR_RATs 0.37749638 0.05316328
BAT -0.18271776 -0.09899928
CT1 -0.28485420 -0.26753828
ENDAT -0.17334937 -0.31109043
GRIT2 0.34624136 0.08875324
GRIT3 0.43263735 0.10879125
IRITD3 -0.12025074 -0.26261937
IRITD5 -0.01466796 -0.15346417
KT1 -0.29190006 -0.29352922
LT1 -0.44970672 0.48333325
LT3 -0.44970672 0.48333325
Rej_RATs 0.42374314 0.09174890
TCMR_RATs 0.38703531 0.07811760
Below we plot the biplot:
library(ggplot2)
library(ggrepel)
graph <- ggplot() +
geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) +
geom_text_repel(aes(mds[,1], mds[,2], label = attr(dismatBP5, "Labels")),
color = "black", size = 3) +
xlab("Dim 1") +
ylab("Dim 2") +
theme_light()
graph
The MDS-Biplot suggests that the formation of groups is consistent with what was observed in the dendrogram. It is important to remember that the distances observed between lists in the biplot are defined based on the dissimilarity threshold that establishes the equivalence of the compared lists. For instance, the proximity between the Rej_RATs and GRIT3 lists implies that both are equivalent.
However, it is more interesting to identify the GO terms most strongly associated with the dimensions generated in the biplot rather than just determining if the lists are equivalent. These GO terms elucidate the biological functions associated with the observed distances between lists, explaining the formation of groups and the biological reasons for detecting equivalences between lists. In the next section, we develop a 5-step procedure for identifying these GO terms:
The following 5-step procedure helps identify the GO terms that biologically elucidate the creation of groups in each dimension of the Biplot. In this specific instance, we will execute the technique on Dimension 1. However, the approach remains identical if we were to explain Dimension 2.
# Split the axis 20% to the left, 60% to the middle and 20% to the right:
prop <- c(0.2, 0.6, 0.2)
# Sort according dimension 1:
sorted <- mds[order(mds[, 1]), ]
# Determine the range for dimension 1.
range <- sorted[, 1][c(1, nrow(mds))]
# Find the cutpoints to split the axis:
cutpoints <- cumsum(prop) * diff(range) + range[1]
cutpoints <- cutpoints[-length(cutpoints)]
# Identify lists to the left:
lleft <- rownames(sorted[sorted[, 1] < cutpoints[1], ])
# Identify lists to the right
lright <- rownames(sorted[sorted[, 1] > cutpoints[2], ])
lleft
[1] "LT1" "LT3" "KT1" "CT1"
[1] "GRIT2" "ABMR_RATs" "TCMR_RATs" "Rej_RATs" "GRIT3"
The basic idea is that if the enrichment of a GO term in these two groups of lists is different, then we can assume that this GO term discriminates the groups of lists, consequently explaining the formation of groups in that particular dimension.
tableleft <- enrichedIn(pbtGeneLists[lleft], onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
tableright <- enrichedIn(pbtGeneLists[lright], onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
Due to the large size of the results, we only show the first 15 rows of each matrix:
LT1 LT3 KT1 CT1
GO:0000055 FALSE FALSE FALSE FALSE
GO:0000056 FALSE FALSE FALSE FALSE
GO:2000200 FALSE FALSE FALSE FALSE
GO:2000201 FALSE FALSE FALSE FALSE
GO:2000202 FALSE FALSE FALSE FALSE
GO:0051229 FALSE FALSE FALSE FALSE
GO:0051232 FALSE FALSE FALSE FALSE
GO:0090306 FALSE FALSE FALSE FALSE
GO:1990395 FALSE FALSE FALSE FALSE
GO:0001578 TRUE TRUE FALSE FALSE
GO:0007020 FALSE FALSE FALSE FALSE
GO:0007051 FALSE FALSE FALSE FALSE
GO:0010245 FALSE FALSE FALSE FALSE
GO:0021815 FALSE FALSE FALSE FALSE
GO:0030951 FALSE FALSE FALSE FALSE
GRIT2 ABMR_RATs TCMR_RATs Rej_RATs GRIT3
GO:0000055 FALSE FALSE FALSE FALSE FALSE
GO:0000056 FALSE FALSE FALSE FALSE FALSE
GO:2000200 FALSE FALSE FALSE FALSE FALSE
GO:2000201 FALSE FALSE FALSE FALSE FALSE
GO:2000202 FALSE FALSE FALSE FALSE FALSE
GO:0051229 FALSE FALSE FALSE FALSE FALSE
GO:0051232 FALSE FALSE FALSE FALSE FALSE
GO:0090306 FALSE FALSE FALSE FALSE FALSE
GO:1990395 FALSE FALSE FALSE FALSE FALSE
GO:0001578 FALSE FALSE FALSE FALSE FALSE
GO:0007020 FALSE FALSE FALSE FALSE FALSE
GO:0007051 FALSE FALSE FALSE FALSE FALSE
GO:0010245 FALSE FALSE FALSE FALSE FALSE
GO:0021815 FALSE FALSE FALSE FALSE FALSE
GO:0030951 FALSE FALSE FALSE FALSE FALSE
lmnsd <- apply(tableleft, 1,
function(x){c("meanLeft" = mean(x), "sdLeft" = sd(x))})
rmnsd <- apply(tableright, 1,
function(x){c("meanRight" = mean(x), "sdRight" = sd(x))})
Considering these specific details, the following statistic, is suggested:
$$st_j = \dfrac{| _L\overline{X}_{GO_j} - _R \overline{X}_{GO_j}|}{\sqrt{\dfrac{_LS^2_{GO_j}}{l} + \dfrac{_RS^2_{GO_j}}{r} + \epsilon }}$$ with l the number of lists to the extreme left and r the number of lists to the extreme right. Using ϵ = 0.00000001 the statistic is computed as follows:
nl <- ncol(tableleft)
nr <- ncol(tableright)
stat <- abs(lmnsd[1, ] - rmnsd[1, ]) /
sqrt((((lmnsd[2, ] / nl) + (rmnsd[2, ] / nr))) + 0.00000001)
GO:0002718 GO:0032649 GO:0032729 GO:0001910 GO:0042267 GO:0002460 GO:0002819
10000 10000 10000 10000 10000 10000 10000
GO:0002821 GO:0002366 GO:0002285 GO:0001909 GO:0002449 GO:0002703 GO:0002704
10000 10000 10000 10000 10000 10000 10000
GO:0002705 GO:0002685 GO:0002697 GO:1903706 GO:0002698 GO:0050777 GO:0002696
10000 10000 10000 10000 10000 10000 10000
GO:0002699 GO:0051250 GO:0051251 GO:0002700 GO:0002702 GO:0002768 GO:0002832
10000 10000 10000 10000 10000 10000 10000
GO:0002833 GO:0045088 GO:0045824 GO:0045089 GO:0060326 GO:0031348 GO:0070665
10000 10000 10000 10000 10000 10000 10000
GO:0051607 GO:0002237 GO:0050900 GO:0002478 GO:0045785 GO:0032102 GO:0050727
10000 10000 10000 10000 10000 10000 10000
GO:0032496 GO:0070663 GO:0002228 GO:0002274 GO:0022409 GO:0030098 GO:0046651
10000 10000 10000 10000 10000 10000 10000
GO:0050867 GO:0031341 GO:0032943 GO:0071219 GO:0007159 GO:1902105
10000 10000 10000 10000 10000 10000
Finally, we can identify the specific biological functions associated with the detected GO terms.
library(GO.db)
library(knitr)
# Previous function to get the description of the identified GO terms:
get_go_description <- function(go_id) {
go_term <- Term(GOTERM[[go_id]])
return(go_term)
}
# GO terms description:
kable(data.frame(Description = sapply(names(result), get_go_description,
USE.NAMES = TRUE)))
Description | |
---|---|
GO:0002718 | regulation of cytokine production involved in immune response |
GO:0032649 | regulation of type II interferon production |
GO:0032729 | positive regulation of type II interferon production |
GO:0001910 | regulation of leukocyte mediated cytotoxicity |
GO:0042267 | natural killer cell mediated cytotoxicity |
GO:0002460 | adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains |
GO:0002819 | regulation of adaptive immune response |
GO:0002821 | positive regulation of adaptive immune response |
GO:0002366 | leukocyte activation involved in immune response |
GO:0002285 | lymphocyte activation involved in immune response |
GO:0001909 | leukocyte mediated cytotoxicity |
GO:0002449 | lymphocyte mediated immunity |
GO:0002703 | regulation of leukocyte mediated immunity |
GO:0002704 | negative regulation of leukocyte mediated immunity |
GO:0002705 | positive regulation of leukocyte mediated immunity |
GO:0002685 | regulation of leukocyte migration |
GO:0002697 | regulation of immune effector process |
GO:1903706 | regulation of hemopoiesis |
GO:0002698 | negative regulation of immune effector process |
GO:0050777 | negative regulation of immune response |
GO:0002696 | positive regulation of leukocyte activation |
GO:0002699 | positive regulation of immune effector process |
GO:0051250 | negative regulation of lymphocyte activation |
GO:0051251 | positive regulation of lymphocyte activation |
GO:0002700 | regulation of production of molecular mediator of immune response |
GO:0002702 | positive regulation of production of molecular mediator of immune response |
GO:0002768 | immune response-regulating cell surface receptor signaling pathway |
GO:0002832 | negative regulation of response to biotic stimulus |
GO:0002833 | positive regulation of response to biotic stimulus |
GO:0045088 | regulation of innate immune response |
GO:0045824 | negative regulation of innate immune response |
GO:0045089 | positive regulation of innate immune response |
GO:0060326 | cell chemotaxis |
GO:0031348 | negative regulation of defense response |
GO:0070665 | positive regulation of leukocyte proliferation |
GO:0051607 | defense response to virus |
GO:0002237 | response to molecule of bacterial origin |
GO:0050900 | leukocyte migration |
GO:0002478 | antigen processing and presentation of exogenous peptide antigen |
GO:0045785 | positive regulation of cell adhesion |
GO:0032102 | negative regulation of response to external stimulus |
GO:0050727 | regulation of inflammatory response |
GO:0032496 | response to lipopolysaccharide |
GO:0070663 | regulation of leukocyte proliferation |
GO:0002228 | natural killer cell mediated immunity |
GO:0002274 | myeloid leukocyte activation |
GO:0022409 | positive regulation of cell-cell adhesion |
GO:0030098 | lymphocyte differentiation |
GO:0046651 | lymphocyte proliferation |
GO:0050867 | positive regulation of cell activation |
GO:0031341 | regulation of cell killing |
GO:0032943 | mononuclear cell proliferation |
GO:0071219 | cellular response to molecule of bacterial origin |
GO:0007159 | leukocyte cell-cell adhesion |
GO:1902105 | regulation of leukocyte differentiation |
All software and respective versions used to produce this document are listed below.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] knitr_1.49 GO.db_3.20.0 ggrepel_0.9.6
## [4] ggplot2_3.5.1 org.Hs.eg.db_3.20.0 AnnotationDbi_1.69.0
## [7] IRanges_2.41.2 S4Vectors_0.45.2 Biobase_2.67.0
## [10] BiocGenerics_0.53.3 generics_0.1.3 goSorensen_1.9.0
## [13] BiocStyle_2.35.0 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gson_0.1.0 rlang_1.1.4
## [4] magrittr_2.0.3 DOSE_4.1.0 compiler_4.4.2
## [7] RSQLite_2.3.9 png_0.1-8 vctrs_0.6.5
## [10] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [13] crayon_1.5.3 fastmap_1.2.0 XVector_0.47.0
## [16] labeling_0.4.3 enrichplot_1.27.3 UCSC.utils_1.3.0
## [19] purrr_1.0.2 bit_4.5.0.1 xfun_0.49
## [22] zlibbioc_1.52.0 cachem_1.1.0 aplot_0.2.4
## [25] GenomeInfoDb_1.43.2 jsonlite_1.8.9 blob_1.2.4
## [28] BiocParallel_1.41.0 parallel_4.4.2 R6_2.5.1
## [31] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
## [34] jquerylib_0.1.4 GOSemSim_2.33.0 Rcpp_1.0.13-1
## [37] goProfiles_1.69.0 ggtangle_0.0.5 R.utils_2.12.3
## [40] Matrix_1.7-1 splines_4.4.2 igraph_2.1.2
## [43] tidyselect_1.2.1 qvalue_2.39.0 yaml_2.3.10
## [46] codetools_0.2-20 lattice_0.22-6 tibble_3.2.1
## [49] plyr_1.8.9 withr_3.0.2 treeio_1.31.0
## [52] KEGGREST_1.47.0 evaluate_1.0.1 gridGraphics_0.5-1
## [55] CompQuadForm_1.4.3 Biostrings_2.75.3 pillar_1.10.0
## [58] BiocManager_1.30.25 ggtree_3.15.0 clusterProfiler_4.15.1
## [61] ggfun_0.1.8 tidytree_0.4.6 munsell_0.5.1
## [64] scales_1.3.0 glue_1.8.0 lazyeval_0.2.2
## [67] maketools_1.3.1 tools_4.4.2 sys_3.4.3
## [70] data.table_1.16.4 fgsea_1.33.0 buildtools_1.0.0
## [73] fs_1.6.5 fastmatch_1.1-4 cowplot_1.1.3
## [76] grid_4.4.2 tidyr_1.3.1 ape_5.8-1
## [79] colorspace_2.1-1 nlme_3.1-166 GenomeInfoDbData_1.2.13
## [82] patchwork_1.3.0 cli_3.6.3 dplyr_1.1.4
## [85] gtable_0.3.6 R.methodsS3_1.8.2 yulab.utils_0.1.8
## [88] sass_0.4.9 digest_0.6.37 ggplotify_0.1.2
## [91] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
## [94] R.oo_1.27.0 lifecycle_1.0.4 httr_1.4.7
## [97] bit64_4.5.2