The goal of goSorensen is to implement the equivalence test introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J.(2022) “An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO node enrichment”, BMC Bioinformatics, 2022 23:207.
Given two gene lists, L1 and L2, (the data) and a given set of n Gene Ontology (GO) terms (the frame of reference for biological significance in these lists), the test is devoted to answer the following question (quite informally stated for the moment): The dissimilarity between the biological information in both lists, is it negligible? To measure the dissimilarity we use the Sorensen-Dice index:
$$ \hat d_S = \hat d(L_1,L_2) = \frac{2n_{11}}{2n_{11} + n_{10} + n_{01}} $$
where n11 corresponds to the number of GO terms (among the n GO terms under consideration) which are enriched in both gene lists, n10 corresponds to the GO terms enriched in L1 but not in L2 and n01 the reverse, those enriched in L2 but not in L1. For notation completeness, n00 would correspond to those GO terms not enriched in both lists; it is not considered by the Sorensen-Dice index but would be necessary in some computations. Obviously, n = n11 + n10 + n01 + n00.
More precisely, the above problem can be restated as follows: Given a negligibility threshold d0 for the Sorensen-Dice values, to decide negligibility corresponds to rejecting the null hypothesis H0 : dS ≥ d0 in favour of the alternative H1 : dS < d0, where dS stands for the “true” value of the Sorensen-Dice dissimilarity (L1 and L2 are samples, and the own process of declaring enrichment of a GO term is random, so d̂S = d̂(L1, L2) is an estimate of dS). Then, a bit more precise statement of the problem is “The dissimilarity between the biological information in two gene lists, is it negligible up to a degree d0?” Where this information is expressed by means of the Sorensen-Dice dissimilarity measured on the degree of coincidence and non-coincidence in GO terms enrichment among a given set of GO terms.
For the moment, the reference set of GO terms can be only all those GO terms in a given level of one GO ontology, either BP, CC or MF.
goSorensen package must be installed with a working R version
(>=4.3.0). Installation could take a few minutes on a regular desktop
or laptop. Package can be installed from Bioconductor, then it needs to
be loaded using library(goSorensen)
:
The dataset used in this vignette, allOncoGeneLists
, is
based on the gene lists compiled at http://www.bushmanlab.org/links/genelists, a
comprehensive set of gene lists related to cancer. The package
goSorensen
loads this dataset using
data(allOncoGeneLists)
:
allOncoGeneLists
is an object of class list, containing
seven character vectors with the ENTREZ gene identifiers of a gene list
related to cancer.
## [1] 7
## atlas cangenes cis miscellaneous sanger
## 991 189 613 187 450
## Vogelstein waldman
## 419 426
# First 20 gene identifiers of gene lists Vogelstein and sanger:
allOncoGeneLists[["Vogelstein"]][1:20]
## [1] "10006" "25" "27" "23305" "91" "4299" "3899" "27125"
## [9] "207" "238" "139285" "324" "367" "23092" "23365" "8289"
## [17] "57492" "196528" "405" "79058"
## [1] "25" "27" "2181" "57082" "10962" "51517" "27125" "10142"
## [9] "207" "208" "217" "238" "57714" "324" "23365" "399"
## [17] "8289" "405" "79058" "171023"
Before using goSorensen,
the users must have adequate
knowledge of the species they intend to focus their analysis on. The
genomic annotation packages available in Bioconductor provide all the
essential information about many species.
For the specific case of this vignette, given that the analysis will
be done in the human species, the org.Hs.eg.db
package must
be previously installed and activated as follows:
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
BiocManager::install("org.Hs.eg.db")
}
library(org.Hs.eg.db)
Actually, the org.Hs.eg.db
package is automatically
installed as a dependency on goSorensen
, making its
installation unnecessary. However, for any other species, the user must
install the correspondence genome annotation for the species to analyze,
as indicated in the above code.
In addition, it is necessary to have a vector containing the IDs of the universe of genes associated with the species under study. The genomic annotation package provides an easy way to obtain this universe. The ENTREZ identifiers of the gene universe for humans, necessary for this vignette, is obtained as follows:
In this same way, the identifiers of the gene universe can be obtained for any other species.
Function equivTestSorensen
performs the equivalence
test. One possibility is to build first the joint enrichment contingency
table using the function buildEnrichTable
and then to
perform the equivalence test:
# Build the enrichment contingency table between gene lists Vogelstein and
# sanger for the MF ontology at GO level 5:
enrichTab <- buildEnrichTable(allOncoGeneLists[["Vogelstein"]],
allOncoGeneLists[["sanger"]],
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "MF", GOLevel = 5,
listNames = c("Vogelstein", "sanger"))
enrichTab
## Enriched in sanger
## Enriched in Vogelstein TRUE FALSE
## TRUE 33 10
## FALSE 2 1958
# Equivalence test for an equivalence (or negligibility) limit 0.2857
testResult <- equivTestSorensen(enrichTab, d0 = 0.2857)
testResult
##
## Normal asymptotic test for 2x2 contingency tables based on the
## Sorensen-Dice dissimilarity
##
## data: enrichTab
## (d - d0) / se = -2.9711, p-value = 0.001484
## alternative hypothesis: true equivalence limit d0 is less than 0.2857
## 95 percent confidence interval:
## 0.0000000 0.2268426
## sample estimates:
## Sorensen dissimilarity
## 0.1538462
## attr(,"se")
## standard error
## 0.0443787
Performing the test directly from the gene lists is also possible:
equivTestSorensen(allOncoGeneLists[["Vogelstein"]],
allOncoGeneLists[["sanger"]], d0 = 0.2857,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "MF", GOLevel = 5,
listNames = c("Vogelstein", "sanger"))
##
## Normal asymptotic test for 2x2 contingency tables based on the
## Sorensen-Dice dissimilarity
##
## data: tab
## (d - d0) / se = -2.9711, p-value = 0.001484
## alternative hypothesis: true equivalence limit d0 is less than 0.2857
## 95 percent confidence interval:
## 0.0000000 0.2268426
## sample estimates:
## Sorensen dissimilarity
## 0.1538462
## attr(,"se")
## standard error
## 0.0443787
The first option (building previously the contingency table) may be suitable to save computing time.
The above tests are based on the normal distribution. The next section shows how to use the bootstrap distribution, which is suitable for low enrichment levels.
The following code computes the results of an equivalence test based on the bootstrap distribution:
##
## Bootstrap test for 2x2 contingency tables based on the Sorensen-Dice
## dissimilarity (10000 bootstrap replicates)
##
## data: enrichTab
## (d - d0) / se = -2.9711, p-value = 0.0146
## alternative hypothesis: true equivalence limit d0 is less than 0.2857
## 95 percent confidence interval:
## 0.0000000 0.2449937
## sample estimates:
## Sorensen dissimilarity
## 0.1538462
## attr(,"se")
## standard error
## 0.0443787
As one can see, to obtain the bootstrap results, we only have used
the argument boot=TRUE
.
For low frequencies in the contingency table, bootstrap is a more conservative but preferable approach, with better type I error control.
The outputs of the equivalence test, such as the Sorensen dissimilarity, the standard error of the test, and the confidence limit, can be computed individually.
## [1] 0.1538462
# The Sorensen dissimilarity from the gene lists:
dSorensen(allOncoGeneLists[["Vogelstein"]],
allOncoGeneLists[["sanger"]],
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "MF", GOLevel = 5,
listNames = c("Vogelstein", "sanger"))
## [1] 0.1538462
## [1] 0.0443787
# or from the gene lists:
seSorensen(allOncoGeneLists[["Vogelstein"]],
allOncoGeneLists[["sanger"]],
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "MF", GOLevel = 5,
listNames = c("Vogelstein", "sanger"))
## [1] 0.0443787
## [1] 0.2268426
## [1] 0.2107197
## [1] 0.2207591
## attr(,"eff.nboot")
## [1] 9999
# Upper limit of the confidence interval from the gene lists:
duppSorensen(allOncoGeneLists[["Vogelstein"]],
allOncoGeneLists[["sanger"]],
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "MF", GOLevel = 5,
listNames = c("Vogelstein", "sanger"))
## [1] 0.2268426
To access specific results from the equivalence test above computed,
one can use the functions of the type get...
, as
follows:
## Sorensen dissimilarity
## 0.1538462
## attr(,"se")
## standard error
## 0.0443787
## standard error
## 0.0443787
## p-value
## 0.001483644
## Enriched in sanger
## Enriched in Vogelstein TRUE FALSE
## TRUE 33 10
## FALSE 2 1958
## dUpper
## 0.2268426
## p-value
## 0.01459854
## dUpper
## 0.2449937
# (Only available for bootstrap tests) efective number of bootstrap resamples:
getNboot(boot.testResult)
## [1] 10000
After performing the equivalence test calculations, the results can be updated without redoing the calculations. This can be done by using different values for the irrelevance limit, level of significance, distribution, and number of resamples (in the case of Bootstrap) than the ones initially provided to the function.
For example, the results saved in the testResult
object
were calculated using the arguments: d0 = 0.2857
,
conf.level = 0.95
and boot = FALSE
(using
normal distribution). Now, we are going to upgrade the results with
other arguments, as follows:
##
## Bootstrap test for 2x2 contingency tables based on the Sorensen-Dice
## dissimilarity (10000 bootstrap replicates)
##
## data: tab
## (d - d0) / se = -6.5471, p-value = 2e-04
## alternative hypothesis: true equivalence limit d0 is less than 0.4444
## 99 percent confidence interval:
## 0.0000000 0.2909462
## sample estimates:
## Sorensen dissimilarity
## 0.1538462
## attr(,"se")
## standard error
## 0.0443787
Since goSorensen
has been built under the S3
object-oriented programming paradigm, it is not only possible to perform
calculations for a couple of gene lists. If instead of entering two
vectors with the lists L1, L2
of genes to be compared as input to the functions, an object of the
class “list” is entered, which contains several vectors with L1, L2, …, Ls
lists, then some specific goSorensen
functions perform the
calculations for all possible pairs of gene lists that are formed.
For instance, the object allOncoGeneLists
is a list
object that consists of seven vectors, each corresponding to a list of
genes. For a given ontology and GO level, we can compute the
dissimilarity between all possible pairs of comparisons among the lists.
In this case, there are a total of 21 pairs. The computation is as
follows:
totalDiss <- dSorensen(allOncoGeneLists, onto = "MF", GOLevel = 5,
geneUniverse = humanEntrezIDs, orgPackg = "org.Hs.eg.db")
round(totalDiss, 2)
## atlas cangenes cis miscellaneous sanger Vogelstein waldman
## atlas 0.00 1 0.76 0.49 0.37 0.38 0.37
## cangenes 1.00 0 1.00 1.00 1.00 1.00 1.00
## cis 0.76 1 0.00 0.69 0.72 0.73 0.76
## miscellaneous 0.49 1 0.69 0.00 0.46 0.52 0.29
## sanger 0.37 1 0.72 0.46 0.00 0.15 0.45
## Vogelstein 0.38 1 0.73 0.52 0.15 0.00 0.45
## waldman 0.37 1 0.76 0.29 0.45 0.45 0.00
Similarly, the following code performs all pairwise tests:
allTests <- equivTestSorensen(allOncoGeneLists, d0 = 0.2857,
onto = "MF", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
getPvalue(allTests)
## cangenes.atlas.p-value cis.atlas.p-value
## NaN 1.000000000
## cis.cangenes.p-value miscellaneous.atlas.p-value
## NaN 0.998231735
## miscellaneous.cangenes.p-value miscellaneous.cis.p-value
## NaN 0.999894013
## sanger.atlas.p-value sanger.cangenes.p-value
## 0.919688219 NaN
## sanger.cis.p-value sanger.miscellaneous.p-value
## 0.999999142 0.984991465
## Vogelstein.atlas.p-value Vogelstein.cangenes.p-value
## 0.949193143 NaN
## Vogelstein.cis.p-value Vogelstein.miscellaneous.p-value
## 0.999999941 0.999072617
## Vogelstein.sanger.p-value waldman.atlas.p-value
## 0.001483644 0.921125359
## waldman.cangenes.p-value waldman.cis.p-value
## NaN 0.999999994
## waldman.miscellaneous.p-value waldman.sanger.p-value
## 0.540542350 0.990217132
## waldman.Vogelstein.p-value
## 0.994228782
## atlas cangenes cis miscellaneous sanger Vogelstein
## atlas 0.0000000 1 0.7627119 0.4933333 0.3720930 0.3829787
## cangenes 1.0000000 0 1.0000000 1.0000000 1.0000000 1.0000000
## cis 0.7627119 1 0.0000000 0.6875000 0.7209302 0.7254902
## miscellaneous 0.4933333 1 0.6875000 0.0000000 0.4576271 0.5223881
## sanger 0.3720930 1 0.7209302 0.4576271 0.0000000 0.1538462
## Vogelstein 0.3829787 1 0.7254902 0.5223881 0.1538462 0.0000000
## waldman 0.3695652 1 0.7551020 0.2923077 0.4473684 0.4523810
## waldman
## atlas 0.3695652
## cangenes 1.0000000
## cis 0.7551020
## miscellaneous 0.2923077
## sanger 0.4473684
## Vogelstein 0.4523810
## waldman 0.0000000
All software and respective versions used to produce this document are listed below.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.20.0 AnnotationDbi_1.69.0 IRanges_2.41.2
## [4] S4Vectors_0.45.2 Biobase_2.67.0 BiocGenerics_0.53.3
## [7] generics_0.1.3 goSorensen_1.9.0 BiocStyle_2.35.0
## [10] rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gson_0.1.0 rlang_1.1.4
## [4] magrittr_2.0.3 DOSE_4.1.0 compiler_4.4.2
## [7] RSQLite_2.3.9 png_0.1-8 vctrs_0.6.5
## [10] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [13] crayon_1.5.3 fastmap_1.2.0 XVector_0.47.2
## [16] enrichplot_1.27.4 UCSC.utils_1.3.1 purrr_1.0.2
## [19] bit_4.5.0.1 xfun_0.50 cachem_1.1.0
## [22] aplot_0.2.4 GenomeInfoDb_1.43.2 jsonlite_1.8.9
## [25] blob_1.2.4 BiocParallel_1.41.0 parallel_4.4.2
## [28] R6_2.5.1 bslib_0.8.0 stringi_1.8.4
## [31] RColorBrewer_1.1-3 jquerylib_0.1.4 GOSemSim_2.33.0
## [34] Rcpp_1.0.14 knitr_1.49 goProfiles_1.69.0
## [37] ggtangle_0.0.6 R.utils_2.12.3 Matrix_1.7-1
## [40] splines_4.4.2 igraph_2.1.3 tidyselect_1.2.1
## [43] qvalue_2.39.0 yaml_2.3.10 codetools_0.2-20
## [46] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [49] treeio_1.31.0 KEGGREST_1.47.0 evaluate_1.0.3
## [52] gridGraphics_0.5-1 CompQuadForm_1.4.3 Biostrings_2.75.3
## [55] pillar_1.10.1 BiocManager_1.30.25 ggtree_3.15.0
## [58] clusterProfiler_4.15.1 ggfun_0.1.8 ggplot2_3.5.1
## [61] munsell_0.5.1 scales_1.3.0 tidytree_0.4.6
## [64] glue_1.8.0 lazyeval_0.2.2 maketools_1.3.1
## [67] tools_4.4.2 sys_3.4.3 data.table_1.16.4
## [70] fgsea_1.33.2 buildtools_1.0.0 fs_1.6.5
## [73] fastmatch_1.1-6 cowplot_1.1.3 grid_4.4.2
## [76] tidyr_1.3.1 ape_5.8-1 colorspace_2.1-1
## [79] nlme_3.1-166 GenomeInfoDbData_1.2.13 patchwork_1.3.0
## [82] cli_3.6.3 dplyr_1.1.4 gtable_0.3.6
## [85] R.methodsS3_1.8.2 yulab.utils_0.1.9 sass_0.4.9
## [88] digest_0.6.37 ggrepel_0.9.6 ggplotify_0.1.2
## [91] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
## [94] R.oo_1.27.0 lifecycle_1.0.4 httr_1.4.7
## [97] GO.db_3.20.0 bit64_4.6.0-1
Flores, P., Salicrú, M., Sánchez-Pla, A. et al. An equivalence test between features lists, based on the Sorensen–Dice index and the joint frequencies of GO term enrichment. BMC Bioinformatics 23, 207 (2022). https://doi.org/10.1186/s12859-022-04739-2