Given two gene lists, \(L_1\) and
\(L_2\), (the data) and a given set of
\(n\) Gene Ontology (GO) terms (the
frame of reference for biological significance in these lists),
goSorensen makes the required computations to answer the
following question: Is the dissimilarity between the biological
information in both lists negligible? In other words, are both lists
functionally equivalent?
We employ the following metric derived from the Sorensen-Dice index to quantify this dissimilarity:
\[\begin{equation*} \hat{d_S} = 1 - \dfrac{2n_{11}}{2n_{11} + n_{10} + n_{01}} \\ \hat{d_S} = 1 - \dfrac{\frac{2n_{11}}{n}}{\frac{2n_{11}}{n} + \frac{n_{10}}{n} + \frac{n_{01}}{n}} \\ \hat{d_S} = 1 - \dfrac{2\widehat{p}_{11}}{2\widehat{p}_{11} + \widehat{p}_{10} + \widehat{p}_{01}} \end{equation*}\]
where:
The enrichment frequency can be represented in a \(2 \times 2\) contingency table, as follows:
| enriched in \(L_2\) | non-enriched in \(L_2\) | ||
|---|---|---|---|
| enriched in \(L_1\) | \(n_{11}\) | \(n_{10}\) | \(n_{1.}\) |
| non-enriched in \(L_1\) | \(n_{01}\) | \(n_{00}\) | \(n_{0.}\) |
| \(n_{.1}\) | \(n_{.0}\) | \(n\) |
In Flores et al. (2022) it is shown that \(d_S\) asymptotically follows a normal distribution. In cases of low joint enrichment, a sampling distribution derived from the bootstrap approach demonstrates a better fit and provides more suitable results.
Consider the following equivalence hypothesis test:
\[\begin{equation*} H_0:d_S \ge d_0 \\ H_1: d_S < d_0 \end{equation*}\]
where \(d_S\) represents the “true” Sorensen dissimilarity. If this theoretical measure is equivalent to zero, it implies that the compared lists \(L_1\) and \(L_2\) share an important proportion of enriched GO terms, which can be interpreted as biological similarity. Equivalence is understood as an equality, except for negligible deviations, which is defined by the irrelevance limit \(d_0\)
\(d_0\) is a value that should be fixed in advance, greater than 0 and less than 1. In Flores et al. (2022) it is shown that a not-so-arbitrary irrelevance limit is \(d_0 = 0.4444\), or more restrictive \(d_0=0.2857\)
For the moment, the reference set of GO terms can be only all those GO terms in a given level of one GO ontology, either Biological Process (BP), Cellular Component (CC) or Molecular Function (MF).
For more details, see the reference paper.
goSorensen package must be installed with a working R
version (>=4.4.0). Installation could take a few minutes on a regular
desktop or laptop. Package can be installed from Bioconductor, then it
needs to be loaded using library(goSorensen):
The dataset used in this vignette, allOncoGeneLists, is
based on the gene lists compiled at http://www.bushmanlab.org/links/genelists, a
comprehensive set of gene lists related to cancer. The package
goSorensen loads this dataset using
data(allOncoGeneLists):
allOncoGeneLists is an object of class list, containing
seven character vectors with the ENTREZ gene identifiers of a gene list
related to cancer.
atlas cangenes cis miscellaneous sanger
991 189 613 187 450
Vogelstein waldman
419 426
[1] "11186" "27086" "239" "7764" "5934" "246" "57124" "79663" "23136"
[10] "6281" "1307" "1978" "3732" "3636" "7159"
[1] "25" "27" "2181" "57082" "10962" "51517" "27125" "10142" "207"
[10] "208" "217" "238" "57714" "324" "23365"
Before using goSorensen, the users must have adequate
knowledge of the species they intend to focus their analysis on. The
genomic annotation packages available in Bioconductor provide all the
essential information about many species.
For the specific case of this vignette and the help pages of the
package, given that the analysis will be done in the human species, the
org.Hs.eg.db package must be previously installed and
activated as follows:
Actually, the org.Hs.eg.db package is automatically
installed as a dependency on goSorensen, making its
installation unnecessary. However, for any other species, the user must
install the corresponding genome annotation for the species to analyse,
as indicated in the above code.
In addition, it is necessary to have a vector containing the IDs of the gene universe associated with the species under study. The genomic annotation package provides an easy way to obtain this universe. In this vignette, human gene universe IDs are retrieved using ENTREZ identifiers as follows:
Although ENTREZID is used throughout this vignette for illustration
purposes and to match the example dataset, the updated version of
goSorensen is not restricted to ENTREZ identifiers. Other
supported key types available in the corresponding annotation package
may also be used, provided that the same identifier type is used
consistently in the gene lists, the gene universe, and the
keyType argument of the functions.
In this same way, the identifiers of the gene universe can be obtained for any other species.
Other species available in Bioconductor may include:
org.Hs.eg.db: Genome wide annotation for Humans.org.At.tair.db: Genome wide annotation for
Arabidopsisorg.Ag.eg.db: Genome wide annotation for Anophelesorg.Bt.eg.db: Genome wide annotation for Bovineorg.Ce.eg.db: Genome wide annotation for Wormorg.Cf.eg.db: Genome wide annotation for Canineorg.Dm.eg.db: Genome wide annotation for Flyorg.EcSakai.eg.db: Genome wide annotation for E coli
strain Sakaiorg.EcK12.eg.db: Genome wide annotation for E coli
strain K12org.Dr.eg.db: Genome wide annotation for Zebrafishorg.Gg.eg.db: Genome wide annotation for Chickenorg.Mm.eg.db: Genome wide annotation for Mouseorg.Mmu.eg.db: Genome wide annotation for RhesusDue to the extensive research conducted on the human species and the
examples documented in goSorensen for this species, the
installation of the goSorensen package automatically
includes the annotation package org.Hs.eg.db as a
dependency.
If you are working with other species, you must install the appropriate package to use the genomic annotation for those species.”
The first step1 is to determine whether the GO terms of a
specific ontology are enriched or not enriched across the different
lists to be compared. This can be done either at a specific GO level or
without GO level restriction. The enrichedIn function
assigns TRUE when a GO term is enriched in a gene list and
FALSE when it is not.
For example, for the list atlas, which is part of
allOncoGeneLists, the enrichment of GO terms in the BP
ontology at GO level 4 may be obtained as follows:
enrichedAtlas <- enrichedIn(allOncoGeneLists[["atlas"]],
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = 4
)
enrichedAtlasGO:0001649 GO:0030278 GO:0030282 GO:0045778 GO:0060688 GO:0061138 GO:0002263
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0030168 GO:0050866 GO:0050867 GO:0072537 GO:0001818 GO:0001819 GO:0002367
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0002534 GO:0010573 GO:0032602 GO:0032612 GO:0032613 GO:0032615 GO:0032623
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0032635 GO:0071604 GO:0071706 GO:0002443 GO:0002697 GO:0002698 GO:0002699
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0043299 GO:0002218 GO:0034101 GO:0002377 GO:0002700 GO:0002702 GO:0048534
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0002685 GO:1903706 GO:0002695 GO:0050777 GO:0050858 GO:1903707 GO:0002687
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0002696 GO:1903708 GO:0007548 GO:0045137 GO:0048608 GO:0003012 GO:0022600
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0001666
TRUE
[ reached 'max' / getOption("max.print") -- omitted 236 entries ]
attr(,"nTerms")
[1] 3323
The result is a vector containing only the GO terms enriched
(TRUE) in the atlas list in the BP ontology at
GO level 4.
The attribute nTerms indicates the total number of GO
terms, both enriched (TRUE) and non-enriched
(FALSE), by the list atlas in the BP ontology
at GO level 4. To obtain this vector, the logical argument
onlyEnriched (which is TRUE by default) must
be set to FALSE, as follows:
fullEnrichedAtlas <- enrichedIn(allOncoGeneLists[["atlas"]],
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = 4,
onlyEnriched = FALSE
)
fullEnrichedAtlasGO:0036349 GO:0036350 GO:0060256 GO:0060257 GO:1900735 GO:0010590 GO:0030994
FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0030995 GO:1905391 GO:2001042 GO:2001043 GO:0001649 GO:0030278 GO:0030279
FALSE FALSE FALSE FALSE TRUE TRUE FALSE
GO:0030282 GO:0036072 GO:0036075 GO:0036076 GO:0036077 GO:0043931 GO:0043932
TRUE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0045778 GO:0010223 GO:0048755 GO:0060688 GO:0061137 GO:0061138 GO:0080181
TRUE FALSE FALSE TRUE FALSE TRUE FALSE
GO:0002263 GO:0002266 GO:0007343 GO:0007407 GO:0014719 GO:0030168 GO:0032980
TRUE FALSE FALSE FALSE FALSE TRUE FALSE
GO:0042118 GO:0044566 GO:0045321 GO:0050865 GO:0050866 GO:0050867 GO:0061900
FALSE FALSE FALSE FALSE TRUE TRUE FALSE
GO:0071892 GO:0072537 GO:0001780 GO:0002260 GO:0033023 GO:0035702 GO:0036145
FALSE TRUE FALSE FALSE FALSE FALSE FALSE
GO:0061519
FALSE
[ reached 'max' / getOption("max.print") -- omitted 3273 entries ]
attr(,"nTerms")
[1] 3323
The full vector (fullEnrichedAtlas) is much larger than
the vector containing only the enriched GO terms
(enrichedAtlas), which implies a higher memory usage.
Alternatively, the enrichment can be computed without restricting the
analysis to a specific GO level by setting GOLevel = NULL.
For example, for the BP ontology:
enrichedAtlasBP <- enrichedIn(allOncoGeneLists[["atlas"]],
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = NULL
)
enrichedAtlasBPGO:0043065 GO:0097193 GO:2001233 GO:0071214 GO:0104004 GO:0009314 GO:0070482
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0022407 GO:0036293 GO:0001666 GO:0007346 GO:0071478 GO:0042325 GO:0042770
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0045785 GO:1903131 GO:2001242 GO:0050673 GO:0097191 GO:0031401 GO:0030098
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0009411 GO:0045786 GO:1901987 GO:0072331 GO:2001234 GO:0007159 GO:0040008
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0043491 GO:0051249 GO:0050678 GO:0001932 GO:0048545 GO:1903706 GO:0045165
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0050867 GO:1903037 GO:0033002 GO:0071453 GO:1901796 GO:0002696 GO:0044772
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0051896 GO:0007507 GO:0050727 GO:2001235 GO:1902893 GO:2000630 GO:0061614
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0036294
TRUE
[ reached 'max' / getOption("max.print") -- omitted 1634 entries ]
attr(,"nTerms")
[1] 5880
In this case, the result contains only the GO terms enriched
(TRUE) in the selected list, considering all GO terms in
the BP ontology without GO level restriction.
[1] 5880
This attribute gives the total number of GO terms considered in the level-free analysis.
[1] 286
[1] 3323
The length of fullEnrichedAtlas corresponds to the total
number of GO terms in the BP ontology at GO level 4. In contrast, the
length of enrichedAtlas represents only the number of GO
terms that are enriched in the list atlas.
To illustrate the use of alternative identifier types, the seven gene
lists in allOncoGeneLists can also be converted from
ENTREZID to SYMBOL. This allows the user to
run the same analysis using gene symbols instead of ENTREZ identifiers,
provided that the gene lists, the gene universe, and the
keyType argument are all defined consistently.
library(AnnotationDbi)
# Convert all gene lists from ENTREZID to SYMBOL
allOncoGeneListsSYMBOL <- lapply(allOncoGeneLists, function(geneList) {
symbols <- AnnotationDbi::mapIds(org.Hs.eg.db,
keys = geneList,
column = "SYMBOL",
keytype = "ENTREZID",
multiVals = "first"
)
unique(na.omit(symbols))
})
# Obtain the human gene universe using SYMBOL identifiers
humanSymbols <- keys(org.Hs.eg.db, keytype = "SYMBOL")For example, the enrichment of GO terms in the BP ontology at GO
level 4 for the list atlas, now represented using
SYMBOL identifiers, can be obtained as follows:
enrichedAtlasSymbolBP4 <- enrichedIn(allOncoGeneListsSYMBOL$atlas,
geneUniverse = humanSymbols,
orgPackg = "org.Hs.eg.db",
keyType = "SYMBOL",
onto = "BP", GOLevel = 4
)'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
GO:0001649 GO:0030278 GO:0030282 GO:0045778 GO:0060688 GO:0061138 GO:0002263
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0030168 GO:0050866 GO:0050867 GO:0072537 GO:0001818 GO:0001819 GO:0002367
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0002534 GO:0010573 GO:0032602 GO:0032612 GO:0032613 GO:0032615 GO:0032623
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0032635 GO:0071604 GO:0071706 GO:0002443 GO:0002697 GO:0002698 GO:0002699
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0043299 GO:0002218 GO:0034101 GO:0002377 GO:0002700 GO:0002702 GO:0048534
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0002685 GO:1903706 GO:0002695 GO:0050777 GO:0050858 GO:1903707 GO:0002687
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0002696 GO:1903708 GO:0007548 GO:0045137 GO:0048608 GO:0003012 GO:0022600
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0001666
TRUE
[ reached 'max' / getOption("max.print") -- omitted 236 entries ]
attr(,"nTerms")
[1] 3323
The result is a vector containing only the GO terms enriched
(TRUE) in the atlas list in the BP ontology at
GO level 4, now using gene symbols instead of ENTREZ identifiers.
For the seven lists of allOncoGeneLists, the matrix
containing the GO terms enriched in at least one of the lists to be
compared can be calculated either at a specific GO level or without GO
level restriction.
For example, the enrichment matrix for the BP ontology at GO level 4 is obtained as follows:
enrichedInBP4 <- enrichedIn(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = 4
)
enrichedInBP4 atlas cangenes cis miscellaneous sanger Vogelstein waldman
GO:0001649 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0030278 TRUE FALSE FALSE FALSE FALSE TRUE TRUE
GO:0030282 TRUE FALSE FALSE FALSE FALSE FALSE TRUE
GO:0045778 TRUE FALSE FALSE FALSE FALSE FALSE TRUE
GO:0060688 TRUE FALSE FALSE FALSE TRUE TRUE TRUE
GO:0061138 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
GO:0002263 TRUE FALSE TRUE FALSE TRUE TRUE TRUE
GO:0030168 TRUE FALSE FALSE TRUE TRUE FALSE TRUE
GO:0042118 FALSE FALSE FALSE FALSE FALSE FALSE TRUE
GO:0050866 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0050867 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0072537 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0002260 FALSE FALSE TRUE FALSE TRUE TRUE FALSE
GO:0001818 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[ reached 'max' / getOption("max.print") -- omitted 332 rows ]
attr(,"nTerms")
[1] 3323
To obtain the full matrix with all the GO terms in the BP ontology at
GO level 4, we must set the argument onlyEnriched (which is
TRUE by default) to FALSE, as follows:
fullEnrichedInBP4 <- enrichedIn(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = 4,
onlyEnriched = FALSE
)
fullEnrichedInBP4 atlas cangenes cis miscellaneous sanger Vogelstein waldman
GO:0036349 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0036350 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0060256 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0060257 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:1900735 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0010590 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0030994 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0030995 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:1905391 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:2001042 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:2001043 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GO:0001649 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0030278 TRUE FALSE FALSE FALSE FALSE TRUE TRUE
GO:0030279 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[ reached 'max' / getOption("max.print") -- omitted 3309 rows ]
attr(,"nTerms")
[1] 3323
The enrichment matrix can also be obtained without restricting the
analysis to a specific GO level by setting GOLevel = NULL.
For example, for the BP ontology:
enrichedInBP <- enrichedIn(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = NULL
)
enrichedInBP atlas cangenes cis miscellaneous sanger Vogelstein waldman
GO:0097193 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0050673 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:2001233 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0048732 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GO:0050678 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0071214 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
GO:0104004 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
GO:0072331 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
GO:0009314 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
GO:1903706 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0070482 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
GO:0042770 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
GO:0030098 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
GO:0036293 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
[ reached 'max' / getOption("max.print") -- omitted 2681 rows ]
attr(,"nTerms")
[1] 6093
The number of rows in the full matrix
(fullEnrichedInBP4) is much larger than in the matrix
containing only the GO terms enriched in at least one list
(enrichedInBP4), which implies a more intensive memory
usage..
[1] 346
[1] 3323
The number of rows in fullEnrichedInBP4 corresponds to
the total number of GO terms in the BP ontology at GO level 4. In
contrast, the number of rows in enrichedInBP4 represents
only the GO terms enriched in at least one list from
allOncoGeneLists, meaning that each row in this matrix
contains at least one TRUE.
To provide users with a quick visualization, the
goSorensen package includes the objects
enrichedInBP4 and fullEnrichedInBP4, which can
be accessed using data(enrichedInBP4) and
data(fullEnrichedInBP4).
Note that gene lists, GO terms, and Bioconductor may change over
time. So, consider these objects only as illustrative examples, valid
exclusively for the allOncoGeneLists at a specific time.
The current version of these results was generated with Bioconductor
version 3.20. The same comment is applicable to other objects included
in goSorensen for quick visualization, some of which are
also described in this vignette.
The calculations illustrated in this vignette are based on the matrix
containing GO terms enriched in at least one list (in our case,
enrichedInBP4). In the illustrations provided for this
vignette, th
ere is no evidence to suggest that this matrix produces results different from the full matrix, which includes all GO terms for a specific ontology and level, including those that are not enriched in any of the lists being compared. This is very beneficial since the computational cost of processing is much lower than it could be.
The enrichment contingency tables considered in
goSorensen are the direct result of obtaining
cross-frequency tables between pairs of columns (lists) of the
enrichment matrices described in the Section 2 of this vignette. In
general, these are internal details that the user of this package does
not need to worry about.
Possibly, the only aspect to take into account here is that the main
function for this task, buildEnrichTable, always calls
internally the function enrichedIn with the argument
onlyEnriched set to TRUE and, therefore, the
obtained enrichment tables are always in their compact version: Only
rows with at least one TRUE (in other words, GO terms enriched in at
least one gene list).
For the specific case of two gene lists, the function
buildEnrichTable computes the contingency table by
accepting two vectors of the class character containing the
IDs of the lists to be compared. For instance, for the BP ontology at GO
level 4, the contingency table representing the enrichment of GO terms
in the lists atlas and sanger is obtained as
follows:
cont_atlas.sanger_BP4 <- buildEnrichTable(allOncoGeneLists$atlas,
allOncoGeneLists$sanger,
listNames = c("atlas", "sanger"),
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP",
GOLevel = 4
)
cont_atlas.sanger_BP4 Enriched in sanger
Enriched in atlas TRUE FALSE
TRUE 127 159
FALSE 19 3018
Alternatively, the contingency table can be computed in a level-free
framework by setting GOLevel = NULL. For example, for the
BP ontology:
cont_atlas.sanger_BP <- buildEnrichTable(allOncoGeneLists$atlas,
allOncoGeneLists$sanger,
listNames = c("atlas", "sanger"),
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP",
GOLevel = NULL
)
cont_atlas.sanger_BP Enriched in sanger
Enriched in atlas TRUE FALSE
TRUE 696 988
FALSE 67 4129
In this case, the contingency table is built considering all GO terms in the selected ontology, without GO level restriction.
The result is an enrichment contingency table of class
table. If the argument storeEnrichedIn of
buildEnrichTable was set to TRUE (the default
value), it has an attribute, enriched, with the logical
matrix of enriched GO terms in these gene lists, i.e., the output of
function enrichedIn (always the compact form of these
matrices, only rows with at least one TRUE).
To provide users with a quick visualization, the
goSorensen package includes the object
cont_atlas.sanger_BP4, which can be accessed using
data(cont_atlas.sanger_BP4).
Given \(s\) (\(s \geq 2\)) lists to compare, the \(s(s-1)/2\) possible enrichment contingency
tables can also be obtained using the function
buildEnrichTable. Instead of providing two vectors as the
main argument, we provide an object of the class list,
containing at least two elements, each of which contains the identifiers
of the different lists to be compared.
For example, for the BP ontology at GO level 4, the \(7(6)/2=21\) contingency tables calculated
from the 7 gene lists contained in allOncoGeneLists are
obtained as follows:
cont_all_BP4 <- buildEnrichTable(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP",
GOLevel = 4
)Alternatively, the contingency tables can be computed in a level-free
framework by setting GOLevel = NULL. For example, for the
BP ontology:
cont_all_BP <- buildEnrichTable(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP",
GOLevel = NULL
)In this case, the resulting object contains all pairwise enrichment contingency tables between the compared gene lists, considering all GO terms in the selected ontology without GO level restriction.
The result is an object of the class tableList, which is
exclusive from goSorensen and contains all the possible
enrichment contingency tables between the compared gene lists at GO
level 4 for the ontology BP. Since the output is very large, it is not
displayed it in this vignette.
If the argument storeEnrichedIn of
buildEnrichTable was set to TRUE (its default
value), an important attribute of this object is enriched,
accessible via attr(cont_all_BP4, "enriched"), which
contains the enrichment matrix obtained using the
enrichedIn function. For this particular case,
attr(cont_all_BP4, "enriched") contains exactly the same
information as the object enrichedInBP4 from Section 2.2 of
this vignette.
To provide users with a quick visualization, the
goSorensen package includes the object
cont_all_BP4, which can be accessed using
data(cont_all_BP4).
When you want to obtain contingency tables for two or more lists
across multiple ontologies, either for several GO levels or without GO
level restriction, you can use the function
allBuildEnrichTable.
For example to obtain the \(7(6)/2=21\) contingency tables calculated
from the 7 gene lists in allOncoGeneLists for the three
ontologies (BP, CC, and MF) and for the GO levels from 3 to 10, you can
use the function allBuildEnrichTable as follows:
allContTabs <- allBuildEnrichTable(allOncoGeneLists,
geneUnierse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
ontos = c("BP", "CC", "MF"),
GOLevels = 3:10
)The contingency tables can also be obtained without restricting the
analysis to specific GO levels by setting GOLevels = NULL,
as follows:
allContTabsNoLevel <- allBuildEnrichTable(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
ontos = c("BP", "CC", "MF"),
GOLevels = NULL
)In this case, the output contains one tableList object
for each selected ontology, without the intermediate subdivision by GO
levels.
The result is an object of the class allTableList, which
is exclusive to goSorensen and contains all possible
enrichment contingency tables between the compared gene lists for the
BP, CC, and MF ontologies, and for GO levels 3 to 10. Since the output
is very large, it is not displayed in this vignette.
The attribute enriched is present in each element of
this output, meaning that for each ontology and GO level contained in
this object, there is an enrichment matrix similar to the one obtained
with the function enrichedIn. For instance, by running the
code attr(allContTabs$BP$'level 4', 'enriched'), you can
access the enrichment matrix enrichedInBP4 obtained in
Section 2.2 of this vignette.
In the level-free case, the attribute enriched is also
available for each ontology, but without the intermediate subdivision by
GO levels.
To provide users with a quick visualization, the
goSorensen package includes the object
allContTabs, which can be accessed using
data(allContTabs).
The function equivTestSorensen performs an equivalence
test to detect equivalence between gene lists.
For the specific case of two gene lists, you need to provide the
function equivTestSorensen with two character vectors
containing the IDs of the lists to be compared.
For example, using an asymptotic normal distribution, an irrelevance
limit \(d_0=0.4444\) and a significance
level \(\alpha=0.05\) (the default
parameters), an equivalence test to compare the lists atlas
and sanger for the BP ontology at GO level 4 can be
performed as follows:
eqTest_atlas.sanger_BP4 <- equivTestSorensen(allOncoGeneLists$atlas,
allOncoGeneLists$sanger,
listNames = c("atlas", "sanger"),
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = 4,
d0 = 0.4444,
conf.level = 0.95
)
eqTest_atlas.sanger_BP4
Normal asymptotic test for 2x2 contingency tables based on the
Sorensen-Dice dissimilarity
data: cont_atlas.sanger_BP4
(d - d0) / se = -1.1, p-value = 0.1
alternative hypothesis: true equivalence limit d0 is less than 0.4444
95 percent confidence interval:
0.0000 0.4584
sample estimates:
Sorensen dissimilarity
0.412
attr(,"se")
standard error
0.02819
Alternatively, the equivalence test can be performed in a level-free
framework by setting GOLevel = NULL. For example, for the
BP ontology:
eqTest_atlas.sanger_BP <- equivTestSorensen(allOncoGeneLists$atlas,
allOncoGeneLists$sanger,
listNames = c("atlas", "sanger"),
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP", GOLevel = NULL,
d0 = 0.4444,
conf.level = 0.95
)
eqTest_atlas.sanger_BP
Normal asymptotic test for 2x2 contingency tables based on the
Sorensen-Dice dissimilarity
data: cont_atlas.sanger_BP
(d - d0) / se = -1.1, p-value = 0.1
alternative hypothesis: true equivalence limit d0 is less than 0.4444
95 percent confidence interval:
0.0000 0.4508
sample estimates:
Sorensen dissimilarity
0.4311
attr(,"se")
standard error
0.01198
In this case, the equivalence test is based on the enrichment contingency table built from all GO terms in the selected ontology, without GO level restriction.
If the enrichment contingency table is available prior to performing
the test, such as cont_atlas.sanger_BP4 determined in
Section 3.1.1, the execution time for the calculation is much shorter.
To use equivTestSorensen with the contingency table as
input, proceed as follows:
Normal asymptotic test for 2x2 contingency tables based on the
Sorensen-Dice dissimilarity
data: cont_atlas.sanger_BP4
(d - d0) / se = -1.1, p-value = 0.1
alternative hypothesis: true equivalence limit d0 is less than 0.4444
95 percent confidence interval:
0.0000 0.4584
sample estimates:
Sorensen dissimilarity
0.412
attr(,"se")
standard error
0.02819
As you can see, both procedures produce the same result, but the last
one (whenever possible) is much faster because no time is wasted
internally generating the contingency table from the lists of genes and
GO terms. Regardless of the procedure, the result is an object of class
equivSDhtest (a specialization of class
htest), which is exclusive from
goSorensen.
If you want to change the calculation parameters of the test, such as
using a bootstrap distribution instead of a normal distribution, setting
an irrelevance limit of \(d_0 =
0.2857\) instead of \(d_0
=0.4444\) (or any other), or changing the significance level to
\(\alpha = 0.01\) instead of \(\alpha = 0.05\) (or any other), one option
would be to use the equivTestSorensen function again with
the new parameters. However, this would require performing all the
calculations again, leading to additional computational costs, which
increase as more tests are performed. Instead, the function
upgrade allows you to update the test output much more
quickly by simply specifying the name of the object where the test
results are stored and the new parameters you wish to apply, as shown
below:
Bootstrap test for 2x2 contingency tables based on the Sorensen- Dice
dissimilarity (10000 bootstrap replicates)
data: tab
(d - d0) / se = 4.5, p-value = 1
alternative hypothesis: true equivalence limit d0 is less than 0.2857
99 percent confidence interval:
0.0000 0.4791
sample estimates:
Sorensen dissimilarity
0.412
attr(,"se")
standard error
0.02819
Given \(s\) (\(s \geq 2\)) lists to compare, the \(s(s-1)/2\) possible equivalence tests can
also be obtained using the function equivTestSorensen
Instead of providing two vectors as the main argument, we provide an
object of the class list, containing at least two elements,
each of which contains the identifiers of the different lists to be
compared.
For example, for the BP ontology at GO level 4, the \(7(6)/2=21\) possible test calculated from
the 7 gene lists contained in allOncoGeneLists are obtained
as follows:
eqTest_all_BP4 <- equivTestSorensen(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP",
GOLevel = 4,
d0 = 0.4444,
conf.level = 0.95
)The equivalence tests can also be computed without restricting the
analysis to a specific GO level by setting GOLevel = NULL.
For example, for the BP ontology:
allEqTestsBP <- equivTestSorensen(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
onto = "BP",
GOLevel = NULL,
d0 = 0.4444,
conf.level = 0.95
)In this case, the result contains all pairwise equivalence tests between the compared gene lists, considering all GO terms in the selected ontology without GO level restriction.
But remember, it is much simpler if we already have the contingency
tables as an object of the class tableList. In our case, we
have already calculated the contingency tables for all possible pairs of
allOncoGeneLists for the ontology BP, GO level 4, in
Section 3.1.2 and stored them in the object cont_all_BP4.
Therefore, we can calculate the eqTest_all_BP4 object more
efficiently in the following way:
Remember that similarly to the comparison of two lists in Section
4.1.1, you can use the function upgrade to update the
results by changing the parameters of the tests, such as the confidence
level, irrelevance limit, and others. For instance,
upgrade(eqTest_atlas.sanger_BP4, d0 = 0.2857 to update the
equivalence test using an irrelevance limit \(d_0=0.2857\).
Since the output contained in eqTest_all_BP4 is very
large, it is not displayed in this vignette. However,
goSorensen provides accessor functions that allow you to
retrieve specific outputs of interest. For example, to obtain a summary
of the Sorensen dissimilarities contained in the tests comparing all
pairs of lists in the BP ontology at GO level 4, you can use the
function getDissimilarity and retrieve them as follows:
atlas cangenes cis miscellaneous sanger Vogelstein waldman
atlas 0.0000 1 0.7723 0.4935 0.4120 0.3909 0.3240
cangenes 1.0000 0 1.0000 1.0000 1.0000 1.0000 1.0000
cis 0.7723 1 0.0000 0.6857 0.6973 0.7202 0.7391
miscellaneous 0.4935 1 0.6857 0.0000 0.4413 0.4196 0.4032
sanger 0.4120 1 0.6973 0.4413 0.0000 0.1200 0.4167
Vogelstein 0.3909 1 0.7202 0.4196 0.1200 0.0000 0.3696
waldman 0.3240 1 0.7391 0.4032 0.4167 0.3696 0.0000
Another example of accessor function is the function
getPvalue to obtain the p-values across the object
eqTest_all_BP4:
atlas cangenes cis miscellaneous sanger Vogelstein
atlas 0.000e+00 NaN 1 0.9429 1.254e-01 2.579e-02
cangenes NaN 0 NaN NaN NaN NaN
cis 1.000e+00 NaN 0 1.0000 1.000e+00 1.000e+00
miscellaneous 9.429e-01 NaN 1 0.0000 4.675e-01 2.510e-01
sanger 1.254e-01 NaN 1 0.4675 0.000e+00 5.872e-60
Vogelstein 2.579e-02 NaN 1 0.2510 5.872e-60 0.000e+00
waldman 2.990e-07 NaN 1 0.1045 1.854e-01 5.601e-03
waldman
atlas 2.990e-07
cangenes NaN
cis 1.000e+00
miscellaneous 1.045e-01
sanger 1.854e-01
Vogelstein 5.601e-03
waldman 0.000e+00
NaN values occur when the test statistic cannot be calculated due to an indeterminacy, for example when the standard error of the sample Sorensen-Dice dissimilarity cannot be calculated or is null. One of these scenarios occurs when there is no joint enrichment between two lists (i.e., when \(n_{11}=0\)).
In addition other accesor functions include: getSE for
the standard error, getUpper for the upper bound of the
confidence interval and getTable for the enrichment
contingency tables.
To provide users with a quick visualization, the
goSorensen package includes the object
eqTest_all_BP4, which can be accessed using
data(eqTest_all_BP4).
When you want to obtain the outputs of the equivalence tests to
compare two or more lists across multiple ontologies, either for several
GO levels or without GO level restriction, you can use the function
allEquivTestSorensen
For example to obtain the \(7(6)/2=21\) equivalence tests calculated
from the 7 gene lists in allOncoGeneLists for the three
ontologies (BP, CC, and MF) and for the GO levels from 3 to 10, you can
use the function allEquivTestSorensen as follows:
allEqTests <- allEquivTestSorensen(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
ontos = c("BP", "CC", "MF"),
GOLevels = 3:10,
d0 = 0.4444,
conf.level = 0.95
)Alternatively, the equivalence tests can be computed without
restricting the analysis to specific GO levels by setting
GOLevels = NULL, as follows:
allEqTestsNoLevel <- allEquivTestSorensen(allOncoGeneLists,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db",
ontos = c("BP", "CC", "MF"),
GOLevels = NULL,
d0 = 0.4444,
conf.level = 0.95
)In this case, the output contains the equivalence tests for all pairwise comparisons between the gene lists within each selected ontology, considering all GO terms in the ontology without GO level restriction.
But remember, it is much simpler if we already have the contingency
tables as an object of the class allTableList In our case,
we have already calculated the contingency tables for all possible pairs
of allOncoGeneLists for the ontologies BP, CC, MF, and for
the GO levels 3 to 10, in Section 3.2 and stored them in the object
allContTabs Therefore, we can calculate the
allEqTests object more efficiently in the following
way:
Likewise, if the contingency tables were previously computed without
GO level restriction, the corresponding equivalence tests can also be
obtained more efficiently from that allConTabsNoLevel
object.
The result is an object of the class AllEquivSDhtest,
which is exclusive to goSorensen.
In a similar way to what is explained in Section 4.1.1 and 4.1.2, you
can use the function upgrade to update the results by
changing the parameters of the tests, such as the confidence level,
irrelevance limit, sample distribution (normal or bootstrap) and
others.
You can use also the accessor functions to obtain key test outputs,
such as the Sorensen dissimilarities (getDissimilarity),
p-values (getPvalue), enrichment contingency tables
(getTable), and more.
To provide users with a quick visualization, the
goSorensen package includes the object
allEqTests, which can be accessed using
data(allEqTests).
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Hs.eg.db_3.23.1 DT_0.34.0 GO.db_3.23.1
[4] AnnotationDbi_1.75.0 IRanges_2.47.2 S4Vectors_0.51.3
[7] Biobase_2.73.1 BiocGenerics_0.59.6 generics_0.1.4
[10] ggrepel_0.9.8 ggplot2_4.0.3 goSorensen_1.15.0
[13] BiocStyle_2.41.0
loaded via a namespace (and not attached):
[1] DBI_1.3.0 gson_0.1.0 httr2_1.2.2
[4] rlang_1.2.0 magrittr_2.0.5 DOSE_4.7.0
[7] otel_0.2.0 compiler_4.6.0 RSQLite_3.53.1
[10] systemfonts_1.3.2 png_0.1-9 callr_3.7.6
[13] vctrs_0.7.3 reshape2_1.4.5 stringr_1.6.0
[16] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
[19] XVector_0.53.0 labeling_0.4.3 rmarkdown_2.31
[22] enrichplot_1.33.0 ps_1.9.3 purrr_1.2.2
[25] bit_4.6.0 xfun_0.57 cachem_1.1.0
[28] aplot_0.2.9 jsonlite_2.0.0 blob_1.3.0
[31] tidydr_0.0.6 tweenr_2.0.3 cluster_2.1.8.2
[34] parallel_4.6.0 R6_2.6.1 bslib_0.11.0
[37] stringi_1.8.7 RColorBrewer_1.1-3 enrichit_0.1.4
[40] jquerylib_0.1.4 GOSemSim_2.39.0 Rcpp_1.1.1-1.1
[43] Seqinfo_1.3.0 knitr_1.51 goProfiles_1.75.0
[46] ggtangle_0.1.2 splines_4.6.0 igraph_2.3.1
[49] aisdk_1.1.0 tidyselect_1.2.1 qvalue_2.45.0
[52] yaml_2.3.12 processx_3.9.0 lattice_0.22-9
[55] tibble_3.3.1 plyr_1.8.9 treeio_1.37.0
[58] withr_3.0.2 KEGGREST_1.53.0 S7_0.2.2
[61] evaluate_1.0.5 CompQuadForm_1.4.4 gridGraphics_0.5-1
[64] polyclip_1.10-7 scatterpie_0.2.6 Biostrings_2.81.2
[67] pillar_1.11.1 BiocManager_1.30.27 ggtree_4.3.0
[70] clusterProfiler_4.21.0 ggfun_0.2.0 scales_1.4.0
[73] tidytree_0.4.7 glue_1.8.1 gdtools_0.5.1
[76] lazyeval_0.2.3 maketools_1.3.2 tools_4.6.0
[79] ggnewscale_0.5.2 sys_3.4.3 ggiraph_0.9.6
[82] buildtools_1.0.0 fs_2.1.0 grid_4.6.0
[85] tidyr_1.3.2 ape_5.8-1 crosstalk_1.2.2
[88] nlme_3.1-169 patchwork_1.3.2 ggforce_0.5.0
[91] cli_3.6.6 rappdirs_0.3.4 fontBitstreamVera_0.1.1
[94] dplyr_1.2.1 gtable_0.3.6 yulab.utils_0.2.4
[97] fontquiver_0.2.1 sass_0.4.10 digest_0.6.39
[100] ggplotify_0.1.3
[ reached 'max' / getOption("max.print") -- omitted 9 entries ]
In fact, this is an internal step hidden within the
function buildEnrichTable described in Section 3. However,
providing a brief explanation of this process may help clarify certain
details about how enrichment contingency tables are constructed. For
users working with the package at a not-so-advanced level, this step can
be skipped without affecting their understanding of how to use the core
functions of goSorensen.↩︎