Gene Ontologies (GO) are often used to guide the interpretation of high-throughput omics experiments, with lists of differentially regulated genes being summarized into sets of genes with a common functional representation. Due to the hierarchical nature of Gene Ontologies, the resulting lists of enriched sets are usually redundant and difficult to interpret.
rrvgo aims to reduce the redundancy of GO sets by grouping similar terms based on their semantic similarity. It also provides some plots to help with interpreting the summarized terms.
This software is heavily influenced by REVIGO. It mimics a good part of its core functionality, and even some of the outputs are similar. Without aiming to compete, rrvgo tries to offer a programmatic interface using available annotation databases and semantic similarity methods implemented in the Bioconductor project.
Starting with a list of genes of interest (e.g. coming from a differential expression analysis), apply any method for the identification of enriched GO terms (see GOstats or GSEA).
rrvgo does not work with genes, but with GO terms. The input is a vector of enriched GO terms, along with (recommended, but not mandatory) a vector of scores. If scores are not provided, rrvgo takes the GO term (set) size as a score, thus favoring broader terms.
The first step is to get the similarity matrix between terms. The function calculateSimMatrix takes a list of GO terms for which the semantic similarity is to be calculated, an OrgDb object for an organism, the ontology of interest and the method to calculate the similarity scores.
library(rrvgo)
go_analysis <- read.delim(system.file("extdata/example.txt", package="rrvgo"))
simMatrix <- calculateSimMatrix(go_analysis$ID,
                                orgdb="org.Hs.eg.db",
                                ont="BP",
                                method="Rel")
The semdata parameter (see ?calculateSimMatrix) is not mandatory, as it is calculated on demand. If the function needs to run several times with the same organism, it’s advisable to save the GOSemSim::godata(orgdb, ont=ont) object in order to reuse it between calls and speed up the calculation of the similarity matrix.
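As a minimal sketch of that advice, the semantic data can be computed once and passed explicitly through the semdata argument. The variable names semdata and simMatrix are only illustrative, and the example assumes that rrvgo, GOSemSim and org.Hs.eg.db are installed.

```r
library(rrvgo)

# Example enrichment results shipped with rrvgo
go_analysis <- read.delim(system.file("extdata/example.txt", package="rrvgo"))

# Compute the semantic data once for this organism and ontology...
semdata <- GOSemSim::godata("org.Hs.eg.db", ont="BP")

# ...and reuse it in every call instead of letting it be recalculated on demand
simMatrix <- calculateSimMatrix(go_analysis$ID,
                                orgdb="org.Hs.eg.db",
                                ont="BP",
                                method="Rel",
                                semdata=semdata)
```

The same semdata object can be passed to later calls with other term lists, as long as the organism and ontology stay the same.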
From the similarity matrix one can group terms based on similarity. rrvgo provides the reduceSimMatrix function for that. It takes as arguments i) the similarity matrix, ii) an optional named vector of scores associated to each GO term, iii) a similarity threshold used for grouping terms, and iv) an orgdb object.
scores <- setNames(-log10(go_analysis$qvalue), go_analysis$ID)
reducedTerms <- reduceSimMatrix(simMatrix,
                                scores,
                                threshold=0.7,
                                orgdb="org.Hs.eg.db")
reduceSimMatrix groups terms according to the similarity threshold, and selects as the group representative the term with the highest score within the group. In case the vector of scores is not available, reduceSimMatrix can either use the uniqueness of a term (default), or the GO term size. In the case of size, rrvgo will fetch the GO term size from the OrgDb object and use it as the score, thus favoring broader terms. Please note that scores are interpreted in the direction that higher is better; therefore, if you use p-values as scores, apply a minus log transformation to them first.
NOTE: rrvgo uses the similarity between pairs of terms to compute a distance matrix, defined as (1-simMatrix). The terms are then hierarchically clustered using complete linkage, and the tree is cut at the desired threshold, picking the term with the highest score as the representative of each group.
Therefore, higher thresholds lead to fewer groups, and the threshold
should be read as the minimum similarity between group
representatives.
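The procedure described in the note can be sketched in base R on a toy example. This is only an illustration of the idea, not rrvgo's actual source code: the term names, similarity values, scores and the helper group_terms are all invented for the example, and it assumes the tree is cut at a height equal to the threshold.

```r
# Toy symmetric similarity matrix for five hypothetical terms (values invented)
terms <- c("termA", "termB", "termC", "termD", "termE")
sim <- matrix(c(1.0, 0.9, 0.8, 0.2, 0.1,
                0.9, 1.0, 0.7, 0.3, 0.2,
                0.8, 0.7, 1.0, 0.2, 0.1,
                0.2, 0.3, 0.2, 1.0, 0.6,
                0.1, 0.2, 0.1, 0.6, 1.0),
              nrow = 5, dimnames = list(terms, terms))

# Hypothetical scores, where higher is better
scores <- c(termA = 2, termB = 5, termC = 1, termD = 3, termE = 4)

# Hypothetical helper: cluster on the distance (1 - similarity) with complete
# linkage, cut the tree at `threshold`, and pick the top-scoring term per group
group_terms <- function(sim, scores, threshold) {
  tree    <- hclust(as.dist(1 - sim), method = "complete")
  cluster <- cutree(tree, h = threshold)
  reps    <- tapply(names(cluster), cluster,
                    function(ids) ids[which.max(scores[ids])])
  data.frame(term    = names(cluster),
             cluster = unname(cluster),
             parent  = unname(reps[as.character(cluster)]))
}

groups <- group_terms(sim, scores, threshold = 0.5)
print(groups)
```

In this toy data, raising the threshold merges more terms into the same group, which matches the statement that higher thresholds lead to fewer groups.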
rrvgo provides several methods for plotting and interpreting the results.
Plot the similarity matrix as a heatmap, with clustering of columns and rows turned on by default (thus arranging similar terms together).
The function internally uses pheatmap, and further parameters can be passed to this function.
Plot GO terms as scattered points. Distances between points represent the similarity between terms, and the axes are the first 2 components of applying a PCoA to the (dis)similarity matrix. The size of each point represents the provided score or, in its absence, the number of genes the GO term contains.
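To illustrate what the axes of such a plot mean, a PCoA (classical multidimensional scaling) can be run in base R with cmdscale on a toy dissimilarity matrix. The term names and similarity values are invented, and this sketch does not claim to reproduce the exact computation done inside rrvgo.

```r
# Toy symmetric similarity matrix for five hypothetical terms (values invented)
terms <- c("termA", "termB", "termC", "termD", "termE")
sim <- matrix(c(1.0, 0.9, 0.8, 0.2, 0.1,
                0.9, 1.0, 0.7, 0.3, 0.2,
                0.8, 0.7, 1.0, 0.2, 0.1,
                0.2, 0.3, 0.2, 1.0, 0.6,
                0.1, 0.2, 0.1, 0.6, 1.0),
              nrow = 5, dimnames = list(terms, terms))

# PCoA on the dissimilarity (1 - similarity), keeping the first 2 components
coords <- cmdscale(as.dist(1 - sim), k = 2)
colnames(coords) <- c("PCoA1", "PCoA2")

# Similar terms end up close together in the two-dimensional projection
plot(coords, type = "n")
text(coords, labels = rownames(coords))
```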
Treemaps are space-filling visualizations of hierarchical structures. The terms are grouped (colored) based on their parent, and the space used by each term is proportional to its score. Treemaps can help with the interpretation of the summarized results and also with comparing different sets of GO terms.
The function internally uses treemap, and further parameters can be passed to this function.
Word clouds are visualizations that reproduce a text while emphasizing the words that appear frequently in it. They can help to identify processes and functions that occur more commonly in a set of enriched GO terms, as well as to compare different sets.
The function internally uses wordcloud, and further parameters can be passed to this function.
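The four plots described above can be produced along the following lines. The function names heatmapPlot, scatterPlot, treemapPlot and wordcloudPlot, and the arguments shown, are taken from the rrvgo package as I understand its exported interface and do not appear in this text, so please check them against the package help pages. The example assumes rrvgo and org.Hs.eg.db are installed.

```r
library(rrvgo)

# Rebuild the objects used throughout this document
go_analysis <- read.delim(system.file("extdata/example.txt", package="rrvgo"))
simMatrix <- calculateSimMatrix(go_analysis$ID,
                                orgdb="org.Hs.eg.db",
                                ont="BP",
                                method="Rel")
scores <- setNames(-log10(go_analysis$qvalue), go_analysis$ID)
reducedTerms <- reduceSimMatrix(simMatrix,
                                scores,
                                threshold=0.7,
                                orgdb="org.Hs.eg.db")

# Heatmap of the similarity matrix; extra arguments are passed on to pheatmap
heatmapPlot(simMatrix,
            reducedTerms,
            annotateParent=TRUE,
            annotationLabel="parentTerm",
            fontsize=6)

# Scatter plot of the terms in the first two components
scatterPlot(simMatrix, reducedTerms)

# Treemap of the terms grouped by their parent
treemapPlot(reducedTerms)

# Word cloud; extra arguments are passed on to wordcloud
wordcloudPlot(reducedTerms, min.freq=1, colors="black")
```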
The available similarity measures are those implemented in the GOSemSim package, namely the Resnik, Lin, Relevance, Jiang and Wang methods. See the Semantic Similarity Measurement Based on GO section of the GOSemSim documentation for more details.
Bioconductor currently provides OrgDb objects for 20 species, available in the following packages:
Package | Organism |
---|---|
org.Ag.eg.db | Anopheles |
org.At.tair.db | Arabidopsis |
org.Bt.eg.db | Bovine |
org.Ce.eg.db | Worm |
org.Cf.eg.db | Canine |
org.Dm.eg.db | Fly |
org.Dr.eg.db | Zebrafish |
org.EcK12.eg.db | E coli strain K12 |
org.EcSakai.eg.db | E coli strain Sakai |
org.Gg.eg.db | Chicken |
org.Hs.eg.db | Human |
org.Mm.eg.db | Mouse |
org.Mmu.eg.db | Rhesus |
org.Mxanthus.db | Myxococcus xanthus DK 1622 |
org.Pf.plasmo.db | Malaria |
org.Pt.eg.db | Chimp |
org.Rn.eg.db | Rat |
org.Sc.sgd.db | Yeast |
org.Ss.eg.db | Pig |
org.Xl.eg.db | Xenopus |
If the organism is not supported in Bioconductor, you can still build your own OrgDb object using the AnnotationForge package, and render the necessary data for semantic similarity using GOSemSim::godata from the GOSemSim package.
One of Biological Process (BP), Molecular Function (MF) or Cellular Component (CC).
Taken as is from the DOSE package, which was derived from the R package breastCancerMAINZ. It contains 200 samples with breast cancer at different grades (I, II and III). The dataset basically contains log2 ratios of the geometric means of grade III vs. grade I samples (34 vs. 29, respectively).
Please consider citing rrvgo if used in support of your own research:
## To cite package 'rrvgo' in publications use:
##
## Sayols, S (2023). rrvgo: a Bioconductor package for interpreting
## lists of Gene Ontology terms. microPublication Biology.
## 10.17912/micropub.biology.000811
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {rrvgo: a Bioconductor package to reduce and visualize Gene Ontology terms},
## author = {Sergi Sayols},
## year = {2023},
## journal = {microPublication Biology},
## doi = {10.17912/micropub.biology.000811},
## url = {https://www.micropublication.org/journals/biology/micropub-biology-000811},
## }
If you run into problems using rrvgo, the Bioconductor Support site is a good first place to ask for help. If you think there is a bug or an unreported feature, you can report it using the rrvgo GitHub site.
The following packages and versions were used in the production of this vignette.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rrvgo_1.19.0 knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 gridBase_0.4-7 farver_2.1.2
## [4] dplyr_1.1.4 blob_1.2.4 R.utils_2.12.3
## [7] Biostrings_2.75.1 fastmap_1.2.0 treemap_2.4-4
## [10] promises_1.3.2 digest_0.6.37 mime_0.12
## [13] lifecycle_1.0.4 NLP_0.3-2 KEGGREST_1.47.0
## [16] RSQLite_2.3.8 magrittr_2.0.3 compiler_4.4.2
## [19] rlang_1.1.4 sass_0.4.9 tools_4.4.2
## [22] wordcloud_2.6 igraph_2.1.1 utf8_1.2.4
## [25] yaml_2.3.10 data.table_1.16.2 labeling_0.4.3
## [28] askpass_1.2.1 bit_4.5.0 reticulate_1.40.0
## [31] xml2_1.3.6 RColorBrewer_1.1-3 withr_3.0.2
## [34] BiocGenerics_0.53.3 sys_3.4.3 R.oo_1.27.0
## [37] grid_4.4.2 stats4_4.4.2 fansi_1.0.6
## [40] GOSemSim_2.33.0 xtable_1.8-4 tm_0.7-15
## [43] colorspace_2.1-1 GO.db_3.20.0 ggplot2_3.5.1
## [46] scales_1.3.0 cli_3.6.3 rmarkdown_2.29
## [49] crayon_1.5.3 generics_0.1.3 umap_0.2.10.0
## [52] RSpectra_0.16-2 httr_1.4.7 DBI_1.2.3
## [55] cachem_1.1.0 zlibbioc_1.52.0 parallel_4.4.2
## [58] AnnotationDbi_1.69.0 BiocManager_1.30.25 XVector_0.47.0
## [61] yulab.utils_0.1.8 vctrs_0.6.5 Matrix_1.7-1
## [64] jsonlite_1.8.9 slam_0.1-55 IRanges_2.41.1
## [67] S4Vectors_0.45.2 bit64_4.5.2 ggrepel_0.9.6
## [70] maketools_1.3.1 jquerylib_0.1.4 glue_1.8.0
## [73] codetools_0.2-20 gtable_0.3.6 later_1.4.1
## [76] GenomeInfoDb_1.43.2 UCSC.utils_1.3.0 munsell_0.5.1
## [79] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
## [82] openssl_2.2.2 GenomeInfoDbData_1.2.13 R6_2.5.1
## [85] lattice_0.22-6 evaluate_1.0.1 shiny_1.9.1
## [88] Biobase_2.67.0 R.methodsS3_1.8.2 png_0.1-8
## [91] pheatmap_1.0.12 memoise_2.0.1 httpuv_1.6.15
## [94] bslib_0.8.0 Rcpp_1.0.13-1 org.Hs.eg.db_3.20.0
## [97] xfun_0.49 fs_1.6.5 buildtools_1.0.0
## [100] pkgconfig_2.0.3