Functional enrichment analysis determines if specific biological functions or pathways are overrepresented in a set of features (e.g., genes, proteins). These sets often originate from differential expression analysis, while the functions and pathways are derived from databases such as GO, Reactome, or KEGG. In its simplest form, enrichment analysis employs Fisher’s test to evaluate if a given function is enriched in the selection. The null hypothesis asserts that the proportion of features annotated with that function is the same between selected and non-selected features.
Functional enrichment analysis requires downloading large datasets from the aforementioned databases before conducting the actual analysis. While downloading data is time-consuming, Fisher’s test can be performed rapidly. This package aims to separate these two steps, enabling fast enrichment analysis for various feature selections using a given database. It is specifically designed for interactive applications like Shiny. A small Shiny app, included in the package, demonstrates the usage of fenr.
Functional enrichment analysis should not be considered the ultimate answer in understanding biological systems. In many instances, it may not provide clear insights into biology. Specifically, when arbitrary groups of genes are selected, enrichment analysis only reveals the statistical overrepresentation of a functional term within the selection, which may not directly correspond to biological relevance. This package serves as a tool for data exploration; any conclusions drawn about biology require independent validation and further investigation.
fenr can be installed from Bioconductor by using:
The input data for fenr consists of two tibbles or data
frames (see figure below). One tibble contains functional term
descriptions, while the other provides the term-to-feature mapping.
Users can acquire these datasets using the convenient helper functions
like fetch_*
, which facilitate data retrieval from GO,
KEGG, BioPlanet, or WikiPathways. Alternatively, users have the option
to provide their own datasets. These tibbles are subsequently
transformed into a fenr_terms object, specifically designed for
efficient data retrieval and access. The resulting object is then
employed by the functional_enrichment function to perform enrichment
analysis for any chosen feature selection.
Package fenr and example data are loaded with the following commands:
The initial step involves downloading functional term data.
fenr supports data downloads from Gene Ontology,
Reactome, KEGG, BioPlanet, and
WikiPathways. Custom ontologies can also be used, provided they
are converted into an appropriate format (refer to the
prepare_for_enrichment
function for more information). The
command below downloads functional terms and gene mapping from Gene
Ontology (GO):
sgd
is a designation for yeast in Gene Ontology. The
full list of available species designations can be obtain using the
command:
which returns a tibble:
go_species
#> # A tibble: 40 × 2
#> species designation
#> <chr> <chr>
#> 1 Species/Database File
#> 2 Gallus gallus - EBI Gene Ontology Annotation Database (g… goa_chicke…
#> 3 Sus scrofa - EBI Gene Ontology Annotation Database (goa) goa_pig_is…
#> 4 Dictyostelium discoideum - dictyBase (dictyBase) dictybase
#> 5 Mus musculus - Mouse Genome Informatics (mgi) mgi
#> 6 Solanaceae - Sol Genomics Network (sgn) sgn
#> 7 Sus scrofa - EBI Gene Ontology Annotation Database (goa) goa_pig
#> 8 Gallus gallus - EBI Gene Ontology Annotation Database (g… goa_chicke…
#> 9 Danio rerio - Zebrafish Information Network (zfin) zfin
#> 10 Bos taurus - EBI Gene Ontology Annotation Database (goa) goa_cow_rna
#> # ℹ 30 more rows
go
is a list with two tibbles. The first tibble contains
term information:
go$terms
#> # A tibble: 51,467 × 2
#> term_id term_name
#> <chr> <chr>
#> 1 GO:0000001 mitochondrion inheritance
#> 2 GO:0000002 mitochondrial genome maintenance
#> 3 GO:0000003 obsolete reproduction
#> 4 GO:0000005 obsolete ribosomal chaperone activity
#> 5 GO:0000006 high-affinity zinc transmembrane transporter activity
#> 6 GO:0000007 low-affinity zinc ion transmembrane transporter activity
#> 7 GO:0000008 obsolete thioredoxin
#> 8 GO:0000009 alpha-1,6-mannosyltransferase activity
#> 9 GO:0000010 trans-hexaprenyltranstransferase activity
#> 10 GO:0000011 vacuole inheritance
#> # ℹ 51,457 more rows
The second tibble contains gene-term mapping:
go$mapping
#> # A tibble: 75,220 × 4
#> gene_symbol gene_id db_id term_id
#> <chr> <chr> <chr> <chr>
#> 1 RHO1 YPR165W S000006369 GO:0090334
#> 2 RHO1 YPR165W S000006369 GO:0008047
#> 3 ANK1 YGL242C S000003211 GO:0005737
#> 4 MCY1 YGR012W S000003244 GO:0005741
#> 5 RBS1 YDL189W S000002348 GO:0005634
#> 6 RBS1 YDL189W S000002348 GO:0048027
#> 7 PDP3 YLR455W S000004447 GO:0005634
#> 8 MCY1 YGR012W S000003244 GO:0005739
#> 9 PDP3 YLR455W S000004447 GO:0000993
#> 10 GCN4 YEL009C S000000735 GO:1990139
#> # ℹ 75,210 more rows
To make these user-friendly data more suitable for rapid functional enrichment analysis, they need to be converted into a machine-friendly object using the following function:
exmpl_all
is an example of gene background - a vector
with gene symbols related to all detections in an imaginary RNA-seq
experiment. Since different datasets use different features (gene id,
gene symbol, protein id), the column name containing features in
go$mapping
needs to be specified using
feature_name = "gene_symbol"
. The resulting object,
go_terms
, is a data structure containing all the mappings
in a quickly accessible form. From this point on, go_terms
can be employed to perform multiple functional enrichment analyses on
various gene selections.
The package includes two pre-defined gene sets.
exmpl_all
contains all background gene symbols, while
exmpl_sel
comprises the genes of interest. To perform
functional enrichment analysis on the selected genes, you can use the
following single, efficient function call:
The result of functional_enrichment
is a tibble with
enrichment results.
enr |>
head(10)
#> # A tibble: 10 × 10
#> term_id term_name N_with n_with_sel n_expect enrichment odds_ratio
#> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 GO:0031931 TORC1 co… 5 5 0.02 333 Inf
#> 2 GO:1905356 regulati… 2 2 0.01 333 Inf
#> 3 GO:0031929 TOR sign… 15 14 0.05 310 13900
#> 4 GO:0001558 regulati… 9 5 0.03 185 544
#> 5 GO:0031932 TORC2 co… 8 4 0.02 166 409
#> 6 GO:0010506 regulati… 4 2 0.01 166 366
#> 7 GO:0038202 TORC1 si… 4 2 0.01 166 366
#> 8 GO:0016242 negative… 9 3 0.03 111 193
#> 9 GO:0030950 establis… 15 4 0.05 88.7 149
#> 10 GO:0043666 regulati… 7 2 0.02 95 147
#> # ℹ 3 more variables: ids <chr>, p_value <dbl>, p_adjust <dbl>
The columns are as follows
N_with
: The number of features (genes) associated with
this term in the background of all genes.n_with_sel
: The number of features associated with this
term in the selection.n_expect
: The expected number of features associated
with this term under the null hypothesis (terms are randomly
distributed).enrichment
: The ratio of observed to expected.odds_ratio
: The effect size, represented by the odds
ratio from the contingency table.ids
: The identifiers of features with the term in the
selection.p_value
: The raw p-value from the hypergeometric
distribution.p_adjust
: The p-value adjusted for multiple tests using
the Benjamini-Hochberg approach.A small Shiny app is included in the package to demonstrate the usage of fenr in an interactive environment. All time-consuming data loading and preparation tasks are performed before the app is launched.
yeast_de
is the result of differential expression (using
edgeR
) on a subset of 6+6 replicates from Gierlinski
et al. (2015).
The function fetch_terms_for_example
uses
fetch_*
functions from fenr to download and
process data from GO, Reactome and KEGG. The
step-by-step process of data acquisition can be examined by viewing the
function by typing fetch_terms_for_example
in the R
console. The object term_data
is a named list of
fenr_terms
objects, one for each ontology.
After completing the slow tasks, you can start the Shiny app by running:
To quickly see how fenr works an example can be loaded directly from GitHub:
shiny::runGitHub("bartongroup/fenr-shiny-example")
There’s no shortage of R packages dedicated to functional enrichment analyses, including topGO, GOstats, clusterProfiler, and GOfuncR. Many of these packages provide comprehensive analyses tailored to a single gene selection. While they offer detailed insights, their extensive computational requirements often render them slower.
To illustrate, let’s take the example of the topGO package.
suppressPackageStartupMessages({
library(topGO)
})
#>
#> groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built.
# Preparing the gene set
all_genes <- setNames(rep(0, length(exmpl_all)), exmpl_all)
all_genes[exmpl_all %in% exmpl_sel] <- 1
# Define the gene selection function
gene_selection <- function(genes) {
genes > 0
}
# Mapping genes to GO, we use the go2gene downloaded above and convert it to a
# list
go2gene <- go$mapping |>
dplyr::select(gene_symbol, term_id) |>
dplyr::distinct() |>
dplyr::group_by(term_id) |>
dplyr::summarise(ids = list(gene_symbol)) |>
tibble::deframe()
# Setting up the GO data for analysis
go_data <- new(
"topGOdata",
ontology = "BP",
allGenes = all_genes,
geneSel = gene_selection,
annot = annFUN.GO2genes,
GO2genes = go2gene
)
#>
#> Building most specific GOs .....
#> ( 2995 GO terms found. )
#>
#> Build GO DAG topology ..........
#> ( 4750 GO terms and 10173 relations. )
#>
#> Annotating nodes ...............
#> ( 6772 genes annotated to the GO terms. )
# Performing the enrichment test
fisher_test <- runTest(go_data, algorithm = "classic", statistic = "fisher")
#>
#> -- Classic Algorithm --
#>
#> the algorithm is scoring 366 nontrivial nodes
#> parameters:
#> test statistic: fisher
Though the enrichment test (runTest
) is fairly fast,
completing in about 200 ms on an Intel Core i7 processor, the
preparatory step (new
call) demands a more substantial time
commitment, around 6 seconds. Considering each additional gene selection
requires a similar amount of time, conducting exploratory interactive
analyses on multiple gene sets becomes notably cumbersome.
fenr separates preparatory steps and gene selection and is designed with interactivity and speed in mind, particularly when users wish to execute rapid, multiple selection tests.
# Setting up data for enrichment
go <- fetch_go(species = "sgd")
go_terms <- prepare_for_enrichment(go$terms, go$mapping, exmpl_all,
feature_name = "gene_symbol")
# Executing the enrichment test
enr <- functional_enrichment(exmpl_all, exmpl_sel, go_terms)
While data preparation with fenr might take a bit longer (around 10 s), the enrichment test is fast, finishing in just about 50 ms. Each new gene selection takes a similar amount of time (with an obvious caveat that a larger gene selection would take more time to calculate).
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets
#> [7] methods base
#>
#> other attached packages:
#> [1] topGO_2.58.0 SparseM_1.84-2 GO.db_3.20.0
#> [4] AnnotationDbi_1.69.0 IRanges_2.41.0 S4Vectors_0.44.0
#> [7] Biobase_2.67.0 graph_1.85.0 BiocGenerics_0.53.1
#> [10] generics_0.1.3 tibble_3.2.1 fenr_1.5.0
#> [13] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4
#> [3] lattice_0.22-6 RSQLite_2.3.7
#> [5] digest_0.6.37 magrittr_2.0.3
#> [7] grid_4.4.1 evaluate_1.0.1
#> [9] fastmap_1.2.0 blob_1.2.4
#> [11] jsonlite_1.8.9 GenomeInfoDb_1.43.0
#> [13] DBI_1.2.3 BiocManager_1.30.25
#> [15] httr_1.4.7 purrr_1.0.2
#> [17] fansi_1.0.6 UCSC.utils_1.2.0
#> [19] Biostrings_2.75.0 jquerylib_0.1.4
#> [21] cli_3.6.3 crayon_1.5.3
#> [23] rlang_1.1.4 XVector_0.46.0
#> [25] bit64_4.5.2 withr_3.0.2
#> [27] cachem_1.1.0 yaml_2.3.10
#> [29] tools_4.4.1 memoise_2.0.1
#> [31] dplyr_1.1.4 GenomeInfoDbData_1.2.13
#> [33] assertthat_0.2.1 buildtools_1.0.0
#> [35] vctrs_0.6.5 R6_2.5.1
#> [37] png_0.1-8 matrixStats_1.4.1
#> [39] lifecycle_1.0.4 zlibbioc_1.52.0
#> [41] KEGGREST_1.47.0 bit_4.5.0
#> [43] pkgconfig_2.0.3 pillar_1.9.0
#> [45] bslib_0.8.0 glue_1.8.0
#> [47] xfun_0.48 tidyselect_1.2.1
#> [49] highr_0.11 sys_3.4.3
#> [51] knitr_1.48 htmltools_0.5.8.1
#> [53] rmarkdown_2.28 maketools_1.3.1
#> [55] compiler_4.4.1