Fast functional enrichment

Purpose

Functional enrichment analysis determines if specific biological functions or pathways are overrepresented in a set of features (e.g., genes, proteins). These sets often originate from differential expression analysis, while the functions and pathways are derived from databases such as GO, Reactome, or KEGG. In its simplest form, enrichment analysis employs Fisher’s test to evaluate if a given function is enriched in the selection. The null hypothesis asserts that the proportion of features annotated with that function is the same between selected and non-selected features.

Functional enrichment analysis requires downloading large datasets from the aforementioned databases before conducting the actual analysis. While downloading data is time-consuming, Fisher’s test can be performed rapidly. This package aims to separate these two steps, enabling fast enrichment analysis for various feature selections using a given database. It is specifically designed for interactive applications like Shiny. A small Shiny app, included in the package, demonstrates the usage of fenr.

Caveats

Functional enrichment analysis should not be considered the ultimate answer in understanding biological systems. In many instances, it may not provide clear insights into biology. Specifically, when arbitrary groups of genes are selected, enrichment analysis only reveals the statistical overrepresentation of a functional term within the selection, which may not directly correspond to biological relevance. This package serves as a tool for data exploration; any conclusions drawn about biology require independent validation and further investigation.

Installation

fenr can be installed from Bioconductor by using:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("fenr")

Overview

The input data for fenr consists of two tibbles or data frames (see figure below). One tibble contains functional term descriptions, while the other provides the term-to-feature mapping. Users can acquire these datasets using the convenient helper functions like fetch_*, which facilitate data retrieval from GO, KEGG, BioPlanet, or WikiPathways. Alternatively, users have the option to provide their own datasets. These tibbles are subsequently transformed into a fenr_terms object, specifically designed for efficient data retrieval and access. The resulting object is then employed by the functional_enrichment function to perform enrichment analysis for any chosen feature selection.

Overview of fenr data pipeline.

Overview of fenr data pipeline.

Example

Package fenr and example data are loaded with the following commands:

library(fenr)
data(exmpl_all, exmpl_sel)

Data preparation

The initial step involves downloading functional term data. fenr supports data downloads from Gene Ontology, Reactome, KEGG, BioPlanet, and WikiPathways. Custom ontologies can also be used, provided they are converted into an appropriate format (refer to the prepare_for_enrichment function for more information). The command below downloads functional terms and gene mapping from Gene Ontology (GO):

go <- fetch_go(species = "sgd")

sgd is a designation for yeast in Gene Ontology. The full list of available species designations can be obtain using the command:

go_species <- fetch_go_species()

which returns a tibble:

go_species
#> # A tibble: 40 × 2
#>    species                                                   designation
#>    <chr>                                                     <chr>      
#>  1 Species/Database                                          File       
#>  2 Gallus gallus - EBI Gene Ontology Annotation Database (g… goa_chicke…
#>  3 Sus scrofa - EBI Gene Ontology Annotation Database (goa)  goa_pig_is…
#>  4 Dictyostelium discoideum - dictyBase (dictyBase)          dictybase  
#>  5 Mus musculus - Mouse Genome Informatics (mgi)             mgi        
#>  6 Solanaceae - Sol Genomics Network (sgn)                   sgn        
#>  7 Sus scrofa - EBI Gene Ontology Annotation Database (goa)  goa_pig    
#>  8 Gallus gallus - EBI Gene Ontology Annotation Database (g… goa_chicke…
#>  9 Danio rerio - Zebrafish Information Network (zfin)        zfin       
#> 10 Bos taurus - EBI Gene Ontology Annotation Database (goa)  goa_cow_rna
#> # ℹ 30 more rows

go is a list with two tibbles. The first tibble contains term information:

go$terms
#> # A tibble: 51,467 × 2
#>    term_id    term_name                                               
#>    <chr>      <chr>                                                   
#>  1 GO:0000001 mitochondrion inheritance                               
#>  2 GO:0000002 mitochondrial genome maintenance                        
#>  3 GO:0000003 obsolete reproduction                                   
#>  4 GO:0000005 obsolete ribosomal chaperone activity                   
#>  5 GO:0000006 high-affinity zinc transmembrane transporter activity   
#>  6 GO:0000007 low-affinity zinc ion transmembrane transporter activity
#>  7 GO:0000008 obsolete thioredoxin                                    
#>  8 GO:0000009 alpha-1,6-mannosyltransferase activity                  
#>  9 GO:0000010 trans-hexaprenyltranstransferase activity               
#> 10 GO:0000011 vacuole inheritance                                     
#> # ℹ 51,457 more rows

The second tibble contains gene-term mapping:

go$mapping
#> # A tibble: 75,220 × 4
#>    gene_symbol gene_id db_id      term_id   
#>    <chr>       <chr>   <chr>      <chr>     
#>  1 RHO1        YPR165W S000006369 GO:0090334
#>  2 RHO1        YPR165W S000006369 GO:0008047
#>  3 ANK1        YGL242C S000003211 GO:0005737
#>  4 MCY1        YGR012W S000003244 GO:0005741
#>  5 RBS1        YDL189W S000002348 GO:0005634
#>  6 RBS1        YDL189W S000002348 GO:0048027
#>  7 PDP3        YLR455W S000004447 GO:0005634
#>  8 MCY1        YGR012W S000003244 GO:0005739
#>  9 PDP3        YLR455W S000004447 GO:0000993
#> 10 GCN4        YEL009C S000000735 GO:1990139
#> # ℹ 75,210 more rows

To make these user-friendly data more suitable for rapid functional enrichment analysis, they need to be converted into a machine-friendly object using the following function:

go_terms <- prepare_for_enrichment(go$terms, go$mapping, exmpl_all,
                                   feature_name = "gene_symbol")

exmpl_all is an example of gene background - a vector with gene symbols related to all detections in an imaginary RNA-seq experiment. Since different datasets use different features (gene id, gene symbol, protein id), the column name containing features in go$mapping needs to be specified using feature_name = "gene_symbol". The resulting object, go_terms, is a data structure containing all the mappings in a quickly accessible form. From this point on, go_terms can be employed to perform multiple functional enrichment analyses on various gene selections.

Functional enrichment

The package includes two pre-defined gene sets. exmpl_all contains all background gene symbols, while exmpl_sel comprises the genes of interest. To perform functional enrichment analysis on the selected genes, you can use the following single, efficient function call:

enr <- functional_enrichment(exmpl_all, exmpl_sel, go_terms)

The output

The result of functional_enrichment is a tibble with enrichment results.

enr |>
  head(10)
#> # A tibble: 10 × 10
#>    term_id    term_name N_with n_with_sel n_expect enrichment odds_ratio
#>    <chr>      <chr>      <int>      <int>    <dbl>      <dbl>      <dbl>
#>  1 GO:0031931 TORC1 co…      5          5     0.02      333          Inf
#>  2 GO:1905356 regulati…      2          2     0.01      333          Inf
#>  3 GO:0031929 TOR sign…     15         14     0.05      310        13900
#>  4 GO:0001558 regulati…      9          5     0.03      185          544
#>  5 GO:0031932 TORC2 co…      8          4     0.02      166          409
#>  6 GO:0010506 regulati…      4          2     0.01      166          366
#>  7 GO:0038202 TORC1 si…      4          2     0.01      166          366
#>  8 GO:0016242 negative…      9          3     0.03      111          193
#>  9 GO:0030950 establis…     15          4     0.05       88.7        149
#> 10 GO:0043666 regulati…      7          2     0.02       95          147
#> # ℹ 3 more variables: ids <chr>, p_value <dbl>, p_adjust <dbl>

The columns are as follows

  • N_with: The number of features (genes) associated with this term in the background of all genes.
  • n_with_sel: The number of features associated with this term in the selection.
  • n_expect: The expected number of features associated with this term under the null hypothesis (terms are randomly distributed).
  • enrichment: The ratio of observed to expected.
  • odds_ratio: The effect size, represented by the odds ratio from the contingency table.
  • ids: The identifiers of features with the term in the selection.
  • p_value: The raw p-value from the hypergeometric distribution.
  • p_adjust: The p-value adjusted for multiple tests using the Benjamini-Hochberg approach.

Interactive Example

A small Shiny app is included in the package to demonstrate the usage of fenr in an interactive environment. All time-consuming data loading and preparation tasks are performed before the app is launched.

data(yeast_de)
term_data <- fetch_terms_for_example(yeast_de)

yeast_de is the result of differential expression (using edgeR) on a subset of 6+6 replicates from Gierlinski et al. (2015).

The function fetch_terms_for_example uses fetch_* functions from fenr to download and process data from GO, Reactome and KEGG. The step-by-step process of data acquisition can be examined by viewing the function by typing fetch_terms_for_example in the R console. The object term_data is a named list of fenr_terms objects, one for each ontology.

After completing the slow tasks, you can start the Shiny app by running:

enrichment_interactive(yeast_de, term_data)

To quickly see how fenr works an example can be loaded directly from GitHub:

shiny::runGitHub("bartongroup/fenr-shiny-example")

Comparison to other functional enrichment packages

There’s no shortage of R packages dedicated to functional enrichment analyses, including topGO, GOstats, clusterProfiler, and GOfuncR. Many of these packages provide comprehensive analyses tailored to a single gene selection. While they offer detailed insights, their extensive computational requirements often render them slower.

To illustrate, let’s take the example of the topGO package.

suppressPackageStartupMessages({
  library(topGO)
})
#> 
#> groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.

# Preparing the gene set
all_genes <- setNames(rep(0, length(exmpl_all)), exmpl_all)
all_genes[exmpl_all %in% exmpl_sel] <- 1

# Define the gene selection function
gene_selection <- function(genes) {
  genes > 0
}

# Mapping genes to GO, we use the go2gene downloaded above and convert it to a
# list
go2gene <- go$mapping |> 
  dplyr::select(gene_symbol, term_id) |> 
  dplyr::distinct() |> 
  dplyr::group_by(term_id) |> 
  dplyr::summarise(ids = list(gene_symbol)) |> 
  tibble::deframe()

# Setting up the GO data for analysis
go_data <- new(
  "topGOdata",
  ontology = "BP",
  allGenes = all_genes,
  geneSel = gene_selection,
  annot = annFUN.GO2genes,
  GO2genes = go2gene
)
#> 
#> Building most specific GOs .....
#>  ( 2995 GO terms found. )
#> 
#> Build GO DAG topology ..........
#>  ( 4750 GO terms and 10173 relations. )
#> 
#> Annotating nodes ...............
#>  ( 6772 genes annotated to the GO terms. )

# Performing the enrichment test
fisher_test <- runTest(go_data, algorithm = "classic", statistic = "fisher")
#> 
#>           -- Classic Algorithm -- 
#> 
#>       the algorithm is scoring 366 nontrivial nodes
#>       parameters: 
#>           test statistic: fisher

Though the enrichment test (runTest) is fairly fast, completing in about 200 ms on an Intel Core i7 processor, the preparatory step (new call) demands a more substantial time commitment, around 6 seconds. Considering each additional gene selection requires a similar amount of time, conducting exploratory interactive analyses on multiple gene sets becomes notably cumbersome.

fenr separates preparatory steps and gene selection and is designed with interactivity and speed in mind, particularly when users wish to execute rapid, multiple selection tests.

# Setting up data for enrichment
go <- fetch_go(species = "sgd")
go_terms <- prepare_for_enrichment(go$terms, go$mapping, exmpl_all,
                                   feature_name = "gene_symbol")

# Executing the enrichment test
enr <- functional_enrichment(exmpl_all, exmpl_sel, go_terms)

While data preparation with fenr might take a bit longer (around 10 s), the enrichment test is fast, finishing in just about 50 ms. Each new gene selection takes a similar amount of time (with an obvious caveat that a larger gene selection would take more time to calculate).

Session info

#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets 
#> [7] methods   base     
#> 
#> other attached packages:
#>  [1] topGO_2.58.0         SparseM_1.84-2       GO.db_3.20.0        
#>  [4] AnnotationDbi_1.69.0 IRanges_2.41.0       S4Vectors_0.44.0    
#>  [7] Biobase_2.67.0       graph_1.85.0         BiocGenerics_0.53.1 
#> [10] generics_0.1.3       tibble_3.2.1         fenr_1.5.0          
#> [13] BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              utf8_1.2.4             
#>  [3] lattice_0.22-6          RSQLite_2.3.7          
#>  [5] digest_0.6.37           magrittr_2.0.3         
#>  [7] grid_4.4.1              evaluate_1.0.1         
#>  [9] fastmap_1.2.0           blob_1.2.4             
#> [11] jsonlite_1.8.9          GenomeInfoDb_1.43.0    
#> [13] DBI_1.2.3               BiocManager_1.30.25    
#> [15] httr_1.4.7              purrr_1.0.2            
#> [17] fansi_1.0.6             UCSC.utils_1.2.0       
#> [19] Biostrings_2.75.0       jquerylib_0.1.4        
#> [21] cli_3.6.3               crayon_1.5.3           
#> [23] rlang_1.1.4             XVector_0.46.0         
#> [25] bit64_4.5.2             withr_3.0.2            
#> [27] cachem_1.1.0            yaml_2.3.10            
#> [29] tools_4.4.1             memoise_2.0.1          
#> [31] dplyr_1.1.4             GenomeInfoDbData_1.2.13
#> [33] assertthat_0.2.1        buildtools_1.0.0       
#> [35] vctrs_0.6.5             R6_2.5.1               
#> [37] png_0.1-8               matrixStats_1.4.1      
#> [39] lifecycle_1.0.4         zlibbioc_1.52.0        
#> [41] KEGGREST_1.47.0         bit_4.5.0              
#> [43] pkgconfig_2.0.3         pillar_1.9.0           
#> [45] bslib_0.8.0             glue_1.8.0             
#> [47] xfun_0.48               tidyselect_1.2.1       
#> [49] highr_0.11              sys_3.4.3              
#> [51] knitr_1.48              htmltools_0.5.8.1      
#> [53] rmarkdown_2.28          maketools_1.3.1        
#> [55] compiler_4.4.1