The GeDi User’s Guide

Compiled date: 2024-10-30

Last edited: 2024-02-29

License: MIT + file LICENSE


Introduction

This vignette introduces the usage of the GeDi package for exploring the results of functional annotation and enrichment analyses.

GeDi is a versatile package designed to simplify the exploration and comprehension of functional annotation and enrichment analysis results. It offers a shiny application that combines interactivity, visualization, and reproducibility to consolidate comprehensive outcomes.

To incorporate GeDi into your workflow, you’ll need the results of a functional annotation or enrichment analysis. This vignette demonstrates the core functionalities of GeDi using a publicly available dataset from Alasoo et al., as described in their paper “Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response” (Alasoo et al. 2018).

Accessible through the macrophage Bioconductor package, this dataset comprises files generated from Salmon quantification (version 0.12.0, with Gencode v29 reference) and gene-level summarized values.

Within the macrophage experimental setup, samples derive from six different donors under four distinct conditions: naive, treated with Interferon gamma, with SL1344, or with a combination of Interferon gamma and SL1344. For illustration, we will focus on comparing Interferon gamma-treated samples with naive samples.

Getting started

Before you can start using GeDi, the package needs to be installed on your machine. To install the package, begin by opening R and executing the following command:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("GeDi")

Once installed, the package can be loaded and attached to your current workspace as follows:

library("GeDi")

With the package attached, you can start the application by running GeDi().

GeDi()

This action will open the application, directing you to the Welcome page. From there, you can easily provide your data using the Data Input panel on the left side menu, ensuring it’s in the correct format for analysis.

Alternatively, you can initiate the application by executing:

GeDi(
  genesets = geneset_df,
  ppi = ppi_df,
  distance_scores = distance_scores_df
)

where

  • geneset_df represents your input data in the form of a data.frame, which should include at least one column named “Genesets” containing geneset identifiers and one column named “Genes” containing a comma-separated list of genes belonging to each respective geneset.
  • ppi_df is another data.frame containing protein-protein interaction scores, with columns named “from”, “to”, and “combined_score”.
  • distance_scores_df is a sparse Matrix containing the distance scores of the genesets in your data.
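As a rough illustration of these three inputs, the following sketch builds toy objects in the shapes described above. All identifiers and values are invented for illustration (they are not taken from the macrophage data), and the symmetric layout of the distance matrix is an assumption.

```r
# Toy inputs in the shapes described above; all values are made up.
library("Matrix")

# Genesets: one identifier column and one comma-separated gene column
geneset_df <- data.frame(
  Genesets = c("GS1", "GS2", "GS3"),
  Genes = c("STAT1,IRF1,GBP1", "IRF1,GBP1,CXCL10", "TNF,IL6"),
  stringsAsFactors = FALSE
)

# PPI: one row per interacting pair with a score between 0 and 1
ppi_df <- data.frame(
  from = c("STAT1", "IRF1", "TNF"),
  to = c("IRF1", "GBP1", "IL6"),
  combined_score = c(0.9, 0.7, 0.8),
  stringsAsFactors = FALSE
)

# Distance scores: a sparse genesets x genesets matrix (assumed symmetric)
distance_scores_df <- Matrix(
  c(0.0, 0.5, 1.0,
    0.5, 0.0, 1.0,
    1.0, 1.0, 0.0),
  nrow = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(geneset_df$Genesets, geneset_df$Genesets)
)
```

Objects of this kind could then be passed to the call shown above, GeDi(genesets = geneset_df, ppi = ppi_df, distance_scores = distance_scores_df).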

All of these parameters are optional, as you can alternatively upload, download, and compute them directly within the application. However, some of these processes may require a significant amount of time, especially with larger datasets. Therefore, it may be advantageous to save the intermediate results, such as the downloaded PPI and computed distance scores, for later use within the application.
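One possible way to keep such intermediate results between sessions is base R serialization. The sketch below uses a toy PPI data.frame and a temporary file; in practice the object would be the downloaded PPI or the computed distance scores, and the file path is a placeholder chosen for this example.

```r
# Toy PPI table standing in for a downloaded one (values are made up)
ppi_df <- data.frame(
  from = c("STAT1", "IRF1"),
  to = c("IRF1", "GBP1"),
  combined_score = c(0.9, 0.7)
)

# Save the object to disk once ...
ppi_file <- file.path(tempdir(), "ppi_df.rds")  # placeholder path
saveRDS(ppi_df, ppi_file)

# ... and restore it in a later session instead of downloading again
ppi_restored <- readRDS(ppi_file)
identical(ppi_df, ppi_restored)
```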

In this vignette, we demonstrate the functionality of GeDi using enrichment analysis results from the macrophage dataset. To immediately start exploring the application, you can simply execute:

GeDi()

and load the example data with the Load example data button in the Data Input panel.

Alternatively, you can proceed by following the subsequent code chunks to create the necessary input objects, step by step. This can serve as a reference guide for the steps ideally executed prior to analyzing the data with GeDi.

To utilize GeDi, you’ll require results from a functional annotation analysis. In this vignette, we’ll demonstrate how to conduct an enrichment analysis on differentially expressed (DE) genes from the macrophage dataset.

First, we’ll load the macrophage data and create a DESeqDataSet, as the subsequent differential expression analysis will be performed using DESeq2 (Love, Huber, and Anders 2014).

# Load required libraries
library("macrophage")
library("DESeq2")

# Load the example dataset "gse" from the "macrophage" package
data("gse", package = "macrophage")

# Create a DESeqDataSet object using the "gse" dataset and define the 
# experimental design.
# We use the condition as part of the experimental design, because we are 
# interested in the differentially expressed genes between treatments. We also 
# add the line to the design to account for the inherent differences between 
# the donors.
dds_macrophage <- DESeqDataSet(gse, design = ~ line + condition)

# Change the row names of the DESeqDataSet object to Ensembl IDs
rownames(dds_macrophage) <- gsub("\\..*", "", rownames(dds_macrophage))

# Have a look at the resulting DESeqDataSet object
dds_macrophage
#> class: DESeqDataSet 
#> dim: 58294 24 
#> metadata(7): tximetaInfo quantInfo ... txdbInfo version
#> assays(3): counts abundance avgTxLength
#> rownames(58294): ENSG00000000003 ENSG00000000005 ... ENSG00000285993 ENSG00000285994
#> rowData names(2): gene_id SYMBOL
#> colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308 SAMEA103884949
#> colData names(15): names sample_id ... condition line

Now that we’ve obtained our DESeqDataSet, we can conduct the differential expression (DE) analysis. In this vignette, we’ll utilize the results from comparing two distinct conditions of the dataset, specifically IFNg and naive, while accounting for the cell line of origin.

Before executing the DE analysis, we’ll filter out lowly expressed features from the dataset. In this instance, we’ll keep only genes with at least 10 counts in at least 6 samples, where 6 corresponds to the smallest group size in the dataset.

Subsequently, we’ll conduct the DE analysis and assess against a null hypothesis of a log2FoldChange of 1 to ensure that we identify genes with consistent and robust changes in expression.

Finally, we’ll append the gene symbols to the resultant DataFrame, which will later serve as our “Genes” column in the input data for GeDi.

# Filter genes based on read counts
# Flag the genes that have at least 10 counts in at least 6 samples
keep <- rowSums(counts(dds_macrophage) >= 10) >= 6

# Subset the DESeqDataSet object to keep only the selected genes
dds_macrophage <- dds_macrophage[keep, ]

# Have a look at the resulting DESeqDataSet object
dds_macrophage
#> class: DESeqDataSet 
#> dim: 17806 24 
#> metadata(7): tximetaInfo quantInfo ... txdbInfo version
#> assays(3): counts abundance avgTxLength
#> rownames(17806): ENSG00000000003 ENSG00000000419 ... ENSG00000285982 ENSG00000285994
#> rowData names(2): gene_id SYMBOL
#> colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308 SAMEA103884949
#> colData names(15): names sample_id ... condition line

# Perform differential expression analysis using DESeq2
dds_macrophage <- DESeq(dds_macrophage)

# Extract differentially expressed genes
# Perform contrast analysis comparing "IFNg" condition to "naive" condition
# Set a log2 fold change threshold of 1 and a significance level (alpha) of 0.05
res_macrophage_IFNg_vs_naive <- results(dds_macrophage,
  contrast = c("condition", "IFNg", "naive"),
  lfcThreshold = 1, alpha = 0.05
)

# Add gene symbols to the results in a column "SYMBOL"
res_macrophage_IFNg_vs_naive$SYMBOL <- rowData(dds_macrophage)$SYMBOL
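To make the filtering rule concrete, here is a minimal base R sketch on a made-up count matrix with three genes and eight samples (gene names and counts are invented). It applies the same expression as above, keeping a gene only if at least 6 samples have 10 or more counts.

```r
# Made-up counts: rows are genes, columns are samples
toy_counts <- rbind(
  geneA = c(25, 30, 12, 18, 40, 11, 0, 3),   # 6 samples with >= 10 counts
  geneB = c(50, 60, 70,  2,  1,  0, 0, 4),   # only 3 samples with >= 10 counts
  geneC = c(10, 10, 10, 10, 10, 10, 10, 10)  # all 8 samples with >= 10 counts
)

# TRUE for genes with at least 10 counts in at least 6 samples
keep <- rowSums(toy_counts >= 10) >= 6
keep

# Subset the matrix to the retained genes
toy_counts[keep, , drop = FALSE]
```

In this toy case geneA and geneC are retained, while geneB is dropped even though its total count is high, because the counts are concentrated in only three samples.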

After completing the differential expression analysis, we move on to conduct the functional annotation analysis. To begin, we extract the differentially expressed (DE) genes from the previously generated results and identify the background genes to be utilized for functional enrichment.

For the enrichment analysis, we use the overrepresentation analysis method provided by the topGO package. To streamline the integration of these results into GeDi, we utilize the topGOtable function from the pcaExplorer package. By default, this function employs the BP ontology and the elim method, which helps decorrelate the Gene Ontology (GO) graph structure, resulting in less redundant functional categories. The output is a DataFrame object that seamlessly integrates with GeDi.

However, as GeDi has only minimal requirements for the input, enrichment results generated using clusterProfiler can also be utilized. While we primarily tested results from the enrichGO method during GeDi development, those from the enrichKEGG and enrichPathway methods are also compatible.

# Load required packages for analysis
library("pcaExplorer")
library("GeneTonic")
library("AnnotationDbi")

# Extract gene symbols from the DESeq2 results object where FDR is below 0.05
# The function deseqresult2df is used to convert the DESeq2 results to a 
# dataframe format
# FDR is set to 0.05 to filter significant results
de_symbols_IFNg_vs_naive <- deseqresult2df(res_macrophage_IFNg_vs_naive,
                                           FDR = 0.05)$SYMBOL

# Extract gene symbols for the background from the DESeqDataSet object
# Keep only genes that have nonzero counts
bg_ids <- rowData(dds_macrophage)$SYMBOL[rowSums(counts(dds_macrophage)) > 0]

# Load required packages for analysis
library("topGO")
library("org.Hs.eg.db")

# Perform Gene Ontology enrichment analysis using the topGOtable function from 
# the "pcaExplorer" package
macrophage_topGO_example <-
  pcaExplorer::topGOtable(de_symbols_IFNg_vs_naive,
    bg_ids,
    ontology = "BP",
    mapping = "org.Hs.eg.db",
    geneID = "symbol",
    topTablerows = 500
  )

As mentioned earlier, GeDi expects the input to contain at least two columns: one named “Genesets” and one named “Genes”. While this is not strictly mandatory when providing your data interactively during an application session, it becomes necessary if you intend to initiate the application with your input as parameters (e.g., GeDi(genesets = my_genesets_df)). In such cases, the “Genesets” column should contain identifiers for each geneset in the input, while the “Genes” column should consist of comma-separated lists of genes associated with each geneset.

Therefore, we will adjust the column names of the resulting data.frame from the enrichment analysis to adhere to the required format.

# Rename columns in the macrophage_topGO_example dataframe
# Change the column name "GO.ID" to "Genesets"
names(macrophage_topGO_example)[names(macrophage_topGO_example) == "GO.ID"] <- "Genesets"

# Change the column name "genes" to "Genes"
names(macrophage_topGO_example)[names(macrophage_topGO_example) == "genes"] <- "Genes"
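For results that do not come from topGOtable, the same two columns have to be created by hand. The sketch below works on a made-up data.frame that mimics the layout of a clusterProfiler result converted with as.data.frame (an ID column and a geneID column with genes separated by "/"). It assumes that a comma separator is required, as described above, and the helper check_geneset_input is hypothetical and not part of GeDi.

```r
# Made-up table mimicking as.data.frame() of a clusterProfiler result
enrich_df <- data.frame(
  ID = c("TERM_1", "TERM_2"),
  Description = c("toy term 1", "toy term 2"),
  geneID = c("STAT1/IRF1/GBP1", "IRF1/CXCL10"),
  stringsAsFactors = FALSE
)

# Rename the identifier and gene columns to the names GeDi expects
names(enrich_df)[names(enrich_df) == "ID"] <- "Genesets"
names(enrich_df)[names(enrich_df) == "geneID"] <- "Genes"

# Turn the "/" separators into commas
enrich_df$Genes <- gsub("/", ",", enrich_df$Genes, fixed = TRUE)

# Hypothetical helper: stop early if the expected columns are missing
check_geneset_input <- function(df) {
  required <- c("Genesets", "Genes")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

check_geneset_input(enrich_df)
enrich_df$Genes
```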

All set!

Now that we’ve obtained functional annotation results from the macrophage dataset, we can begin exploring the data using GeDi. You have two options: you can either launch the application and supply the generated data using the GeDi() command, or if you’ve followed this vignette, you can initiate the application directly with the loaded data by executing GeDi(genesets = macrophage_topGO_example).

GeDi()

GeDi(genesets = macrophage_topGO_example)

The code shown above will open the application, directing you to the Welcome page. The Welcome page is the entry point to the application and provides an overview of its features and functionalities. It offers guidance on how to navigate the application and highlights key components such as data input options, visualization tools, and interactive features. Whether you are new to GeDi or returning to explore additional datasets, the Welcome page is the central place for accessing resources and getting started with your analysis.

Description of the GeDi user interface

The GeDi application, developed with the shiny framework, incorporates the modern design elements of the bs4Dash package, which is built upon Bootstrap 4. This combination of technologies ensures a sleek and visually appealing user interface for navigating and interacting with the functionality offered by GeDi. By leveraging the features of shiny and bs4Dash, GeDi provides users with an intuitive and aesthetically pleasing environment for conducting functional annotation and enrichment analyses on their datasets.

Header (navbar)

The dashboard navbar in GeDi, referred to as such in the bs4Dash framework, features a dropdown menu accessible by clicking on the respective “info” icon. The menu offers additional functionality through various buttons:

  • The open book icon - This option allows users to explore the GeDi vignette, either the version bundled with the package or the online version, providing detailed documentation and usage guidelines.
  • The information (i) circle - Selecting this option displays information about the current session, presenting details such as the R environment and loaded packages, helpful for troubleshooting and debugging purposes.
  • The heart button - This button offers general information about GeDi, including links to its development version for contribution and guidelines on citing the tool in research publications.

Besides the two dropdown menus, users can also find the Bookmark button in the Navbar. The Bookmark button in the GeDi navbar serves as a convenient tool for users to save and bookmark genes and genesets of interest for later reference. To use this feature, users must first select or click on a gene or geneset that they wish to bookmark. Once the desired gene or geneset is selected, users can then click on the Bookmark button to add it to a list of bookmarked items within the GeDi application. This functionality enables users to organize and revisit specific genes or genesets that they find relevant or intriguing during their exploration of functional annotation and enrichment analysis results. The bookmarked genes and genesets can later be found in the Report panel.

Body

The structure of GeDi is designed around different panels, each of which becomes active upon clicking the corresponding icons or text in the sidebar.

While the Welcome panel is relatively self-explanatory, additional information and explanations are provided for the functionality of the remaining panels. For new users seeking guidance, there’s a question circle button available to initiate an interactive tour of GeDi. This tour allows users to learn the basic usage mechanisms by actively engaging with the interface. During the tour, specific elements are highlighted in response to user actions, while the rest of the UI remains shaded to maintain focus. Users can interrupt the tour at any time by clicking outside the highlighted window, and navigation between steps is facilitated by arrow buttons (left, right). The tour functionality is implemented using the rintrojs package.

The GeDi functionality

The GeDi shiny application is organized into distinct panels, each serving a specific purpose, which will be thoroughly explored in the following sections.

The Welcome panel

This panel serves as a guide for utilizing GeDi effectively. It offers detailed instructions on generating input data for the application, elucidating the expected input format and outlining the various interactive elements present in the app’s other panels.

The Welcome panel of GeDi

The Data Input panel

This panel serves as a hub for managing data input if it’s not provided within the function call. It’s divided into distinct boxes, each representing a step of the data input process, which sequentially appear as you successfully complete each preceding step.

Step 1: Provide your Genesets as input data

In the initial Step 1 box, you can provide your data by utilizing the Browse button. This action opens a modal window enabling you to select the relevant file from your computer storage. After successfully loading the data, a preview is displayed in the Genesets preview box on the right. During this step, the application checks if your input contains the “Genesets” and “Genes” columns. If these columns are missing, a small error message appears in the lower right corner. Additionally, two drop-down menus allow you to select the correct columns from your data and update the input accordingly.

You also have the option to start using GeDi with preprocessed example data based on the macrophage dataset. Simply click the Load demo data button to load the example data’s enrichment results. You can explore these results in the Genesets preview box.

However, instead of loading demo data and observing the expected data structure through the Genesets preview box, you can also use the Have a look at the data structure button. By clicking this button, a modal window with a visual representation of the expected input data structure will open. This screenshot serves as a helpful guide, providing you with a clear understanding of how your data should be formatted for optimal compatibility with GeDi.

Once you have successfully loaded some data, the data input process will proceed and two additional boxes will be displayed in the panel.

The Data input panel - Step 1

Optional Filtering Step: Filter generic genesets

Introducing the first new box, the Optional Filtering Step offers a non-compulsory yet advantageous opportunity to refine your geneset selection. While not obligatory for data exploration, engaging in this step can notably optimize downstream processing runtime. Here, you’re empowered to filter genesets within your dataset, thereby enhancing result interpretation. This step enables the exclusion of large and generic genesets, contributing to clearer insights. Additionally, you have the flexibility to filter genesets based on size criteria.

The box features a histogram illustrating geneset sizes, providing visual context for the filtering process. Within the interface, two input fields are available for customization. The left input field facilitates the selection of individual genesets by their identifiers in the “Genesets” column of your dataset. Meanwhile, the right input field empowers you to establish a threshold “x” for filtering genesets with a size greater than or equal to “x.” This interactive approach ensures tailored filtering suited to your specific analysis requirements.

Once you’ve chosen the genesets you wish to exclude from your dataset, you can initiate the filtering process by clicking the “Remove the selected Genesets” button. This action will remove all selected genesets from the dataset. Additionally, you have the option to save the filtered data using the “Download the filtered data” button. Clicking this button will save the filtered data to your local machine. This feature can be particularly beneficial for users who intend to revisit their data in a new instance of GeDi and want to ensure that previously identified uninsightful genesets have already been filtered out.
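A similar size-based filter can also be approximated outside the application before the data are loaded. The following base R sketch assumes a data.frame in the input format described earlier; the toy genesets and the threshold value are invented, and the code does not claim to reproduce the exact logic used inside GeDi.

```r
# Toy input in the expected format (values are made up)
genesets_df <- data.frame(
  Genesets = c("GS1", "GS2", "GS3"),
  Genes = c("A,B,C,D,E,F", "A,B", "C,D,E"),
  stringsAsFactors = FALSE
)

# Number of genes per geneset, obtained by splitting on commas
geneset_sizes <- lengths(strsplit(genesets_df$Genes, ",", fixed = TRUE))
names(geneset_sizes) <- genesets_df$Genesets
geneset_sizes

# Remove genesets with size >= x (threshold chosen for illustration)
x <- 5
filtered_df <- genesets_df[geneset_sizes < x, , drop = FALSE]
filtered_df$Genesets
```

Here GS1 contains six genes and is removed, while GS2 and GS3 remain.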

Optional Filtering Step

Step 2: Species Selection

Upon advancing to the second box labeled Step 2, you’ll encounter the crucial task of selecting the species associated with your dataset. This step holds significant importance for the computation of the pMM score within GeDi, which heavily relies on a Protein-Protein Interaction (PPI) matrix. This matrix plays a pivotal role in capturing protein interaction strength, thereby enriching distance scores with valuable biological context. To access and utilize this essential information, specifying the species linked to your dataset is mandatory. By clicking the input field, you’ll prompt a dropdown menu showcasing preselected species options. If your species is included, simply make your selection. Alternatively, if your species is not listed, you have the option to manually input it. In cases of uncertainty, a convenient link provided on the right directs you to the STRING database, enabling verification of species details and PPI availability for informed decision-making.

Species Selection

Step 3: PPI Matrix Download

Following species selection, a third box named Step 3 will appear. Here you can download the Protein-Protein Interaction (PPI) matrix. This may take some time; a progress bar in the lower right corner reports the download status. Once the download is complete, you can preview the PPI matrix in the PPI Preview box on the right-hand side of the interface. The preview shows that the PPI consists of three columns: Gene1 and Gene2, holding the gene symbols of the interacting proteins, and combined_score, denoting the confidence level of each interaction. The score is derived from the number of known interactions between two proteins, normalized to the [0, 1] interval using the formula:

$$ \begin{aligned} \text{combinedScore} = \frac{\#\text{interactions} - \min}{\max - \min} \end{aligned} $$

where min and max represent the minimum and maximum number of interactions, respectively.
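As a small worked example of this normalization, the sketch below rescales a made-up vector of interaction counts with the formula above. The variable names are illustrative, and the snippet is not taken from the GeDi source code.

```r
# Made-up interaction counts for four protein pairs
n_interactions <- c(2, 5, 8, 14)

# Min-max scaling as in the formula above
combined_score <- (n_interactions - min(n_interactions)) /
  (max(n_interactions) - min(n_interactions))
combined_score
#> [1] 0.00 0.25 0.50 1.00
```

The pair with the fewest interactions is mapped to 0 and the pair with the most interactions to 1.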

In addition to downloading a PPI matrix during the current session, users can also upload a previously saved matrix for analysis using the Browse button. This functionality allows users to work with their own customized datasets or previously analyzed PPI matrices. Furthermore, saving the downloaded PPI matrix locally enables users to store the data on their machine for future use. By saving the matrix locally via the Save PPI matrix button, users can access the data quickly in subsequent sessions without having to wait for the download process again. This capability significantly enhances workflow efficiency and allows for seamless continuation of analysis across different sessions.

Downloading the PPI

The final two steps are optional, because the PPI matrix is only required for a single score, the pMM score. You can therefore start exploring your data without completing them.

Upon concluding the essential tasks outlined in this panel, you are ready to progress to the Distance Scores panel.

The Distance Scores panel

This panel focuses mainly on computing distance scores for the provided input data. Like the preceding panel, it is segmented into two distinct sections, each serving a specific function.

The Distance Score panel

Calculating Distance Scores

In the upper box, titled Calculate distance scores for your Genesets, you have the flexibility to select from various distance scores for computation. This feature provides users with a range of options to tailor the analysis according to their specific requirements and preferences. The available scores are:

  • pMM Score: This score integrates protein-protein interaction (PPI) data into the Meet-Min distance. The PPI-weighted Meet-Min (pMM) score is defined as

$$ \begin{aligned} pMM = \min(pMM(A \to B), pMM(B \to A)) \end{aligned} $$

where

$$ \begin{aligned} pMM(A \to B) = 1 - \frac{|A \cap B|}{\min(|A|, |B|)} - \frac{\alpha}{\min(|A|, |B|)} \sum_{a \in A - B} \frac{w \sum_{b \in A \cap B} P(a, b) + \sum_{b \in B - A} P(a, b)}{\max(P) \, (w \, |A \cap B| + |B - A|)} \end{aligned} $$

and

$$ \begin{aligned} w = \frac{\min(|A|, |B|)}{|A| + |B|} \end{aligned} $$

Here, P(a, b) is the interaction score of genes a and b in the PPI matrix, and α is a scaling factor between 0 and 1. The PPI matrix can be downloaded from the Data Input panel. More details can be found in the paper by Yoon et al. (Yoon et al. 2019).

  • Kappa Score: The Kappa distance is a set-based metric based on observed and expected agreement rates between two genesets. It is defined as

$$ \begin{aligned} Kappa = 1 - \frac{O - E}{1 - E} \end{aligned} $$

where

$$ \begin{aligned} O &= \frac{|A \cap B| + |(A \cup B)^c|}{|U|} \\ E &= \frac{|A| |B| + |A^c| |B^c|}{|U|^2} \end{aligned} $$

U is the set of all unique genes in the data, and the superscript c denotes the complement with respect to U. In this application the Kappa distance is additionally normalized to the (0, 1) interval to make it comparable to the remaining distance metrics.

  • Jaccard Score: The Jaccard distance uses the Jaccard coefficient, which is transformed into a distance metric by subtracting it from 1. It is defined as

$$ \begin{aligned} Jaccard = 1 - \frac{|A \cap B|}{|A \cup B|} \end{aligned} $$

  • Meet-Min Score: The Meet-Min (MM) distance is derived from the overlap coefficient, a similarity measure defined as

$$ \begin{aligned} OC = \frac{|A \cap B|}{\min(|A|, |B|)} \end{aligned} $$

Subtracting the overlap coefficient from 1 turns this similarity into a distance, the Meet-Min (MM) distance:

$$ \begin{aligned} MM = 1 - \frac{|A \cap B|}{\min(|A|, |B|)} \end{aligned} $$

As a solely set-based measurement, the Meet-Min distance only takes the composition of the genesets into account but not the underlying biological information inherent in the genesets.

  • Sorensen-Dice: The Sorensen-Dice distance uses the Sorensen-Dice coefficient, which is transformed into a distance metric by subtracting it from 1. It is defined as

$$ \begin{aligned} \text{Sorensen-Dice}(A, B) = 1 - \frac{2 \, |A \cap B|}{|A| + |B|} \end{aligned} $$

As a solely set-based measurement, the Sorensen-Dice distance only takes the composition of the genesets into account but not the underlying biological information inherent in the genesets.

  • GO distance: The GO distance score measures the relationship between genesets that are represented by GO terms. The underlying similarity methods are implemented in the GOSemSim R package and fall into two main types: information content (IC)-based methods (e.g., Resnik, Lin, Schlicker, and Jiang) and graph-based methods (e.g., Wang). These methods compute similarity scores based on shared characteristics, such as the most informative common ancestor in IC-based methods or the hierarchical structure of the GO database in graph-based methods. To integrate these scores into distance-based analyses, the similarity scores are converted into distance scores by subtracting the similarity score from 1. This transformation ensures compatibility with other distance metrics used in GeDi. While applicable only to GO terms, this approach is particularly useful in gene function analyses.
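To make the set-based definitions above concrete, the following base R sketch evaluates the Jaccard, Meet-Min, Sorensen-Dice, and Kappa distances for two made-up genesets. The function names are hypothetical and written for illustration only; they are not the GeDi implementations, and the additional normalization of the Kappa distance mentioned above is omitted.

```r
# Two made-up genesets and a made-up universe of 8 genes.
# The functions assume unique gene names and that A and B are subsets of U.
A <- c("a", "b", "c", "d")
B <- c("c", "d", "e")
U <- letters[1:8]

jaccard_dist <- function(A, B) {
  1 - length(intersect(A, B)) / length(union(A, B))
}

meet_min_dist <- function(A, B) {
  1 - length(intersect(A, B)) / min(length(A), length(B))
}

sorensen_dice_dist <- function(A, B) {
  1 - 2 * length(intersect(A, B)) / (length(A) + length(B))
}

kappa_dist <- function(A, B, U) {
  n <- length(U)
  in_both <- length(intersect(A, B))
  in_neither <- n - length(union(A, B))  # genes in neither A nor B
  observed <- (in_both + in_neither) / n
  expected <- (length(A) * length(B) +
    (n - length(A)) * (n - length(B))) / n^2
  1 - (observed - expected) / (1 - expected)
}

round(c(
  jaccard = jaccard_dist(A, B),
  meet_min = meet_min_dist(A, B),
  sorensen_dice = sorensen_dice_dist(A, B),
  kappa = kappa_dist(A, B, U)
), 3)
```

For these toy sets, which share two genes, the distances are 0.6 (Jaccard), 0.333 (Meet-Min), 0.429 (Sorensen-Dice), and 0.75 (Kappa before normalization), which shows how the same overlap is weighted differently by each score.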

Each scoring method has its own advantages and drawbacks, so it is important to select one that suits your dataset characteristics and analysis goals. Upon choosing a score, the Compute the distances between genesets button appears on the right side. Clicking this button initiates the scoring procedure, which may require some time to execute, particularly for larger datasets. To monitor the progress of this operation, refer to the progress bar located in the lower right corner of the panel. Once the scoring process concludes, you can turn to the Geneset Distance Scores box to explore a variety of visual representations of your data.

Distance Scores Visualizations

  • Distance Scores Heatmap: The initial visualization offered is a heatmap illustrating the distribution of distance scores. Activation of the heatmap generation is triggered by clicking the Calculate Distance Score Heatmap button. Following computation, users can interact with the heatmap by hovering over it, revealing the involved genesets and their corresponding scores. Additionally, users can zoom in on specific areas of interest. To reset the zoomed view, a simple click outside the heatmap area suffices.

  • Distance Scores Dendrogram: The second visualization provided is a dendrogram showcasing individual distance scores. Hierarchical clustering is employed to generate the dendrogram, which effectively groups genesets exhibiting the highest similarity. To enhance the dendrogram’s presentation, users can select different combination methods using the drop-down menu located on the left side.

  • Distance Scores Graph: The final visualization available is the network representation of distance scores. In this representation, nodes/genesets with scores below a predefined threshold are connected by edges. By default, the threshold is set to 0.3, but users can adjust it via the slider located on the left. This interactive graph allows users to hover over or click on nodes to highlight connected nodes and obtain additional information about genesets upon selection. Furthermore, users can search for specific genesets using the input field on the left, with the selected geneset being subsequently highlighted in the graph. The Graph metrics table at the bottom of this box contains various metrics pertaining to the graph, such as degree, betweenness, harmonic centrality, clustering coefficient, and input data. This tabulated information serves to provide users with valuable insights into the underlying data and distance scores.
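The dendrogram and graph views can be mimicked on a small scale with base R. The sketch below starts from a made-up symmetric distance matrix, runs hierarchical clustering with hclust, and derives an adjacency matrix by connecting genesets whose distance is below the default threshold of 0.3. The average linkage and the degree calculation are choices made for this illustration, not a description of GeDi internals.

```r
# Made-up distance scores between four genesets
ids <- c("GS1", "GS2", "GS3", "GS4")
dist_mat <- matrix(
  c(0.0, 0.1, 0.8, 0.9,
    0.1, 0.0, 0.7, 0.9,
    0.8, 0.7, 0.0, 0.2,
    0.9, 0.9, 0.2, 0.0),
  nrow = 4, byrow = TRUE, dimnames = list(ids, ids)
)

# Dendrogram: hierarchical clustering on the distance scores
hc <- hclust(as.dist(dist_mat), method = "average")
groups <- cutree(hc, k = 2)
groups

# Graph: connect genesets whose distance is below the threshold
threshold <- 0.3
adjacency <- dist_mat < threshold
diag(adjacency) <- FALSE  # no self-loops

# Degree of each node (number of connected genesets)
degree <- rowSums(adjacency)
degree
```

With these toy values, GS1 and GS2 form one group and GS3 and GS4 another, and each geneset has exactly one neighbor in the thresholded graph. Raising the threshold would add edges and lowering it would remove them.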

Bookmarking from this panel:

As users navigate through the distance scores of genesets in this section, they may encounter genesets and interactions that capture their interest and merit further investigation. To aid in preserving these noteworthy genesets for later exploration, you can utilize the Bookmark button situated in the Navbar. Upon clicking this button, the selected geneset will be added to the list of bookmarked genesets within the Report panel. Additionally, informative messages displayed in the lower right corner will guide users through the bookmarking process.

Once you’ve finished exploring the distance scores, you can proceed to the Clustering Graph panel.

The Clustering Graph panel

This panel is dedicated to the computation of clusters among genesets based on their similarity, which is derived from the previously calculated distance scores. Similar to the preceding panel, it comprises two distinct boxes. Within these boxes, users can access functionalities to determine and visualize clusters of genesets that exhibit comparable characteristics or functions.

The computation of clusters involves grouping genesets that display similar patterns of distance scores, thereby indicating shared biological characteristics or functional relationships. This clustering process enables users to identify cohesive groups of genesets with related functionalities or involvement in similar biological processes.

The Clustering Graph panel

Choosing a Clustering algorithm

The upper box, labeled Select the clustering method, provides a selection of distinct clustering algorithms. Users can explore various options to find the most suitable algorithm for their analysis:

  • Louvain: The Louvain algorithm, a prevalent tool in biological network analysis, seeks to divide graph nodes into clusters to optimize the modularity metric. This metric gauges the strength of connections within clusters relative to those between clusters. Consequently, nodes within the same cluster exhibit greater similarity to one another than to nodes outside the cluster. This clustering approach aims to enhance data interpretation by grouping similar genesets together. Users can adjust a slider in the bottom left corner of the box to set a similarity threshold, determining when genesets are considered similar based on distance scores.

  • Markov: The Markov clustering algorithm (MCL), commonly employed in biological network analysis, is designed to pinpoint densely interconnected regions within graphs. These regions frequently align with communities or clusters in the graph structure. Users can utilize a slider located in the bottom left corner of the box to specify a similarity threshold, determining when genesets are deemed similar based on distance scores.

  • Fuzzy clustering: The Fuzzy Clustering algorithm is a computational technique used to partition data points into clusters based on their similarity, while allowing for data points to belong to multiple clusters with varying degrees of membership. It operates through distinct steps and requires the specification of different thresholds. Firstly, the Similarity threshold is set to determine if two genesets exhibit sufficient similarity to be potentially clustered together. Secondly, the Membership threshold dictates how many members of a potential cluster must possess a close relationship, defined by a distance score less than or equal to the similarity threshold, for the cluster to persist. Lastly, the Clustering threshold determines whether two clusters will be merged. Clusters are merged if their percentage of overlap meets or exceeds the clustering threshold. Users can adjust all thresholds using sliders provided in the interface.

  • PAM: The PAM (Partitioning Around Medoids) clustering algorithm partitions nodes into k distinct clusters, where k is a user-defined parameter. Each cluster is represented by a medoid, i.e. an actual geneset chosen as the cluster center. The algorithm iteratively assigns each node to the nearest medoid based on the calculated distance scores, and then updates the medoids to minimize the total distance between the nodes and the medoid of their cluster. Users can specify the number of clusters, k, using a slider in the interface. Adjusting the value of k enables the exploration of different clustering granularities.
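
The effect of the similarity threshold and of k can be illustrated outside the app. The following is a minimal sketch and not GeDi's internal implementation: the distance matrix toy_dist, the geneset names and the threshold value are invented for illustration, and PAM is run with the pam() function of the cluster package.

```r
# Toy symmetric distance matrix for five hypothetical genesets
# (invented values; 0 = identical, 1 = unrelated)
toy_dist <- matrix(
  c(0.0, 0.1, 0.2, 0.9, 0.8,
    0.1, 0.0, 0.2, 0.9, 0.9,
    0.2, 0.2, 0.0, 0.8, 0.9,
    0.9, 0.9, 0.8, 0.0, 0.1,
    0.8, 0.9, 0.9, 0.1, 0.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("GS", 1:5), paste0("GS", 1:5))
)

# Similarity threshold: two genesets are connected by an edge when
# their distance score is at or below this value
sim_threshold <- 0.3
adjacency <- toy_dist <= sim_threshold
diag(adjacency) <- FALSE # no self-loops

# Number of neighbours of each geneset in the thresholded graph
n_neighbours <- rowSums(adjacency)

# PAM with k = 2, applied directly to the distance scores
pam_fit <- cluster::pam(as.dist(toy_dist), k = 2)
pam_clusters <- pam_fit$clustering
```

Graph-based methods such as Louvain or Markov clustering operate on a thresholded graph like the one encoded in adjacency, so raising the threshold adds edges and tends to produce fewer, larger clusters. PAM instead works on the distance scores themselves and always returns exactly k clusters.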

Once you have chosen a method, you can start the cluster calculation via the Cluster the Genesets button on the right. Keep in mind that this step might take some time, especially for larger datasets. Look for the progress bar in the lower right corner for updates on the clustering status.

Once the clusters are calculated, you can explore various visualizations of your data in the Geneset Cluster Graphs box.

Cluster Visualizations

  • Geneset Graph: In the Geneset Graph, clusters are visualized as a graph, with individual genesets serving as nodes and edges connecting genesets within the same cluster. To highlight specific nodes, utilize the Select by id feature on the left, or choose to highlight entire clusters by selecting the respective option under Select by cluster. Please note that only genesets belonging to at least one cluster will be displayed in this graph. For additional insights, nodes can be colored based on specific parameters from your input data, accessible through the Color the graph by dropdown menu. Depending on the information provided with your data, various options will be available. While interacting with the network, nodes can be moved by clicking and dragging them to desired locations, offering flexibility in managing node placement which is particularly useful in complex or densely populated graphs.

  • Cluster-Geneset Bipartite Graph: The Cluster-Geneset Bipartite Graph presents a bipartite representation of the clusters. In this visualization, nodes represent both clusters and genesets, with edges connecting cluster nodes to their corresponding geneset members. Hovering over nodes provides additional data insights. Cluster nodes display the members within each cluster, while geneset nodes showcase the genes associated with each geneset.

  • Cluster Enrichment Terms Word Cloud: The Cluster Enrichment Terms Word Cloud displays the most frequently occurring terms for each cluster. This visualization proves particularly useful when your data includes brief descriptions of the genesets, in addition to the mandatory input data. By utilizing the Select a cluster drop-down menu, you can designate the cluster of interest. Furthermore, hovering over the word cloud enables you to select individual terms and view the frequency with which each term appears in the descriptions of the genesets within that cluster.

  • Clustering graph summaries: The cluster information is also summarized in a table-like format in the Clustering graph summaries box. This table displays each geneset alongside the cluster to which it belongs. Additionally, the table features a search function, facilitating the quick retrieval of a geneset of interest.
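
The term frequencies behind a word cloud of this kind can be sketched in a few lines of base R. This is only an illustration of the idea, not the code used by the app: the descriptions, the cluster labels, the stop word list and the helper count_terms() are all made up for this example.

```r
# Hypothetical geneset descriptions with an invented cluster assignment
descriptions <- c(
  "regulation of immune response",
  "innate immune response",
  "response to interferon gamma",
  "mitotic cell cycle",
  "cell cycle checkpoint"
)
cluster_id <- c(1, 1, 1, 2, 2)

# Words that carry little meaning and are dropped before counting
stop_words <- c("of", "to", "the", "in", "and")

# Split the descriptions into lower-case words and count each term
count_terms <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  sort(table(words), decreasing = TRUE)
}

# One frequency table per cluster
term_freq <- lapply(split(descriptions, cluster_id), count_terms)
term_freq[["1"]]
```

In this toy example, "response" is the most frequent term of cluster 1 and would therefore be drawn largest in its word cloud.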

Bookmarking from this panel:

While exploring the Clustering Graph panel, users may encounter genesets and clusters that intrigue them and warrant further investigation. To facilitate the preservation of these notable genesets and clusters for future exploration, users can utilize the Bookmark button located in the Navbar. Clicking this button will add the selected geneset or cluster to the list of bookmarked items within the Report panel. Helpful messages displayed in the lower right corner will assist users throughout the bookmarking process.

To bookmark interesting genesets and clusters, users simply select a geneset or cluster from the Geneset Graph or Cluster-Geneset Bipartite Graph and use the Bookmark button to add the respective information to the set of bookmarked features. After exploring the results in the Clustering Graph panel, users can proceed to the Report panel to have a look at the bookmarked genesets and clusters, or iterate through the individual panels of the app for a more in-depth exploration of the data.

The Report panel

In this panel of the application, users can obtain a comprehensive overview of the items they have bookmarked for further exploration. On the left side of the interface, bookmarked genesets are listed, while bookmarked clusters are displayed on the right side.

During an interactive exploration session, recalling specific details about each bookmarked item can sometimes be challenging. Therefore, users are provided with convenient options to manage their bookmarked data.

Below the interactive tables displaying bookmarked genesets and clusters, users can find buttons allowing them to download the content of each table individually. Additionally, the Start the generation of the report button is provided to generate a detailed report encompassing all selected elements of interest.

The report generation process utilizes a predefined template report included within the GeDi package. This template leverages the input elements and reactive values associated with the bookmarks, ensuring that the generated report contains comprehensive and relevant information.

The resulting report serves as a valuable tool for creating a permanent and reproducible analysis output. Users can easily store or share this report for future reference or collaboration purposes.

Additional Information

If you have questions about the package or the available functionality, please submit them on the Bioconductor support site using the tag ‘GeDi’.

Bug reports can be opened as issues in the GeDi GitHub repository. Please note that the GitHub repository also hosts the development version of the package, where new functionality is continuously added - be cautious, as you may be working with cutting-edge versions!

The authors welcome thoughtful suggestions for enhancements or new features, and even better, pull requests.

Additional example data

This section presents additional examples showing that GeDi is not limited to a single annotation source: it can equally be used with functional enrichment results based on two widely used pathway databases, KEGG and Reactome.

Specifically, we demonstrate how results containing identifiers from KEGG (Kanehisa et al. 2023) or Reactome (Gillespie et al. 2022), e.g. generated using the enrichKEGG function from the clusterProfiler package or the enrichPathway function from the ReactomePA package, can be used as input for GeDi. We again use the data of the macrophage package, specifically the differentially expressed genes we have identified before. With this data, we demonstrate how to generate the results and prepare them for their use in GeDi.

However, before we can use the enrichKEGG function from the clusterProfiler package, we have to map the ENSEMBL ids of the data to Entrez ids. For this, we will first use the biomaRt package to generate a mapping of ENSEMBL to Entrez.

# Load the "biomaRt" package to access the BioMart database
library("biomaRt")

# Set up a connection to the ENSEMBL BioMart database for human genes
mart <-
  useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

# Retrieve gene annotations using the BioMart database
anns <- getBM(
  attributes = c(
    "ensembl_gene_id",
    "external_gene_name",
    "entrezgene_id",
    "description"
  ),
  filters = "ensembl_gene_id",
  values = rownames(dds_macrophage),
  mart = mart
)

# Match the retrieved annotations to the genes in dds_macrophage
anns <- anns[match(rownames(dds_macrophage), anns[, 1]), ]

Next, we map the differentially expressed genes to get the right identifiers and run the enrichKEGG function. We set the organism to human and the p-value cutoff to 5%.

# Load the "clusterProfiler" package for functional enrichment analysis
library("clusterProfiler")

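# Retrieve Entrez gene IDs from the annotations data frame based on matching 
# Ensembl gene IDs from the DE results
genes <- anns$entrezgene_id[match(rownames(res_macrophage_IFNg_vs_naive),
                                  anns$ensembl_gene_id)]

# Drop genes without an Entrez mapping and convert the ids to character
genes <- unique(as.character(genes[!is.na(genes)]))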

# Perform KEGG pathway enrichment analysis using the retrieved gene IDs
res_enrich <- enrichKEGG(genes,
  organism = "hsa",
  pvalueCutoff = 0.05
)

We can now use the results of the enrichment in GeDi by starting the app directly with the loaded data. If you have not computed the results following this workflow, you can load them from the example data shipped with this package.

# Load the "macrophage_KEGG_example" dataset from the "GeDi" package
data("macrophage_KEGG_example", package = "GeDi")

# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_KEGG_example" 
# dataset
GeDi(genesets = macrophage_KEGG_example)

In a similar manner we can use the Reactome database for the functional annotation. Here, we use the ReactomePA package and the differentially expressed genes.

# Load the "ReactomePA" package for pathway enrichment analysis
library("ReactomePA")

# Perform pathway enrichment analysis using the "enrichPathway" function
reactome <- enrichPathway(genes,
  organism = "human",
  pvalueCutoff = 0.05,
  readable = TRUE
)

Now we can use the results in the same manner as for the KEGG pathway analysis.

# Load the "macrophage_Reactome_example" dataset from the "GeDi" package
data("macrophage_Reactome_example", package = "GeDi")

# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_Reactome_example" 
# dataset
GeDi(genesets = macrophage_Reactome_example)

FAQs

Q: My configuration on two machines is somewhat different, so I am having difficulty in finding out what packages are different. Is there something to help on this?

A: Yes, you can check out sessionDiffo, a small utility to compare two different sessionInfo outputs. This can help you pinpoint which packages might be causing the issue.
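
The underlying idea can also be sketched in base R. The snippet below does not use the interface of sessionDiffo; the helper compare_versions() and the version vectors are hypothetical, and on a real system such vectors could be collected with installed.packages()[, "Version"].

```r
# Package versions as they might be recorded on two machines
# (hypothetical values for illustration)
versions_a <- c(GeDi = "1.3.0", igraph = "2.1.1", shiny = "1.9.1")
versions_b <- c(GeDi = "1.3.0", igraph = "2.0.3", DT = "0.33")

# Report packages present on only one machine and packages whose
# versions differ between the two
compare_versions <- function(a, b) {
  shared <- intersect(names(a), names(b))
  list(
    only_in_a = setdiff(names(a), names(b)),
    only_in_b = setdiff(names(b), names(a)),
    different = shared[a[shared] != b[shared]]
  )
}

pkg_diff <- compare_versions(versions_a, versions_b)
pkg_diff
```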

Q: I am using a different service/software for generating the results of functional enrichment analysis. How do I plug this into GeDi?

A: You can use nearly any result of a functional enrichment analysis in GeDi as long as the results are transformed in a way that they fit the input requirements. Please check out the Welcome page to see the specification of the input requirements.
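
As a rough illustration of such a transformation, the sketch below reshapes a mock enrichment table in base R. Several things here are assumptions and should be checked against the specification on the Welcome page: the target column names Genesets and Genes, the use of a comma as gene separator, and the layout of the mock table enrich_df, whose values are invented.

```r
# Mock result table with an ID, a description and slash-separated genes,
# a layout commonly seen in enrichment results (values are invented)
enrich_df <- data.frame(
  ID = c("hsa04612", "hsa04668"),
  Description = c("Antigen processing and presentation",
                  "TNF signaling pathway"),
  geneID = c("3105/3106/6890", "3627/4792"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: the input needs a "Genesets" and a "Genes" column, with
# genes separated by commas; verify this on the Welcome page
gedi_input <- data.frame(
  Genesets = enrich_df$ID,
  Genes = gsub("/", ",", enrich_df$geneID, fixed = TRUE),
  Description = enrich_df$Description,
  stringsAsFactors = FALSE
)
gedi_input
```

Keeping the description column is optional in this sketch, but brief descriptions are what the Cluster Enrichment Terms Word Cloud draws on.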

Session Info

utils::sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#>  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
#> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] org.Hs.eg.db_3.20.0         topGO_2.57.0                SparseM_1.84-2             
#>  [4] GO.db_3.20.0                graph_1.83.0                AnnotationDbi_1.69.0       
#>  [7] GeneTonic_2.99.1            pcaExplorer_2.99.1          DESeq2_1.45.3              
#> [10] SummarizedExperiment_1.35.5 Biobase_2.67.0              MatrixGenerics_1.17.1      
#> [13] matrixStats_1.4.1           GenomicRanges_1.57.2        GenomeInfoDb_1.41.2        
#> [16] IRanges_2.39.2              S4Vectors_0.43.2            BiocGenerics_0.53.0        
#> [19] macrophage_1.21.0           GeDi_1.3.0                  BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] R.methodsS3_1.8.2        GSEABase_1.67.1          progress_1.2.3          
#>   [4] wordcloud2_0.2.1         DT_0.33                  Biostrings_2.75.0       
#>   [7] vctrs_0.6.5              ggtangle_0.0.3           digest_0.6.37           
#>  [10] png_0.1-8                shape_1.4.6.1            shinyBS_0.61.1          
#>  [13] registry_0.5-1           ggrepel_0.9.6            MASS_7.3-61             
#>  [16] reshape2_1.4.4           httpuv_1.6.15            foreach_1.5.2           
#>  [19] qvalue_2.37.0            withr_3.0.2              xfun_0.48               
#>  [22] ggfun_0.1.7              survival_3.7-0           memoise_2.0.1           
#>  [25] clusterProfiler_4.15.0   gson_0.1.0               BiasedUrn_2.0.12        
#>  [28] tidytree_0.4.6           GlobalOptions_0.1.2      gtools_3.9.5            
#>  [31] R.oo_1.26.0              sys_3.4.3                prettyunits_1.2.0       
#>  [34] KEGGREST_1.45.1          promises_1.3.0           httr_1.4.7              
#>  [37] restfulr_0.0.15          hash_2.2.6.3             shinyAce_0.4.3          
#>  [40] UCSC.utils_1.1.0         miniUI_0.1.1.1           generics_0.1.3          
#>  [43] DOSE_4.1.0               base64enc_0.1-3          curl_5.2.3              
#>  [46] zlibbioc_1.51.2          polyclip_1.10-7          ca_0.71.1               
#>  [49] GenomeInfoDbData_1.2.13  SparseArray_1.5.45       RBGL_1.81.0             
#>  [52] threejs_0.3.3            xtable_1.8-4             stringr_1.5.1           
#>  [55] doParallel_1.0.17        evaluate_1.0.1           S4Arrays_1.5.11         
#>  [58] BiocFileCache_2.15.0     hms_1.1.3                colorspace_2.1-1        
#>  [61] filelock_1.0.3           visNetwork_2.1.2         NLP_0.3-0               
#>  [64] shinyWidgets_0.8.7       magrittr_2.0.3           Rgraphviz_2.49.1        
#>  [67] later_1.3.2              buildtools_1.0.0         viridis_0.6.5           
#>  [70] ggtree_3.13.2            lattice_0.22-6           NMF_0.28                
#>  [73] genefilter_1.87.0        XML_3.99-0.17            cowplot_1.1.3           
#>  [76] maketools_1.3.1          pillar_1.9.0             nlme_3.1-166            
#>  [79] iterators_1.0.14         gridBase_0.4-7           caTools_1.18.3          
#>  [82] compiler_4.4.1           stringi_1.8.4            shinycssloaders_1.1.0   
#>  [85] Category_2.73.0          TSP_1.2-4                dendextend_1.18.1       
#>  [88] GenomicAlignments_1.41.0 plyr_1.8.9               crayon_1.5.3            
#>  [91] abind_1.4-8              BiocIO_1.17.0            ggdendro_0.2.0          
#>  [94] gridGraphics_0.5-1       chron_2.3-61             locfit_1.5-9.10         
#>  [97] bit_4.5.0                dplyr_1.1.4              fastmatch_1.1-4         
#> [100] codetools_0.2-20         crosstalk_1.2.1          bslib_0.8.0             
#> [103] slam_0.1-54              GetoptLong_1.0.5         plotly_4.10.4           
#> [106] tm_0.7-14                mime_0.12                mosdef_1.1.3            
#> [109] splines_4.4.1            circlize_0.4.16          Rcpp_1.0.13             
#> [112] dbplyr_2.5.0             tippy_0.1.0              knitr_1.48              
#> [115] blob_1.2.4               utf8_1.2.4               clue_0.3-65             
#> [118] fs_1.6.4                 backbone_2.1.4           expm_1.0-0              
#> [121] ggplotify_0.1.2          tibble_3.2.1             sqldf_0.4-11            
#> [124] Matrix_1.7-1             statmod_1.5.0            tweenr_2.0.3            
#> [127] pkgconfig_2.0.3          pheatmap_1.0.12          tools_4.4.1             
#> [130] cachem_1.1.0             RSQLite_2.3.7            viridisLite_0.4.2       
#> [133] DBI_1.2.3                fastmap_1.2.0            rmarkdown_2.28          
#> [136] scales_1.3.0             grid_4.4.1               shinydashboard_0.7.2    
#> [139] Rsamtools_2.21.2         sass_0.4.9               patchwork_1.3.0         
#> [142] BiocManager_1.30.25      fontawesome_0.5.2        farver_2.1.2            
#> [145] mgcv_1.9-1               gsubfn_0.7               yaml_2.3.10             
#> [148] AnnotationForge_1.49.0   rtracklayer_1.65.0       cli_3.6.3               
#> [151] purrr_1.0.2              txdbmaker_1.1.2          webshot_0.5.5           
#> [154] lifecycle_1.0.4          rintrojs_0.3.4           BiocParallel_1.41.0     
#> [157] annotate_1.85.0          gtable_0.3.6             rjson_0.2.23            
#> [160] ggridges_0.5.6           parallel_4.4.1           ape_5.8                 
#> [163] limma_3.61.12            jsonlite_1.8.9           colourpicker_1.3.0      
#> [166] seriation_1.5.6          bitops_1.0-9             ggplot2_3.5.1           
#> [169] bit64_4.5.2              assertthat_0.2.1         yulab.utils_0.1.7       
#> [172] BiocNeighbors_2.1.0      proto_1.0.0              heatmaply_1.5.0         
#> [175] geneLenDataBase_1.41.2   bs4Dash_2.3.4            highr_0.11              
#> [178] jquerylib_0.1.4          GOSemSim_2.31.2          R.utils_2.12.3          
#> [181] lazyeval_0.2.2           shiny_1.9.1              dynamicTreeCut_1.63-1   
#> [184] htmltools_0.5.8.1        enrichplot_1.25.5        rappdirs_0.3.3          
#> [187] glue_1.8.0               STRINGdb_2.17.3          httr2_1.0.5             
#> [190] XVector_0.45.0           RCurl_1.98-1.16          treeio_1.29.2           
#> [193] ComplexUpset_1.3.3       gridExtra_2.3            igraph_2.1.1            
#> [196] R6_2.5.1                 tidyr_1.3.1              gplots_3.2.0            
#> [199] GenomicFeatures_1.57.1   cluster_2.1.6            rngtools_1.5.2          
#> [202] aplot_0.2.3              DelayedArray_0.31.14     tidyselect_1.2.1        
#> [205] plotrix_3.8-4            GOstats_2.71.0           ggforce_0.4.2           
#> [208] xml2_1.3.6               munsell_0.5.1            KernSmooth_2.23-24      
#> [211] goseq_1.57.2             data.table_1.16.2        htmlwidgets_1.6.4       
#> [214] fgsea_1.31.6             ComplexHeatmap_2.23.0    RColorBrewer_1.1-3      
#> [217] biomaRt_2.63.0           rlang_1.1.4              fansi_1.0.6

References

Alasoo, Kaur, Julia Rodrigues, Subhankar Mukhopadhyay, Andrew J Knights, Alice L Mann, Kousik Kundu, Christine Hale, Gordon Dougan, and Daniel J Gaffney. 2018. “Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response.” Nature Genetics 50 (3): 424–31. https://doi.org/10.1038/s41588-018-0046-7.
Gillespie, Marc, Bijay Jassal, Ralf Stephan, Marija Milacic, Karen Rothfels, Andrea Senff-Ribeiro, Johannes Griss, et al. 2022. “The reactome pathway knowledgebase 2022.” Nucleic Acids Research 50 (D1): D687–92. https://doi.org/10.1093/nar/gkab1028.
Kanehisa, Minoru, Miho Furumichi, Yoko Sato, Masayuki Kawashima, and Mari Ishiguro-Watanabe. 2023. “KEGG for taxonomy-based analysis of pathways and genomes.” Nucleic Acids Research 51 (D1): D587–92. https://doi.org/10.1093/nar/gkac963.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Yoon, Sora, Jinhwan Kim, Seon-kyu Kim, Bukyung Baik, Sang-mun Chi, and Seon-young Kim. 2019. “GScluster: network-weighted gene-set clustering analysis,” 1–14.