---
title: >
  The `GeDi` User's Guide
author:
- name: Annekathrin Silvia Nedwed
  affiliation:
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  email: anneludt@uni-mainz.de
- name: Federico Marini
  affiliation:
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  - Center for Thrombosis and Hemostasis (CTH), Mainz
  email: marinif@uni-mainz.de
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('GeDi')`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{The GeDi User's Guide}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{GeDi}
  %\VignetteKeywords{FunctionalAnnotation, Enrichment Analysis,
  Distance measurements, Exploration, Visualization, GUI}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: GeDi.bib
---
**Compiled date**: `r Sys.Date()`
**Last edited**: 2024-02-29
**License**: `r packageDescription("GeDi")[["License"]]`
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = FALSE,
  warning = FALSE,
  eval = TRUE,
  message = FALSE,
  fig.width = 8
)
options(width = 100)
```
# Introduction {#introduction}
This vignette introduces the usage of the `r BiocStyle::Biocpkg("GeDi")` package
for exploring the results of functional annotation and enrichment analyses.
`r BiocStyle::Biocpkg("GeDi")` is a versatile package designed to simplify the
exploration and comprehension of functional annotation and enrichment analysis
results. It offers a `r BiocStyle::CRANpkg("shiny")` application that combines
interactivity, visualization, and reproducibility to help condense these results
into an interpretable overview.
To incorporate `r BiocStyle::Biocpkg("GeDi")` into your workflow, you'll need
the results of a functional annotation or enrichment analysis. This vignette
demonstrates the core functionalities of `r BiocStyle::Biocpkg("GeDi")` using a
publicly available dataset from Alasoo et al., as described in their paper
"Shared genetic effects on chromatin and gene expression indicate a role for
enhancer priming in immune response" [@Alasoo2018].
Accessible through the `r BiocStyle::Biocpkg("macrophage")` Bioconductor package,
this dataset comprises files generated from Salmon quantification (version
0.12.0, with Gencode v29 reference) and gene-level summarized values.
Within the `r BiocStyle::Biocpkg("macrophage")` experimental setup, samples
derive from six different donors under four distinct conditions: naive, treated
with Interferon gamma, with SL1344, or with a combination of Interferon gamma
and SL1344. For illustration, we will focus on comparing Interferon
gamma-treated samples with naive samples.
# Getting started {#gettingstarted}
Before you can start using GeDi, the package needs to be installed on your
machine. To install the package, begin by opening R and executing the following
command:
```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("GeDi")
```
Once installed, the package can be loaded and attached to your current workspace
as follows:
```{r loadlib}
library("GeDi")
```
With the package attached, you can start the application by running
`GeDi()`.
```{r launchapp, eval=FALSE}
GeDi()
```
This action will open the application, directing you to the **Welcome** page.
From there, you can easily provide your data using the **Data Input** panel on
the left side menu, ensuring it's in the correct format for analysis.
Alternatively, you can initiate the application by executing:
```{r launchappwithData, eval=FALSE}
GeDi(
  genesets = geneset_df,
  ppi = ppi_df,
  distance_scores = distance_scores_df
)
```
where
- `geneset_df` represents your input data in the form of a `data.frame`, which
should include at least one column named "Genesets" containing geneset
identifiers and one column named "Genes" containing a comma-separated list of
genes belonging to each respective geneset.
- `ppi_df` is another `data.frame` containing protein-protein interaction scores,
with columns named "from", "to", and "combined_score".
- `distance_scores_df` is a sparse `Matrix` containing the distance scores of
the genesets in your data.
All of these parameters are optional, as you can alternatively upload, download,
and compute them directly within the application. However, some of these
processes may require a significant amount of time, especially with larger
datasets. Therefore, it may be advantageous to save the intermediate results,
such as the downloaded PPI and computed distance scores, for later use within
the application.
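
As a minimal sketch of the expected input, the following chunk builds a toy `geneset_df` with the two required columns and shows one possible way to cache an object between sessions. The geneset identifiers, gene lists, and the temporary file are invented for illustration, and `saveRDS()`/`readRDS()` are only one of several ways to store such objects.

```{r sketch-input-format, eval=FALSE}
# Toy input in the format described above (identifiers and genes are invented)
geneset_df <- data.frame(
  Genesets = c("GO:0000001", "GO:0000002", "GO:0000003"),
  Genes = c("IRF1,STAT1,GBP1", "STAT1,GBP1,CXCL10", "TNF,IL6"),
  stringsAsFactors = FALSE
)

# Check that the two required columns are present
stopifnot(all(c("Genesets", "Genes") %in% colnames(geneset_df)))

# Split the comma-separated gene lists into one character vector per geneset
genes_per_set <- strsplit(geneset_df$Genes, ",", fixed = TRUE)
names(genes_per_set) <- geneset_df$Genesets
lengths(genes_per_set)

# Cache an object on disk and read it back, as one would in a later session
# (the temporary file stands in for a path of your choice)
cache_file <- tempfile(fileext = ".rds")
saveRDS(geneset_df, cache_file)
geneset_df_reloaded <- readRDS(cache_file)
identical(geneset_df, geneset_df_reloaded)
```

The same pattern can be applied to a downloaded PPI or to computed distance scores before passing them back to `GeDi()` in a new session.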
In this vignette, we demonstrate the functionality of
`r BiocStyle::Biocpkg("GeDi")` using enrichment analysis results from
the `r BiocStyle::Biocpkg("macrophage")` dataset. To immediately start exploring
the application, you can simply execute:
```{r examplerun, eval=FALSE}
GeDi()
```
and load the example data with the `Load example data` button in the
**Data Input** panel.
Alternatively, you can proceed by following the subsequent code chunks to create
the necessary input objects, step by step. This can serve as a reference guide
for the steps ideally executed prior to analyzing the data with
`r BiocStyle::Biocpkg("GeDi")`.
To utilize `r BiocStyle::Biocpkg("GeDi")`, you'll require results from a
functional annotation analysis. In this vignette, we'll demonstrate how to
conduct an enrichment analysis on differentially expressed (DE) genes from the
`r BiocStyle::Biocpkg("macrophage")` dataset.
First, we'll load the macrophage data and create a `DESeqDataSet`, as the
subsequent differential expression analysis will be performed using
`r BiocStyle::Biocpkg("DESeq2")` [@Love2014].
```{r create_dds}
# Load required libraries
library("macrophage")
library("DESeq2")
# Load the example dataset "gse" from the "macrophage" package
data("gse", package = "macrophage")
# Create a DESeqDataSet object using the "gse" dataset and define the
# experimental design.
# We use the condition as part of the experimental design, because we are
# interested in the differentially expressed genes between treatments. We also
# add the line to the design to account for the inherent differences between
# the donors.
dds_macrophage <- DESeqDataSet(gse, design = ~ line + condition)
# Strip the version suffix from the Ensembl IDs used as row names
rownames(dds_macrophage) <- gsub("\\..*", "", rownames(dds_macrophage))
# Have a look at the resulting DESeqDataSet object
dds_macrophage
```
Now that we've obtained our `DESeqDataSet`, we can conduct the differential
expression (DE) analysis. In this vignette, we'll utilize the results from
comparing two distinct conditions of the dataset, specifically `IFNg` and
`naive`, while accounting for the cell line of origin.
Before executing the DE analysis, we'll filter out lowly expressed features
from the dataset. In this instance, we'll keep only genes with at least 10
counts in at least 6 samples, where 6 corresponds to the smallest group size in
the dataset.
Subsequently, we'll conduct the DE analysis and test against the null hypothesis
that the absolute log2 fold change is at most 1, so that we only identify genes
with consistent and robust changes in expression.
Finally, we'll append the gene symbols to the resultant `DataFrame`, which will
later serve as our "Genes" column in the input data for
`r BiocStyle::Biocpkg("GeDi")`.
```{r create_resde1}
# Filter genes based on read counts
# Flag the genes with at least 10 counts in at least 6 samples
keep <- rowSums(counts(dds_macrophage) >= 10) >= 6
# Subset the DESeqDataSet object to keep only the selected genes
dds_macrophage <- dds_macrophage[keep, ]
# Have a look at the resulting DESeqDataSet object
dds_macrophage
```
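
To make the filtering rule concrete, here is a small sketch on an invented count matrix with three genes and eight samples. The matrix and the gene names are illustrative only and independent of the `macrophage` data.

```{r sketch-count-filter, eval=FALSE}
# Invented count matrix: 3 genes (rows) x 8 samples (columns)
toy_counts <- rbind(
  geneA = c(25, 30, 12, 40, 18, 22, 0, 3), # at least 10 counts in 6 samples
  geneB = c(15, 11, 2, 0, 1, 3, 4, 0),     # at least 10 counts in 2 samples
  geneC = c(0, 0, 1, 0, 2, 0, 0, 1)        # never reaches 10 counts
)

# Number of samples per gene with at least 10 counts
n_samples_expressed <- rowSums(toy_counts >= 10)
n_samples_expressed

# Keep genes that reach 10 counts in at least 6 samples
keep_toy <- n_samples_expressed >= 6
toy_counts[keep_toy, , drop = FALSE]
```

Only `geneA` passes the filter here, because it is the only gene with at least 10 counts in 6 of the 8 samples.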
```{r create_resde2}
# Perform differential expression analysis using DESeq2
dds_macrophage <- DESeq(dds_macrophage)
# Extract differentially expressed genes
# Perform contrast analysis comparing "IFNg" condition to "naive" condition
# Set a log2 fold change threshold of 1 and a significance level (alpha) of 0.05
res_macrophage_IFNg_vs_naive <- results(dds_macrophage,
  contrast = c("condition", "IFNg", "naive"),
  lfcThreshold = 1, alpha = 0.05
)
# Add gene symbols to the results in a column "SYMBOL"
res_macrophage_IFNg_vs_naive$SYMBOL <- rowData(dds_macrophage)$SYMBOL
```
After completing the differential expression analysis, we move on to conduct
the functional annotation analysis. To begin, we extract the differentially
expressed (DE) genes from the previously generated results and identify the
background genes to be utilized for functional enrichment.
For the enrichment analysis, we use the overrepresentation analysis method
provided by the `r BiocStyle::Biocpkg("topGO")` package. To streamline the
integration of these results into `r BiocStyle::Biocpkg("GeDi")`, we utilize the
`topGOtable` function from the `r BiocStyle::Biocpkg("pcaExplorer")` package.
By default, this function employs the `BP` ontology and the `elim` method, which
helps decorrelate the Gene Ontology (GO) graph structure, resulting in less
redundant functional categories. The output is a `data.frame` that can be used
with `r BiocStyle::Biocpkg("GeDi")` after renaming two of its columns.
However, as `r BiocStyle::Biocpkg("GeDi")` has only minimal requirements for the
input, enrichment results generated using `r BiocStyle::Biocpkg("clusterProfiler")`
can also be utilized. While we primarily tested results from the `enrichGO`
method during `r BiocStyle::Biocpkg("GeDi")` development, those from the
`enrichKEGG` and `enrichPathway` methods are also compatible.
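
If you start from `r BiocStyle::Biocpkg("clusterProfiler")` results, a conversion along the following lines may be needed. This is a sketch on a hand-made table that only mimics the layout we assume for a data frame derived from an `enrichGO` result, namely an `ID` column and a `geneID` column with genes separated by `/`. Please check the column names and the separator of your own object before relying on it.

```{r sketch-clusterprofiler-format, eval=FALSE}
# Hand-made stand-in for an enrichment table (all values are invented)
enrich_df <- data.frame(
  ID = c("GO:0000001", "GO:0000002"),
  Description = c("toy term 1", "toy term 2"),
  geneID = c("IRF1/STAT1/GBP1", "TNF/IL6"),
  stringsAsFactors = FALSE
)

# Add the two columns expected as input: geneset identifiers and
# comma-separated gene lists
gedi_input <- enrich_df
gedi_input$Genesets <- gedi_input$ID
gedi_input$Genes <- gsub("/", ",", gedi_input$geneID, fixed = TRUE)

gedi_input[, c("Genesets", "Genes")]
```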
```{r create_resenrich1, eval=TRUE}
# Load required packages for analysis
library("pcaExplorer")
library("GeneTonic")
library("AnnotationDbi")
# Extract gene symbols from the DESeq2 results object where FDR is below 0.05
# The function deseqresult2df is used to convert the DESeq2 results to a
# dataframe format
# FDR is set to 0.05 to filter significant results
de_symbols_IFNg_vs_naive <- deseqresult2df(res_macrophage_IFNg_vs_naive,
                                           FDR = 0.05)$SYMBOL
# Extract gene symbols for background using the DESeq2 results object
# Filter genes that have nonzero counts
bg_ids <- rowData(dds_macrophage)$SYMBOL[rowSums(counts(dds_macrophage)) > 0]
```
```{r create_resenrich2, eval=TRUE}
# Load required packages for analysis
library("topGO")
library("org.Hs.eg.db")
# Perform Gene Ontology enrichment analysis using the topGOtable function from
# the "pcaExplorer" package
macrophage_topGO_example <-
  pcaExplorer::topGOtable(de_symbols_IFNg_vs_naive,
    bg_ids,
    ontology = "BP",
    mapping = "org.Hs.eg.db",
    geneID = "symbol",
    topTablerows = 500
  )
```
As mentioned earlier, `r BiocStyle::Biocpkg("GeDi")` expects the input to
contain at least two columns: one named "Genesets" and one named "Genes". While
this is not strictly mandatory when providing your data interactively during an
application session, it becomes necessary if you intend to initiate the
application with your input as parameters (e.g.,
`GeDi(genesets = my_genesets_df)`). In such cases, the "Genesets" column should
contain identifiers for each geneset in the input, while the "Genes" column
should consist of comma-separated lists of genes associated with each geneset.
Therefore, we will adjust the column names of the resulting `data.frame` from
the enrichment analysis to adhere to the required format.
```{r renamecolumns, eval=TRUE}
# Rename columns in the macrophage_topGO_example dataframe
# Change the column name "GO.ID" to "Genesets"
names(macrophage_topGO_example)[names(macrophage_topGO_example) == "GO.ID"] <- "Genesets"
# Change the column name "genes" to "Genes"
names(macrophage_topGO_example)[names(macrophage_topGO_example) == "genes"] <- "Genes"
```
## All set!
Now that we've obtained functional annotation results from the
`r BiocStyle::Biocpkg("macrophage")` dataset, we can begin exploring the data
using `r BiocStyle::Biocpkg("GeDi")`. You have two options: you can either launch
the application with `GeDi()` and provide the generated data in the
**Data Input** panel, or, if you've followed this vignette, you can start the
application directly with the loaded data by executing
`GeDi(genesets = macrophage_topGO_example)`.
```{r dryrun, eval=FALSE}
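# Option 1: start the app and provide the data in the Data Input panel
GeDi()
# Option 2: start the app with the enrichment results already loaded
GeDi(genesets = macrophage_topGO_example)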
```
Either call opens the application on the **Welcome** page, which serves as the
entry point to `r BiocStyle::Biocpkg("GeDi")`. The **Welcome** page gives an
overview of the application's features, offers guidance on how to navigate the
application, and highlights key components such as data input options,
visualization tools, and interactive features.
# Description of the `GeDi` user interface {#userinterface}
The `r BiocStyle::Biocpkg("GeDi")` application is developed with the
`r BiocStyle::CRANpkg("shiny")` framework and uses the dashboard layout of the
`r BiocStyle::CRANpkg("bs4Dash")` package, which is built upon Bootstrap 4.
The interface consists of a header (navbar), a sidebar, and a body, which are
described below.
## Header (navbar)
The dashboard navbar in `r BiocStyle::Biocpkg("GeDi")`, referred to as such in
the `r BiocStyle::CRANpkg("bs4Dash")` framework, features a dropdown menu
accessible by clicking on the respective "info" icon. The menu offers additional
functionality through various buttons:
- The open book icon - This option allows users to explore the
`r BiocStyle::Biocpkg("GeDi")` vignette, either the version bundled with the
package or the online version, providing detailed documentation and usage
guidelines.
- The information circle icon - Selecting this option displays information
about the current session, presenting details such as the R environment and
loaded packages, helpful for troubleshooting and debugging purposes.
- The heart button - This button offers general information about
`r BiocStyle::Biocpkg("GeDi")`, including links to its development version for
contribution and guidelines on citing the tool in research publications.
Besides the two dropdown menus, users can also find the `Bookmark` button in the
Navbar. The `Bookmark` button in the `r BiocStyle::Biocpkg("GeDi")` navbar serves
as a convenient tool for users to save and bookmark genes and genesets of
interest for later reference. To use this feature, users must first select or
click on a gene or geneset that they wish to bookmark. Once the desired gene or
geneset is selected, users can then click on the `Bookmark` button to add it to
a list of bookmarked items within the `r BiocStyle::Biocpkg("GeDi")` application.
This functionality enables users to organize and revisit specific genes or
genesets that they find relevant or intriguing during their exploration of
functional annotation and enrichment analysis results. The bookmarked genes and
genesets can later be found in the **Report** panel.
## Sidebar
By clicking the menu bar icon on the left side of the app (or simply by moving
the mouse over to the left side if viewing the app in full screen mode), users
can activate the sidebar menu. This sidebar menu serves as the primary means of
accessing the various panels of the `r BiocStyle::Biocpkg("GeDi")` application,
providing navigation to different functionalities. More detailed explanations of
each panel will be provided in the next section.
## Body
The structure of `r BiocStyle::Biocpkg("GeDi")` is designed around different
panels, each of which becomes active upon clicking the corresponding icons or
text in the sidebar.
While the Welcome panel is relatively self-explanatory, additional information
and explanations are provided for the functionality of the remaining panels. For
new users seeking guidance, there's a question circle button available to
initiate an interactive tour of `r BiocStyle::Biocpkg("GeDi")`. This tour allows
users to learn the basic usage mechanisms by actively engaging with the
interface. During the tour, specific elements are highlighted in response to
user actions, while the rest of the UI remains shaded to maintain focus. Users
can interrupt the tour at any time by clicking outside the highlighted window,
and navigation between steps is facilitated by arrow buttons (left, right). The
tour functionality is implemented using the `r BiocStyle::CRANpkg("rintrojs")`
package.
# The `GeDi` functionality {#functionality}
The `r BiocStyle::Biocpkg("GeDi")` `r BiocStyle::CRANpkg("shiny")` application is
organized into distinct panels, each serving a specific purpose, which will be
thoroughly explored in the following sections.
## The Welcome panel
This panel serves as a guide for utilizing `r BiocStyle::Biocpkg("GeDi")`
effectively. It offers detailed instructions on generating input data for the
application, elucidating the expected input format and outlining the various
interactive elements present in the app's other panels.
```{r welcome-page2, fig.align = "center", fig.cap = "The Welcome panel of GeDi", echo = FALSE}
knitr::include_graphics("Welcome_page.png")
```
## The Data Input panel
This panel serves as a hub for managing data input if it's not provided within
the function call. It's divided into distinct boxes, each representing a step of
the data input process, which sequentially appear as you successfully complete
each preceding step.
**Step 1**: Provide your Genesets as input data
In the initial **Step 1** box, you can provide your data by utilizing the
**Browse** button. This action opens a modal window enabling you to select the
relevant file from your computer storage. After successfully loading the data, a
preview is displayed in the **Genesets preview** box on the right. During this
step, the application checks if your input contains the "Genesets" and "Genes"
columns. If these columns are missing, a small error message appears in the lower
right corner. Additionally, two drop-down menus allow you to select the correct
columns from your data and update the input accordingly.
You also have the option to start using `r BiocStyle::Biocpkg("GeDi")` with
preprocessed example data based on the `r BiocStyle::Biocpkg("macrophage")`
dataset. Simply click the **Load demo data** button to load the example data's
enrichment results. You can explore these results in the **Genesets preview** box.
However, instead of loading demo data and observing the expected data structure
through the **Genesets preview** box, you can also use the
**Have a look at the data structure** button. By clicking this button, a modal
window with a visual representation of the expected input data structure will
open. This screenshot serves as a helpful guide, providing you with a clear
understanding of how your data should be formatted for optimal compatibility
with `r BiocStyle::Biocpkg("GeDi")`.
Once you have successfully loaded some data, the data input process will proceed
and two additional boxes will be displayed in the panel.
```{r data-input-step1, fig.align = "center", fig.cap = "The Data input panel - Step 1", echo = FALSE}
knitr::include_graphics("Data_Input_panel_Step1.png")
```
**Optional Filtering Step**: Filter generic genesets
The first new box, the **Optional Filtering Step**, lets you refine your geneset
selection. It is not required for data exploration, but it can noticeably reduce
the runtime of downstream processing, and excluding large, generic genesets
often makes the results easier to interpret.
The box features a histogram of the geneset sizes, which provides visual context
for the filtering process. Two input fields are available. The left input field
lets you select individual genesets by their identifiers in the "Genesets"
column of your dataset. The right input field lets you set a threshold "x" to
select all genesets with a size greater than or equal to "x".
Once you've chosen the genesets you wish to exclude from your dataset, you can
initiate the filtering process by clicking the `Remove the selected Genesets`
button. This action will remove all selected genesets from the dataset.
Additionally, you have the option to save the filtered data using the
`Download the filtered data` button. Clicking this button will save the filtered
data to your local machine. This feature can be particularly beneficial for users
who intend to revisit their data in a new instance of
`r BiocStyle::Biocpkg("GeDi")` and want to ensure that previously identified
uninsightful genesets have already been filtered out.
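
Outside the app, the size-based filter can be approximated in a few lines of base R. The sketch below uses an invented input table and a hypothetical threshold `x`. It illustrates the rule described above and is not the code used by the application.

```{r sketch-size-filter, eval=FALSE}
# Invented geneset table in the expected input format
toy_genesets <- data.frame(
  Genesets = c("set_small", "set_medium", "set_large"),
  Genes = c("A,B", "A,B,C,D", "A,B,C,D,E,F,G,H"),
  stringsAsFactors = FALSE
)

# Geneset size = number of comma-separated genes
set_sizes <- lengths(strsplit(toy_genesets$Genes, ",", fixed = TRUE))
names(set_sizes) <- toy_genesets$Genesets
set_sizes

# Remove all genesets with a size greater than or equal to the threshold x
x <- 8
filtered_genesets <- toy_genesets[set_sizes < x, , drop = FALSE]
filtered_genesets$Genesets
```

With `x <- 8`, only `set_large` (8 genes) is removed, while the two smaller genesets are kept.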
```{r optional-filtering, fig.align = "center", fig.cap = "Optional Filtering Step", echo = FALSE}
knitr::include_graphics("Optional_Filtering.png")
```
**Step 2**: Species Selection
In the second box, labeled **Step 2**, you select the species associated with
your dataset. The species is needed to compute the **pMM score** within
`r BiocStyle::Biocpkg("GeDi")`, which relies on a
**Protein-Protein Interaction (PPI)** matrix. This matrix captures the strength
of protein interactions and thereby adds biological context to the distance
scores. Clicking the input field opens a dropdown menu with preselected species
options. If your species is included, simply make your selection. If your
species is not listed, you can enter it manually. In cases of uncertainty, the
link provided on the right directs you to the STRING database, where you can
verify the species details and the availability of a PPI.
```{r species-selection, fig.align = "center", fig.cap = "Species Selection", echo = FALSE}
knitr::include_graphics("Species_Selection.png")
```
**Step 3**: PPI Matrix Download
Following species selection, a third box named **Step 3** will emerge. In this
phase, you have the opportunity to download the Protein-Protein Interaction (PPI)
matrix. This process may necessitate some time, with a progress bar positioned
in the lower right corner providing real-time updates on the download status.
Once the download is complete, you can conveniently preview the PPI matrix
within the **PPI Preview** box situated on the right-hand side of the interface.
This will show that the PPI consists of three columns: **Gene1** and **Gene2**,
housing the gene symbols corresponding to the interacting proteins, and a column
labeled **combined_score**, denoting the confidence level of each interaction.
The assigned score is derived from the number of known interactions between two
proteins, normalized to the [0, 1] interval using the formula:
$$
\begin{aligned}
\text{combinedScore} = \frac{\#\text{interactions} - \min}{\max - \min}
\end{aligned}
$$
where **min** and **max** represent the minimum and maximum number of
interactions, respectively.
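
The normalization can be reproduced on a toy vector. The raw values below are invented, and the helper `min_max_scale()` is a hypothetical name introduced here for illustration. It is not a function exported by `GeDi`.

```{r sketch-minmax, eval=FALSE}
# Min-max scaling following the formula above
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Invented raw interaction values for five protein pairs
raw_scores <- c(150, 400, 900, 650, 150)
scaled_scores <- min_max_scale(raw_scores)
scaled_scores

# The smallest raw value maps to 0 and the largest to 1
range(scaled_scores)
```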
In addition to downloading a PPI matrix during the current session, users can
also upload a previously saved matrix for analysis using the **Browse** button.
This functionality allows users to work with their own customized datasets or
previously analyzed PPI matrices. Furthermore, saving the downloaded PPI matrix
locally enables users to store the data on their machine for future use. By
saving the matrix locally via the **Save PPI matrix** button, users can access
the data quickly in subsequent sessions without having to wait for the download
process again. This capability significantly enhances workflow efficiency and
allows for seamless continuation of analysis across different sessions.
```{r download-ppi, fig.align = "center", fig.cap = "Downloading the PPI", echo = FALSE}
knitr::include_graphics("Downloading_PPI.png")
```
The final two steps are optional, because the PPI matrix is only required for a
single score, the pMM score. You can therefore start exploring your data without
completing these additional steps.
Upon concluding the essential tasks outlined in this panel, you are ready to
progress to the **Distance Scores** panel.
## The Distance Scores panel
This panel focuses mainly on computing distance scores for the provided input
data. Like the preceding panel, it is segmented into two distinct sections,
each serving a specific function.
```{r distance-score, fig.align = "center", fig.cap = "The Distance Score panel", echo = FALSE}
knitr::include_graphics("Distance_Score_panel.png")
```
**Calculating Distance Scores**
In the upper box, titled **Calculate distance scores for your Genesets**, you
have the flexibility to select from various distance scores for computation.
This feature provides users with a range of options to tailor the analysis
according to their specific requirements and preferences. The available scores
are:
* **pMM Score**: This score integrates protein-protein interaction (PPI) data
into the Meet-Min distance. The PPI-weighted Meet-Min (**pMM**) score is defined
as
$$
\begin{aligned}
pMM = \min\left(pMM(A \rightarrow B), pMM(B \rightarrow A)\right)
\end{aligned}
$$
where
$$
\begin{aligned}
pMM(A \rightarrow B) = 1 - \frac{|A \cap B|}{\min(|A|, |B|)} - \frac{\alpha}{\min(|A|, |B|)} \sum_{a \in A - B} \frac{w \sum_{b \in A \cap B} P(a, b) + \sum_{b \in B - A} P(a, b)}{\max(P) \left(w |A \cap B| + |B - A|\right)}
\end{aligned}
$$
and
$$
\begin{aligned}
w = \frac{\min(|A|, |B|)}{|A| + |B|}
\end{aligned}
$$
Here, $P(a, b)$ is the PPI score between the genes $a$ and $b$, and $\alpha$ is
a scaling factor between 0 and 1. The PPI matrix can be downloaded from the
**Data Input** panel. More details can be found in the paper by Yoon et al.
[@Yoon2019].
* **Kappa Score**: The **Kappa** distance is a set-based metric based on observed
and expected agreement rates between two genesets. It is defined as
$$
\begin{aligned}
Kappa = 1 - \frac{O - E}{1 - E}
\end{aligned}
$$
where
$$
\begin{aligned}
O = \frac{|A \cap B| + |A \cup B|^c}{U} \\
E = \frac{|A| |B| + |A^c| |B^c|}{|U|^2}
\end{aligned}
$$
U is the set of all unique genes in the data. In this application the Kappa
distance is additionally normalized to the (0, 1) interval to make it comparable
to the remaining distance metrics.
* **Jaccard Score**: The **Jaccard** distance uses the Jaccard coefficient, which
is transformed into a distance metric by subtracting it from 1. It is defined as
$$
\begin{aligned}
Jaccard = 1 - \frac{|A \cap B|}{|A \cup B|}
\end{aligned}
$$
* **Meet-Min Score**: The **Meet-Min** (MM) distance is based on the overlap
coefficient, a similarity measure which is defined as
$$
\begin{aligned}
OC = \frac{|A \cap B|}{\min(|A|, |B|)}
\end{aligned}
$$
In order to transform this measure of similarity into a measure of distance,
the overlap coefficient is subtracted from 1, which gives the Meet-Min (MM)
distance as
$$
\begin{aligned}
MM = 1 - \frac{|A \cap B|}{\min(|A|, |B|)}
\end{aligned}
$$
As a solely set based measurement, the Meet-Min distance only takes the
composition of the genesets into account but not the underlying biological
information inherent in the genesets.
* **Sorensen-Dice**: The **Sorensen-Dice** distance uses the Sorensen-Dice
coefficient, which is transformed into a distance metric by subtracting it
from 1. It is defined as
$$
\begin{aligned}
\text{Sorensen-Dice}(A, B) = 1 - \frac{2 |A \cap B|}{|A| + |B|}
\end{aligned}
$$
As a solely set based measurement, the Sorensen-Dice distance only takes the
composition of the genesets into account but not the underlying biological
information inherent in the genesets.
* **GO distance**: The **GO distance** score measures the relationship between
gene sets that are represented by GO terms. The
`r BiocStyle::Biocpkg("GOSemSim")` R package implements two main types of
methods: information content (IC)-based methods (e.g., Resnik, Lin, Schlicker,
and Jiang) and graph-based methods (e.g., Wang). These methods compute
similarity scores based on shared characteristics, such as the most informative
common ancestor in IC-based methods or the hierarchical structure of the GO
database in graph-based methods. To integrate these scores into distance-based
analyses, the similarity scores are converted into distance scores by
subtracting the similarity score from 1. This transformation ensures
compatibility with other distance metrics used in
`r BiocStyle::Biocpkg("GeDi")`. While applicable only to GO terms, this approach
is particularly useful in gene function analyses.
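
To make the set-based definitions concrete, the following sketch evaluates the Jaccard, Meet-Min, Sorensen-Dice, and Kappa distances for two invented genesets within an invented universe of ten genes. It follows the formulas as written above and reads the Kappa agreement term as the number of genes in both sets plus the number of genes in neither set. The additional Kappa normalization applied in the app, the pMM score, and the GO distance are not reproduced here.

```{r sketch-set-distances, eval=FALSE}
# Two invented genesets and an invented universe U of ten genes
A <- c("G1", "G2", "G3", "G4")
B <- c("G3", "G4", "G5")
U <- paste0("G", 1:10)

n_A <- length(A)
n_B <- length(B)
n_U <- length(U)
n_int <- length(intersect(A, B)) # genes in both sets: 2
n_uni <- length(union(A, B))     # genes in at least one set: 5

# Distances derived from similarity coefficients (1 - similarity)
jaccard <- 1 - n_int / n_uni
meet_min <- 1 - n_int / min(n_A, n_B)
sorensen_dice <- 1 - 2 * n_int / (n_A + n_B)

# Kappa distance from observed (O) and expected (E) agreement
O <- (n_int + (n_U - n_uni)) / n_U
E <- (n_A * n_B + (n_U - n_A) * (n_U - n_B)) / n_U^2
kappa_dist <- 1 - (O - E) / (1 - E)

c(
  Jaccard = jaccard, MeetMin = meet_min,
  SorensenDice = sorensen_dice, Kappa = kappa_dist
)
```

Because the Meet-Min distance divides by the size of the smaller set, it yields the smallest value of the three overlap-based distances for this pair.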
Each scoring method possesses its own set of advantages and drawbacks,
underscoring the importance of selecting one that suits your dataset
characteristics and analysis goals. Upon choosing a score, the
**Compute the distances between genesets** button appears on the right
side. Clicking this button initiates the scoring procedure, which may require
some time to execute, particularly for larger datasets. To monitor the progress
of this operation, refer to the progress bar located in the lower right corner
of the panel. Once the scoring process concludes, you can delve into the
**Geneset Distance Scores** box to explore a variety of visual representations
of your data.
**Distance Scores Visualizations**
* **Distance Scores Heatmap**: The first visualization is a heatmap of the
distance scores. The heatmap is generated by clicking the
**Calculate Distance Score Heatmap** button. Following computation, users can
hover over the heatmap to reveal the involved genesets and their corresponding
scores, and zoom in on specific areas of interest. To reset the zoomed view,
a simple click outside the heatmap area suffices.
* **Distance Scores Dendrogram**: The second visualization provided is a
dendrogram showcasing individual distance scores. Hierarchical clustering is
employed to generate the dendrogram, which effectively groups genesets exhibiting
the highest similarity. To change how clusters are merged in the dendrogram,
users can select a different linkage method using the drop-down menu located on
the left side.
* **Distance Scores Graph**: The final visualization available is the network
representation of distance scores. In this representation, nodes/genesets with
scores below a predefined threshold are connected by edges. By default, the
threshold is set to 0.3, but users can adjust it via the slider located on the
left. This interactive graph allows users to hover over or click on nodes to
highlight connected nodes and obtain additional information about genesets upon
selection. Furthermore, users can search for specific genesets using the input
field on the left, with the selected geneset being subsequently highlighted in
the graph. The **Graph metrics** table at the bottom of this box lists several
metrics for each node, such as degree, betweenness, harmonic centrality, and
clustering coefficient, alongside the input data. This table helps users relate
the structure of the graph to the underlying data and distance scores.
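The dendrogram and graph views described above can be approximated outside the app with a few lines of base R. This is only a sketch under stated assumptions: the distance matrix is made up, average linkage is one arbitrary choice among the agglomeration methods accepted by `hclust()`, and the app may build its graph and metrics differently.

```{r sketch-dendrogram-graph, eval = FALSE}
# Toy symmetric distance matrix for four hypothetical genesets
toy_dist <- matrix(
  c(
    0.0, 0.2, 0.9, 0.8,
    0.2, 0.0, 0.7, 0.9,
    0.9, 0.7, 0.0, 0.1,
    0.8, 0.9, 0.1, 0.0
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("set", 1:4), paste0("set", 1:4))
)

# Dendrogram: hierarchical clustering with average linkage
toy_hc <- hclust(as.dist(toy_dist), method = "average")
# plot(toy_hc) would draw the dendrogram; here we only cut it into two groups
toy_groups <- cutree(toy_hc, k = 2)
toy_groups

# Graph: connect two genesets when their distance is below the threshold
threshold <- 0.3
toy_adj <- toy_dist < threshold
diag(toy_adj) <- FALSE

# Degree of each node, i.e. the number of genesets it is connected to
toy_degree <- rowSums(toy_adj)
toy_degree
```

Lowering the threshold removes edges and yields a sparser graph, while raising it connects more genesets.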
**Bookmarking from this panel:**
As you navigate through the distance scores of genesets in this section, you
may encounter genesets and interactions that capture your interest and merit
further investigation. To preserve these genesets for later exploration, use
the **Bookmark** button in the Navbar. Clicking this button adds the selected
geneset to the list of bookmarked genesets within the **Report** panel.
Informative messages displayed in the lower right corner will guide you through
the bookmarking process.
Once you've finished exploring the distance scores, you can proceed to the
**Clustering Graph** panel.
## The Clustering Graph panel
This panel is dedicated to the computation of clusters among genesets based on
their similarity, which is derived from the previously calculated distance
scores. Similar to the preceding panel, it comprises two distinct boxes. Within
these boxes, users can access functionalities to determine and visualize
clusters of genesets that exhibit comparable characteristics or functions.
The computation of clusters involves grouping genesets that display similar
patterns of distance scores, thereby indicating shared biological characteristics
or functional relationships. This clustering process enables users to identify
cohesive groups of genesets with related functionalities or involvement in
similar biological processes.
```{r clustering-graph, fig.align = "center", fig.cap = "The Clustering Graph panel", echo = FALSE}
knitr::include_graphics("Clustering_Graph_Panel.png")
```
**Choosing a Clustering algorithm**
The upper box, labeled **Select the clustering method**, provides a selection of
distinct clustering algorithms. Users can explore various options to find the
most suitable algorithm for their analysis:
* **Louvain**: The Louvain algorithm, a prevalent tool in biological network
analysis, seeks to divide graph nodes into clusters to optimize the modularity
metric. This metric gauges the strength of connections within clusters relative
to those between clusters. Consequently, nodes within the same cluster exhibit
greater similarity to one another than to nodes outside the cluster. This
clustering approach aims to enhance data interpretation by grouping similar
genesets together. Users can adjust a slider in the bottom left corner of the
box to set a similarity threshold, determining when genesets are considered
similar based on distance scores.
* **Markov**: The Markov algorithm, commonly employed in biological network
analysis, is designed to pinpoint densely interconnected regions within graphs.
These regions frequently align with communities or clusters in the graph
structure. Users can utilize a slider located in the bottom left corner of the
box to specify a similarity threshold, determining when genesets are deemed
similar based on distance scores.
* **Fuzzy clustering**: The Fuzzy Clustering algorithm is a computational
technique used to partition data points into clusters based on their similarity,
while allowing for data points to belong to multiple clusters with varying
degrees of membership. It operates through distinct steps and requires the
specification of different thresholds. Firstly, the **Similarity threshold** is
set to determine if two genesets exhibit sufficient similarity to be potentially
clustered together. Secondly, the **Membership threshold** dictates how many
members of a potential cluster must possess a close relationship, defined by a
distance score less than or equal to the similarity threshold, for the cluster
to persist. Lastly, the **Clustering threshold** determines whether two clusters
will be merged. Clusters are merged if their percentage of overlap meets or
exceeds the clustering threshold. Users can adjust all thresholds using sliders
provided in the interface.
* **PAM**: The PAM (Partitioning Around Medoids) clustering algorithm
partitions nodes into **k** distinct clusters, where **k** is a user-defined
parameter. Each cluster is represented by a medoid, an actual geneset chosen as
its center. The algorithm iteratively assigns each node to the nearest medoid
based on the calculated distance scores, and then updates the medoids to
minimize the total distance between nodes and their medoid. Users can
specify the number of clusters, **k**, using a slider in the interface,
allowing them to tailor the clustering process to the needs of their analysis.
Adjusting the value of k enables the exploration of different clustering
granularities, providing flexibility in interpreting the data and identifying
meaningful patterns.
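As an illustration of the PAM idea outside the app, the following sketch clusters a toy distance matrix with `pam()` from the `cluster` package, which ships with standard R installations. Whether the app relies on this particular implementation is an assumption; the distance matrix and geneset names are hypothetical.

```{r sketch-pam, eval = FALSE}
# The "cluster" package is a recommended package bundled with R
library("cluster")

# Toy distance matrix: two groups of three hypothetical genesets each
toy_dist <- matrix(
  0.9,
  nrow = 6, ncol = 6,
  dimnames = list(paste0("set", 1:6), paste0("set", 1:6))
)
toy_dist[1:3, 1:3] <- 0.2
toy_dist[4:6, 4:6] <- 0.1
diag(toy_dist) <- 0

# Partition the genesets around k = 2 medoids using the precomputed distances
toy_pam <- pam(as.dist(toy_dist), k = 2, diss = TRUE)

toy_pam$clustering # cluster assignment of each geneset
toy_pam$medoids # genesets chosen as cluster representatives
```

Rerunning the sketch with a different `k` shows how the number of clusters changes the granularity of the partition.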
Once you choose a method, you can start the cluster calculation via the
**Cluster the Genesets** button on the right. Keep in mind that this step might
take some time, especially for larger datasets. Look for the progress bar in the
lower right corner for updates on the clustering status.
Once the clusters are calculated, you can explore various visualizations of your
data in the **Geneset Cluster Graphs** box.
**Cluster Visualizations**
* **Geneset Graph**: In the **Geneset Graph**, clusters are visualized as a graph,
with individual genesets serving as nodes and edges connecting genesets within
the same cluster. To highlight specific nodes, utilize the **Select by id**
feature on the left, or choose to highlight entire clusters by selecting the
respective option under **Select by cluster**. Please note that only genesets
belonging to at least one cluster will be displayed in this graph. For additional
insights, nodes can be colored based on specific parameters from your input data,
accessible through the **Color the graph by** drop-down menu. Depending on the
information provided with your data, various options will be available. While
interacting with the network, you can move nodes by clicking and dragging them
to the desired location, which is particularly useful in complex or densely
populated graphs.
* **Cluster-Geneset Bipartite Graph**: The **Cluster-Geneset Bipartite Graph**
presents a bipartite representation of the clusters. In this visualization, nodes
represent both clusters and genesets, with edges connecting cluster nodes to
their corresponding geneset members. Hovering over nodes provides additional data
insights. Cluster nodes display the members within each cluster, while geneset
nodes showcase the genes associated with each geneset.
* **Cluster Enrichment Terms Word Cloud**: The
**Cluster Enrichment Terms Word Cloud** displays the most frequently occurring
terms for each cluster. This visualization proves particularly useful when your
data includes brief descriptions of the genesets, in addition to the mandatory
input data. By utilizing the **Select a cluster** drop-down menu, you can
designate the cluster of interest. Furthermore, hovering over the word cloud
enables you to select individual terms and view the frequency with which each
term appears in the descriptions of the genesets within that cluster.
* **Clustering graph summaries**: The cluster information is also summarized in
a table-like format in the **Clustering graph summaries** box. This table
displays each geneset alongside the cluster to which it belongs. Additionally,
the table features a search function, facilitating the quick retrieval of a
geneset of interest.
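The term counts behind a word cloud can be sketched in base R as follows. This is a simplified illustration, not the app's implementation: the descriptions, cluster assignments, and the helper `cluster_terms()` are hypothetical, and a real word cloud would likely also filter out common filler words such as "to" or "of".

```{r sketch-term-frequency, eval = FALSE}
# Hypothetical geneset descriptions and cluster assignments
toy_clusters <- data.frame(
  description = c(
    "response to interferon gamma",
    "cellular response to interferon gamma",
    "regulation of cell cycle"
  ),
  cluster = c(1, 1, 2)
)

# Split the descriptions of one cluster into lower-case words and count them
# (hypothetical helper, written only for this illustration)
cluster_terms <- function(df, cluster_id) {
  descriptions <- df$description[df$cluster == cluster_id]
  words <- unlist(strsplit(tolower(descriptions), "\\s+"))
  sort(table(words), decreasing = TRUE)
}

toy_terms <- cluster_terms(toy_clusters, 1)
toy_terms
```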
**Bookmarking from this panel:**
While exploring the **Clustering Graph panel**, users may encounter genesets
and clusters that intrigue them and warrant further investigation. To facilitate
the preservation of these notable genesets and clusters for future exploration,
users can utilize the **Bookmark** button located in the Navbar. Clicking this
button will add the selected geneset or cluster to the list of bookmarked items
within the Report panel. Helpful messages displayed in the lower right corner
will assist users throughout the bookmarking process.
To bookmark a geneset or cluster of interest, select it in the
**Geneset Graph** or the **Cluster-Geneset Bipartite Graph** and click the
**Bookmark** button to add the respective information to the set of
bookmarked features.
After exploring the results in the **Clustering Graph panel**, users can proceed
to the **Report** panel to have a look at the bookmarked genesets and clusters
or iterate through the individual panels of the app for a more in-depth
exploration of the data.
## The Report panel
In this panel of the application, users can obtain a comprehensive overview of
the items they have bookmarked for further exploration. On the left side of the
interface, bookmarked genesets are listed, while bookmarked clusters are
displayed on the right side.
During an interactive exploration session, recalling specific details about each
bookmarked item can sometimes be challenging. Therefore, users are provided with
convenient options to manage their bookmarked data.
Below the interactive tables displaying bookmarked genesets and clusters, users
can find buttons allowing them to download the content of each table individually.
Additionally, the **Start the generation of the report** button is provided to
generate a detailed report encompassing all selected elements of interest.
The report generation process utilizes a predefined template report included
within the `r BiocStyle::Biocpkg("GeDi")` package. This template leverages the
input elements and reactive values associated with the bookmarks, ensuring that
the generated report contains comprehensive and relevant information.
The resulting report serves as a valuable tool for creating a permanent and
reproducible analysis output. Users can easily store or share this report for
future reference or collaboration purposes.
```{r report-panel, fig.align = "center", fig.cap = "The Report panel", echo = FALSE}
knitr::include_graphics("Report_panel.png")
```
# Additional Information {#additionalinfo}
If you have questions about the package or the available functionality, please
submit them on the Bioconductor [support site](https://support.bioconductor.org/)
using the tag 'GeDi'.
Bug reports can be opened as issues in the `r BiocStyle::Biocpkg("GeDi")`
[GitHub repository](https://github.com/AnnekathrinSilvia/GeDi/issues).
Please note that the GitHub repository also hosts the development version of the
package, where new functionality is continuously added. Be cautious, as you may
be working with cutting-edge versions!
The authors welcome thoughtful suggestions for enhancements or new features, and
even better, pull requests.
# Additional example data
In this section, we present additional examples that show how
`r BiocStyle::Biocpkg("GeDi")` can be used with functional enrichment results
based on two widely used pathway databases, KEGG and Reactome. For each
database, we run the enrichment analysis and then pass the resulting genesets
to the app.
Specifically, we demonstrate how results containing identifiers from
databases like KEGG [@Kanehisa2023] or Reactome [@Gillespie2022], e.g. generated
using the `enrichKEGG` function from the
`r BiocStyle::Biocpkg("clusterProfiler")` package or the `enrichPathway`
function from the `r BiocStyle::Biocpkg("ReactomePA")` package, can be used as
input for `r BiocStyle::Biocpkg("GeDi")`. We will again use the data of the
`r BiocStyle::Biocpkg("macrophage")` package, in particular the differentially
expressed genes we have identified before. With this data, we demonstrate how to
generate the results and prepare them for use in
`r BiocStyle::Biocpkg("GeDi")`.
However, before we can use the `enrichKEGG` function from the
`r BiocStyle::Biocpkg("clusterProfiler")` package, we have to map the Ensembl IDs
of the data to Entrez IDs. For this, we first use the
`r BiocStyle::Biocpkg("biomaRt")` package to generate a mapping from Ensembl to
Entrez IDs.
```{r withbiomart, eval = FALSE}
# Load the "biomaRt" package to access the BioMart database
library("biomaRt")
# Set up a connection to the ENSEMBL BioMart database for human genes
mart <-
  useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
# Retrieve gene annotations using the BioMart database
anns <- getBM(
  attributes = c(
    "ensembl_gene_id",
    "external_gene_name",
    "entrezgene_id",
    "description"
  ),
  filters = "ensembl_gene_id",
  values = rownames(dds_macrophage),
  mart = mart
)
# Match the retrieved annotations to the genes in dds_macrophage
anns <- anns[match(rownames(dds_macrophage), anns[, 1]), ]
```
Next, we map the differentially expressed genes to get the right identifiers and
run the `enrichKEGG` function. We set the organism to human and the p-value
cutoff to 5%.
```{r enrichKegg, eval = FALSE}
# Load the "clusterProfiler" package for functional enrichment analysis
library("clusterProfiler")
# Retrieve Entrez gene IDs from the annotations data frame based on matching
# Ensembl gene IDs from the DE results
genes <- anns$entrezgene_id[match(
  rownames(res_macrophage_IFNg_vs_naive),
  anns$ensembl_gene_id
)]
# Perform KEGG pathway enrichment analysis using the retrieved gene IDs
res_enrich <- enrichKEGG(genes,
  organism = "hsa",
  pvalueCutoff = 0.05
)
```
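Not every Ensembl ID has an Entrez counterpart, so a lookup like the one above can return `NA` values or the same Entrez ID more than once. The following base R sketch shows one possible way to tidy such a vector before the enrichment step. The mapping table and identifiers are made up for illustration, and whether this cleaning is needed depends on your data.

```{r sketch-clean-entrez, eval = FALSE}
# Hypothetical mapping table in the shape returned by the biomaRt query above
toy_anns <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrezgene_id = c(101, NA, 303, 101)
)

# Hypothetical Ensembl IDs of interest
toy_ids <- c("ENSG_A", "ENSG_B", "ENSG_D")

# Look up the Entrez IDs for the Ensembl IDs of interest
toy_genes <- toy_anns$entrezgene_id[match(toy_ids, toy_anns$ensembl_gene_id)]

# Drop unmapped (NA) entries, convert to character, and remove duplicates
toy_genes <- unique(as.character(toy_genes[!is.na(toy_genes)]))
toy_genes
```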
We can now use the results of the enrichment in `r BiocStyle::Biocpkg("GeDi")`.
If you have not computed the results following this workflow, you can load a
precomputed version from the data available in this package and start the app
directly with it.
```{r GeDi_Kegg, eval = FALSE}
# Load the "macrophage_KEGG_example" dataset from the "GeDi" package
data("macrophage_KEGG_example", package = "GeDi")
# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_KEGG_example"
# dataset
GeDi(genesets = macrophage_KEGG_example)
```
In a similar manner, we can use the Reactome database for the functional
annotation. Here, we use the `r BiocStyle::Biocpkg("ReactomePA")` package with
the same Entrez IDs of the differentially expressed genes.
```{r enrichReactome, eval = FALSE}
# Load the "ReactomePA" package for pathway enrichment analysis
library("ReactomePA")
# Perform pathway enrichment analysis using the "enrichPathway" function
reactome <- enrichPathway(genes,
  organism = "human",
  pvalueCutoff = 0.05,
  readable = TRUE
)
```
Now we can use the results in the same manner as for the KEGG pathway analysis.
```{r GeDi_Reactome, eval = FALSE}
# Load the "macrophage_Reactome_example" dataset from the "GeDi" package
data("macrophage_Reactome_example", package = "GeDi")
# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_Reactome_example"
# dataset
GeDi(genesets = macrophage_Reactome_example)
```
# FAQs {#faqs}
**Q: My configuration on two machines is somewhat different, so I am having difficulty in finding out what packages are different. Is there something to help on this?**
A: Yes, you can check out `r BiocStyle::Githubpkg("federicomarini/sessionDiffo")`,
a small utility to compare the outputs of two different `sessionInfo()` calls.
This can help you pinpoint what packages might be causing the issue.
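If you only need a quick manual check, the underlying idea can be sketched in base R by comparing two named vectors of package versions. This is not how the utility mentioned above works internally; the package names and version numbers below are invented for illustration.

```{r sketch-compare-versions, eval = FALSE}
# Hypothetical package versions recorded on two machines, e.g. collected with
# installed.packages()[, "Version"] on each of them
versions_a <- c(GeDi = "1.0.0", shiny = "1.8.0", igraph = "2.0.1")
versions_b <- c(GeDi = "1.0.0", shiny = "1.7.5", visNetwork = "2.1.2")

# Packages present on only one of the two machines
only_a <- setdiff(names(versions_a), names(versions_b))
only_b <- setdiff(names(versions_b), names(versions_a))

# Packages present on both machines but with different versions
shared <- intersect(names(versions_a), names(versions_b))
differing <- shared[versions_a[shared] != versions_b[shared]]

only_a
only_b
differing
```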
**Q: I am using a different service/software for generating the results of functional enrichment analysis. How do I plug this into `GeDi`?**
A: You can use nearly any result of a functional enrichment analysis in
`r BiocStyle::Biocpkg("GeDi")` as long as the results are transformed to fit
the input requirements. Please check out the **Welcome** page to
see the specification of the input requirements.
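As a rough sketch of such a transformation, the following base R code renames the columns of a made-up result table. The target column names `Genesets` and `Genes` are an assumption made for this example, as are all column names of the hypothetical `other_result` table; please confirm the required names and the expected gene separator on the **Welcome** page before using your own data.

```{r sketch-input-format, eval = FALSE}
# Hypothetical result table exported from some other enrichment tool
other_result <- data.frame(
  term_id = c("T1", "T2"),
  term_name = c("immune response", "cell cycle"),
  members = c("g1,g2,g3", "g4,g5")
)

# Assumed target layout: geneset identifiers in "Genesets" and the member
# genes in "Genes"; adjust names and separator to the documented requirements
toy_input <- data.frame(
  Genesets = other_result$term_id,
  Genes = other_result$members,
  Description = other_result$term_name
)
toy_input
```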
# Session Info {- .smaller}
```{r sessioninfo}
utils::sessionInfo()
```
# References {-}