---
title: >
  The `GeDi` User's Guide
author:
- name: Annekathrin Silvia Nedwed
  affiliation:
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  email: anneludt@uni-mainz.de
- name: Federico Marini
  affiliation:
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  - Center for Thrombosis and Hemostasis (CTH), Mainz
  email: marinif@uni-mainz.de
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('GeDi')`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{The GeDi User's Guide}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{GeDi}
  %\VignetteKeywords{FunctionalAnnotation, Enrichment Analysis, Distance measurements, Exploration, Visualization, GUI}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: GeDi.bib
---

**Compiled date**: `r Sys.Date()`

**Last edited**: 2024-02-29

**License**: `r packageDescription("GeDi")[["License"]]`

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = FALSE,
  warning = FALSE,
  eval = TRUE,
  message = FALSE,
  fig.width = 8
)
options(width = 100)
```

# Introduction {#introduction}

This vignette introduces the usage of the `r BiocStyle::Biocpkg("GeDi")` package for exploring the results of functional annotation and enrichment analyses.

`r BiocStyle::Biocpkg("GeDi")` is a versatile package designed to simplify the exploration and comprehension of functional annotation and enrichment analysis results. It offers a `r BiocStyle::CRANpkg("shiny")` application that combines interactivity, visualization, and reproducibility to consolidate comprehensive outcomes.

To incorporate `r BiocStyle::Biocpkg("GeDi")` into your workflow, you'll need the results of a functional annotation or enrichment analysis.

This vignette demonstrates the core functionalities of `r BiocStyle::Biocpkg("GeDi")` using a publicly available dataset from Alasoo et al., as described in their paper "Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response" [@Alasoo2018].

Accessible through the `r BiocStyle::Biocpkg("macrophage")` Bioconductor package, this dataset comprises files generated from Salmon quantification (version 0.12.0, with Gencode v29 reference) and gene-level summarized values.

Within the `r BiocStyle::Biocpkg("macrophage")` experimental setup, samples derive from six different donors under four distinct conditions: naive, treated with Interferon gamma, with SL1344, or with a combination of Interferon gamma and SL1344. For illustration, we will focus on comparing Interferon gamma-treated samples with naive samples.

# Getting started {#gettingstarted}

Before you can start using `GeDi`, the package needs to be installed on your machine.
To install the package, begin by opening R and executing the following command:

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("GeDi")
```

Once installed, the package can be loaded and attached to your current workspace as follows:

```{r loadlib}
library("GeDi")
```

With the package attached, you can start the application by running `GeDi()`.

```{r launchapp, eval=FALSE}
GeDi()
```

This will open the application and direct you to the **Welcome** page. From there, you can provide your data using the **Data Input** panel in the left side menu, making sure it is in the correct format for analysis.

Alternatively, you can start the application by executing:

```{r launchappwithData, eval=FALSE}
GeDi(
  genesets = geneset_df,
  ppi = ppi_df,
  distance_scores = distance_scores_df
)
```

where

- `geneset_df` represents your input data in the form of a `data.frame`, which should include at least one column named "Genesets" containing geneset identifiers and one column named "Genes" containing a comma-separated list of genes belonging to each respective geneset.
- `ppi_df` is another `data.frame` containing protein-protein interaction scores, with columns named "from", "to", and "combined_score".
- `distance_scores_df` is a sparse `Matrix` containing the distance scores of the genesets in your data.

All of these parameters are optional, as you can alternatively upload, download, and compute them directly within the application. However, some of these processes may require a significant amount of time, especially with larger datasets. Therefore, it may be advantageous to save the intermediate results, such as the downloaded PPI and computed distance scores, for later use within the application.

In this vignette, we demonstrate the functionality of `r #BiocStyle::Biocpkg("GeDi")` `GeDi` using enrichment analysis results from the `r BiocStyle::Biocpkg("macrophage")` dataset.
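
To make the input formats described above more concrete, the following is a minimal sketch of what `geneset_df` and `ppi_df` could look like. The geneset identifiers (`set_A`, `set_B`, `set_C`), the gene lists, and the scores are made up purely for illustration; only the column names are taken from the description above.

```{r sketch-inputs, eval=FALSE}
# Hypothetical toy genesets: identifiers and gene lists are illustrative only
geneset_df <- data.frame(
  Genesets = c("set_A", "set_B", "set_C"),
  Genes = c("STAT1,IRF1,GBP1", "IRF1,GBP1,CXCL10", "TNF,IL6"),
  stringsAsFactors = FALSE
)

# Hypothetical toy PPI table with made-up interaction scores
ppi_df <- data.frame(
  from = c("STAT1", "IRF1", "TNF"),
  to = c("IRF1", "GBP1", "IL6"),
  combined_score = c(0.9, 0.7, 0.8),
  stringsAsFactors = FALSE
)

# Check that the documented column names are present
stopifnot(all(c("Genesets", "Genes") %in% colnames(geneset_df)))
stopifnot(all(c("from", "to", "combined_score") %in% colnames(ppi_df)))

# Split the comma-separated "Genes" column into one character vector per geneset
genes_list <- strsplit(geneset_df$Genes, ",", fixed = TRUE)
names(genes_list) <- geneset_df$Genesets

# Number of genes per geneset
lengths(genes_list)
```

Objects shaped like these could then be passed to the application as `GeDi(genesets = geneset_df, ppi = ppi_df)`.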

To immediately start exploring the application, you can simply execute:

```{r examplerun, eval=FALSE}
GeDi()
```

and load the example data with the `Load example data` button in the **Data Input** panel.

Alternatively, you can follow the subsequent code chunks to create the necessary input objects step by step. This can serve as a reference for the steps ideally executed before analyzing the data with `r BiocStyle::Biocpkg("GeDi")`.

To use `r BiocStyle::Biocpkg("GeDi")`, you need results from a functional annotation analysis. In this vignette, we demonstrate how to conduct an enrichment analysis on differentially expressed (DE) genes from the `r BiocStyle::Biocpkg("macrophage")` dataset.

First, we load the macrophage data and create a `DESeqDataSet`, as the subsequent differential expression analysis will be performed using `r BiocStyle::Biocpkg("DESeq2")` [@Love2014].

```{r create_dds}
# Load required libraries
library("macrophage")
library("DESeq2")

# Load the example dataset "gse" from the "macrophage" package
data("gse", package = "macrophage")

# Create a DESeqDataSet object using the "gse" dataset and define the
# experimental design.
# We use the condition as part of the experimental design, because we are
# interested in the differentially expressed genes between treatments. We also
# add the line to the design to account for the inherent differences between
# the donors.
dds_macrophage <- DESeqDataSet(gse, design = ~ line + condition)

# Strip the version suffix so that the row names are plain Ensembl IDs
rownames(dds_macrophage) <- gsub("\\..*", "", rownames(dds_macrophage))

# Have a look at the resulting DESeqDataSet object
dds_macrophage
```

Now that we have our `DESeqDataSet`, we can conduct the differential expression (DE) analysis. In this vignette, we use the results from comparing two conditions of the dataset, `IFNg` and `naive`, while accounting for the cell line of origin.
Before executing the DE analysis, we filter out lowly expressed features from the dataset. In this instance, we keep only genes with at least 10 counts in at least 6 samples, where 6 corresponds to the smallest group size in the dataset.

Subsequently, we conduct the DE analysis and test against the null hypothesis that the absolute log2 fold change is at most 1, so that we identify genes with consistent and robust changes in expression.

Finally, we append the gene symbols to the resulting `DataFrame`, which will later serve as our "Genes" column in the input data for `r BiocStyle::Biocpkg("GeDi")`.

```{r create_resde1}
# Filter genes based on read counts
# Flag the genes with at least 10 counts in at least 6 samples
keep <- rowSums(counts(dds_macrophage) >= 10) >= 6

# Subset the DESeqDataSet object to keep only the selected genes
dds_macrophage <- dds_macrophage[keep, ]

# Have a look at the resulting DESeqDataSet object
dds_macrophage
```

```{r create_resde2}
# Perform differential expression analysis using DESeq2
dds_macrophage <- DESeq(dds_macrophage)

# Extract differentially expressed genes
# Perform contrast analysis comparing "IFNg" condition to "naive" condition
# Set a log2 fold change threshold of 1 and a significance level (alpha) of 0.05
res_macrophage_IFNg_vs_naive <- results(dds_macrophage,
  contrast = c("condition", "IFNg", "naive"),
  lfcThreshold = 1,
  alpha = 0.05
)

# Add gene symbols to the results in a column "SYMBOL"
res_macrophage_IFNg_vs_naive$SYMBOL <- rowData(dds_macrophage)$SYMBOL
```

After completing the differential expression analysis, we move on to the functional annotation analysis. To begin, we extract the differentially expressed (DE) genes from the previously generated results and identify the background genes to be used for functional enrichment.

For the enrichment analysis, we use the overrepresentation analysis method provided by the `r BiocStyle::Biocpkg("topGO")` package.
To streamline the integration of these results into `r BiocStyle::Biocpkg("GeDi")`, we use the `topGOtable` function from the `r BiocStyle::Biocpkg("pcaExplorer")` package. By default, this function employs the `BP` ontology and the `elim` method, which helps decorrelate the Gene Ontology (GO) graph structure, resulting in less redundant functional categories. The output is a `DataFrame` object that integrates directly with `r BiocStyle::Biocpkg("GeDi")`.

However, as `r BiocStyle::Biocpkg("GeDi")` has only minimal requirements for the input, enrichment results generated using `r BiocStyle::Biocpkg("clusterProfiler")` can also be used. While we primarily tested results from the `enrichGO` method during `r BiocStyle::Biocpkg("GeDi")` development, those from the `enrichKEGG` and `enrichPathway` methods are also compatible.

```{r create_resenrich1, eval=TRUE}
# Load required packages for analysis
library("pcaExplorer")
library("GeneTonic")
library("AnnotationDbi")

# Extract gene symbols from the DESeq2 results object where FDR is below 0.05
# The function deseqresult2df is used to convert the DESeq2 results to a
# dataframe format
# FDR is set to 0.05 to filter significant results
de_symbols_IFNg_vs_naive <- deseqresult2df(res_macrophage_IFNg_vs_naive, FDR = 0.05)$SYMBOL

# Extract gene symbols for background using the DESeq2 results object
# Filter genes that have nonzero counts
bg_ids <- rowData(dds_macrophage)$SYMBOL[rowSums(counts(dds_macrophage)) > 0]
```

```{r create_resenrich2, eval=TRUE}
# Load required package for analysis
library("topGO")
library("org.Hs.eg.db")

# Perform Gene Ontology enrichment analysis using the topGOtable function from
# the "pcaExplorer" package
macrophage_topGO_example <- pcaExplorer::topGOtable(de_symbols_IFNg_vs_naive,
  bg_ids,
  ontology = "BP",
  mapping = "org.Hs.eg.db",
  geneID = "symbol",
  topTablerows = 500
)
```

As mentioned earlier, `r BiocStyle::Biocpkg("GeDi")` expects the input to contain at least two columns: one named
"Genesets" and one named "Genes". While this is not strictly mandatory when providing your data interactively during an application session, it becomes necessary if you intend to initiate the application with your input as parameters (e.g., `GeDi(genesets = my_genesets_df)`). In such cases, the "Genesets" column should contain identifiers for each geneset in the input, while the "Genes" column should consist of comma-separated lists of genes associated with each geneset. Therefore, we will adjust the column names of the resulting `data.frame` from the enrichment analysis to adhere to the required format. ```{r renamecolumns, eval=TRUE} # Rename columns in the macrophage_topGO_example dataframe # Change the column name "GO.ID" to "Genesets" names(macrophage_topGO_example)[names(macrophage_topGO_example) == "GO.ID"] <- "Genesets" # Change the column name "genes" to "Genes" names(macrophage_topGO_example)[names(macrophage_topGO_example) == "genes"] <- "Genes" ``` ## All set! Now that we've obtained functional annotation results from the `r BiocStyle::Biocpkg("macrophage")` dataset, we can begin exploring the data using `r BiocStyle::Biocpkg("GeDi")`. You have two options: you can either launch the application and supply the generated data using the `GeDi()` command, or if you've followed this vignette, you can initiate the application directly with the loaded data by executing `GeDi(genesets = macrophage_topGO_example)`. ```{r dryrun, eval=FALSE} GeDi() GeDi(genesets = macrophage_topGO_example) ``` The above shown code will open the application, directing you to the **Welcome** page. The **Welcome** page of `r BiocStyle::Biocpkg("GeDi")` serves as the entry point to the application, providing users with an overview of its features and functionalities. Upon launching the application, users are greeted with a user-friendly interface designed to facilitate the exploration and interpretation of functional annotation and enrichment analysis results. 
The **Welcome** page offers guidance on how to navigate the application and highlights key components such as data input options, visualization tools, and interactive features. Whether users are new to `GeDi` or returning to explore additional datasets, the **Welcome** page serves as a central hub for accessing resources and getting started with their analysis.

# Description of the `GeDi` user interface {#userinterface}

The `r BiocStyle::Biocpkg("GeDi")` application, developed with the `r BiocStyle::CRANpkg("shiny")` framework, incorporates the design elements of the `r BiocStyle::CRANpkg("bs4Dash")` package, which is built upon Bootstrap 4. By building on `r BiocStyle::CRANpkg("shiny")` and `r BiocStyle::CRANpkg("bs4Dash")`, `r BiocStyle::Biocpkg("GeDi")` provides users with a consistent, dashboard-style interface for exploring functional annotation and enrichment analysis results.

## Header (navbar)

The dashboard navbar in `r BiocStyle::Biocpkg("GeDi")`, referred to as such in the `r BiocStyle::CRANpkg("bs4Dash")` framework, features a dropdown menu accessible by clicking on the respective "info" icon. The menu offers additional functionality through various buttons:

- The open book icon - This option allows users to explore the `r BiocStyle::Biocpkg("GeDi")` vignette, either the version bundled with the package or the online version, providing detailed documentation and usage guidelines.
- The information i circle - Selecting this option displays information about the current session, presenting details such as the R environment and loaded packages, helpful for troubleshooting and debugging purposes.
- The heart button - This button offers general information about `r BiocStyle::Biocpkg("GeDi")`, including links to its development version for contribution and guidelines on citing the tool in research publications.

Besides the two dropdown menus, users can also find the `Bookmark` button in the navbar.

The `Bookmark` button in the `r BiocStyle::Biocpkg("GeDi")` navbar lets users save genes and genesets of interest for later reference. To use this feature, users must first select or click on a gene or geneset that they wish to bookmark. Once the desired gene or geneset is selected, users can click the `Bookmark` button to add it to the list of bookmarked items within the `r BiocStyle::Biocpkg("GeDi")` application. This makes it possible to organize and revisit specific genes or genesets found to be relevant during the exploration of functional annotation and enrichment analysis results. The bookmarked genes and genesets can later be found in the **Report** panel.

## Sidebar

By clicking the menu bar icon on the left side of the app (or simply by moving the mouse over to the left side if viewing the app in full screen mode), users can activate the sidebar menu. This sidebar menu serves as the primary means of accessing the various panels of the `r BiocStyle::Biocpkg("GeDi")` application. More detailed explanations of each panel will be provided in the next section.

## Body

The structure of `r BiocStyle::Biocpkg("GeDi")` is designed around different panels, each of which becomes active upon clicking the corresponding icon or text in the sidebar.

While the Welcome panel is relatively self-explanatory, additional information and explanations are provided for the functionality of the remaining panels. For new users seeking guidance, there's a question circle button available to start an interactive tour of `r BiocStyle::Biocpkg("GeDi")`.
This tour allows users to learn the basic usage mechanisms by actively engaging with the interface. During the tour, specific elements are highlighted in response to user actions, while the rest of the UI remains shaded to maintain focus. Users can interrupt the tour at any time by clicking outside the highlighted window, and navigation between steps is facilitated by arrow buttons (left, right). The tour functionality is implemented using the `r BiocStyle::CRANpkg("rintrojs")` package.

# The `GeDi` functionality {#functionality}

The `r BiocStyle::Biocpkg("GeDi")` `r BiocStyle::CRANpkg("shiny")` application is organized into distinct panels, each serving a specific purpose, which are described in the following sections.

## The Welcome panel

This panel serves as a guide for using `r BiocStyle::Biocpkg("GeDi")` effectively. It offers detailed instructions on generating input data for the application, explains the expected input format, and outlines the interactive elements present in the app's other panels.

```{r welcome-page2, fig.align = "center", fig.cap = "The Welcome panel of GeDi", echo = FALSE}
knitr::include_graphics("Welcome_page.png")
```

## The Data Input panel

This panel serves as a hub for managing data input if it's not provided within the function call. It's divided into distinct boxes, each representing a step of the data input process, which appear sequentially as you complete each preceding step.

**Step 1**: Provide your Genesets as input data

In the initial **Step 1** box, you can provide your data by using the **Browse** button. This opens a modal window in which you can select the relevant file from your computer storage. After the data has been loaded, a preview is displayed in the **Genesets preview** box on the right. During this step, the application checks if your input contains the "Genesets" and "Genes" columns.
If these columns are missing, a small error message appears in the lower right corner. Additionally, two drop-down menus allow you to select the correct columns from your data and update the input accordingly.

You also have the option to start using `r BiocStyle::Biocpkg("GeDi")` with preprocessed example data based on the `r BiocStyle::Biocpkg("macrophage")` dataset. Simply click the **Load demo data** button to load the example data's enrichment results. You can explore these results in the **Genesets preview** box.

Instead of loading demo data and inspecting the expected data structure in the **Genesets preview** box, you can also use the **Have a look at the data structure** button. Clicking this button opens a modal window with a screenshot of the expected input data structure, which shows how your data should be formatted to be compatible with `r BiocStyle::Biocpkg("GeDi")`.

Once you have successfully loaded some data, the data input process proceeds and two additional boxes are displayed in the panel.

```{r data-input-step1, fig.align = "center", fig.cap = "The Data input panel - Step 1", echo = FALSE}
knitr::include_graphics("Data_Input_panel_Step1.png")
```

**Optional Filtering Step**: Filter generic genesets

The first new box, the **Optional Filtering Step**, lets you refine your geneset selection. While not required for data exploration, this step can notably reduce the runtime of downstream processing. Here, you can exclude large and generic genesets from your dataset, which makes the results easier to interpret. Additionally, you have the flexibility to filter genesets based on size criteria.
The box features a histogram illustrating geneset sizes, providing visual context for the filtering process. Within the interface, two input fields are available for customization. The left input field allows the selection of individual genesets by their identifiers in the "Genesets" column of your dataset. The right input field lets you set a threshold "x" to select all genesets with a size greater than or equal to "x".

Once you've chosen the genesets you wish to exclude from your dataset, you can start the filtering process by clicking the `Remove the selected Genesets` button. This action will remove all selected genesets from the dataset. Additionally, you have the option to save the filtered data to your local machine using the `Download the filtered data` button. This feature can be particularly beneficial for users who intend to revisit their data in a new instance of `r BiocStyle::Biocpkg("GeDi")` and want to ensure that previously identified uninsightful genesets have already been filtered out.
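
The filtering logic described above can also be sketched outside the application in plain R. This is only an illustration on made-up data, not the code used inside `GeDi`: the object names (`genesets_df`, `size_threshold`, `selected_ids`) and the example genesets are hypothetical.

```{r sketch-filtering, eval=FALSE}
# Hypothetical toy input in the documented format ("Genesets" and "Genes" columns)
genesets_df <- data.frame(
  Genesets = c("set_A", "set_B", "set_C"),
  Genes = c("g1,g2,g3", "g1,g2,g3,g4,g5,g6", "g7,g8"),
  stringsAsFactors = FALSE
)

# Geneset sizes: number of comma-separated genes per geneset
set_sizes <- lengths(strsplit(genesets_df$Genes, ",", fixed = TRUE))

# Assumed threshold "x": genesets with size >= x are marked for removal
size_threshold <- 5
too_large <- genesets_df$Genesets[set_sizes >= size_threshold]

# Assumed manual selection of individual genesets by identifier
selected_ids <- "set_C"

# Remove the union of both selections
to_remove <- union(selected_ids, too_large)
filtered_df <- genesets_df[!genesets_df$Genesets %in% to_remove, , drop = FALSE]
filtered_df
```

The resulting `filtered_df` could be written to disk, for example with `write.table()`, and supplied again in a later session.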

```{r optional-filtering, fig.align = "center", fig.cap = "Optional Filtering Step", echo = FALSE}
knitr::include_graphics("Optional_Filtering.png")
```

**Step 2**: Species Selection

In the second box, labeled **Step 2**, you select the species associated with your dataset. This step is important for the computation of the **pMM score** within `r BiocStyle::Biocpkg("GeDi")`, which relies on a **Protein-Protein Interaction (PPI)** matrix. This matrix captures the strength of protein interactions and thereby adds biological context to the distance scores. To access this information, specifying the species linked to your dataset is mandatory.

Clicking the input field opens a dropdown menu with preselected species options. If your species is included, simply make your selection. If your species is not listed, you can enter it manually. In cases of uncertainty, a link provided on the right directs you to the STRING database, where you can verify species details and PPI availability.

```{r species-selection, fig.align = "center", fig.cap = "Species Selection", echo = FALSE}
knitr::include_graphics("Species_Selection.png")
```

**Step 3**: PPI Matrix Download

Following species selection, a third box named **Step 3** appears. In this phase, you can download the Protein-Protein Interaction (PPI) matrix. This process may take some time; a progress bar in the lower right corner provides updates on the download status. Once the download is complete, you can preview the PPI matrix within the **PPI Preview** box on the right-hand side of the interface.
This will show that the PPI consists of three columns: **Gene1** and **Gene2**, containing the gene symbols corresponding to the interacting proteins, and a column labeled **combined_score**, denoting the confidence level of each interaction. The assigned score is derived from the number of known interactions between two proteins, normalized to the [0, 1] interval using the formula:

$$
\begin{aligned}
combinedScore = \frac{(\#interaction - min)}{(max - min)}
\end{aligned}
$$

where **min** and **max** represent the minimum and maximum number of interactions, respectively.

In addition to downloading a PPI matrix during the current session, users can also upload a previously saved matrix using the **Browse** button. This allows users to work with their own customized or previously downloaded PPI matrices. Saving the downloaded PPI matrix locally via the **Save PPI matrix** button makes the data available in subsequent sessions without having to wait for the download again.

```{r download-ppi, fig.align = "center", fig.cap = "Downloading the PPI", echo = FALSE}
knitr::include_graphics("Downloading_PPI.png")
```

The final two steps are optional, since the PPI matrix is only required for a single score, the pMM score. Therefore, you can start exploring your data without necessarily completing these additional steps.

Once you have completed the essential tasks outlined in this panel, you are ready to progress to the **Distance Scores** panel.

## The Distance Scores panel

This panel focuses mainly on computing distance scores for the provided input data. Like the preceding panel, it is divided into two distinct sections, each serving a specific function.

```{r distance-score, fig.align = "center", fig.cap = "The Distance Score panel", echo = FALSE}
knitr::include_graphics("Distance_Score_panel.png")
```

**Calculating Distance Scores**

In the upper box, titled **Calculate distance scores for your Genesets**, you can select from various distance scores for computation. The available scores are:

* **pMM Score**: This score integrates protein-protein interaction (PPI) data into the Meet-Min distance. The PPI-weighted Meet-Min (**pMM**) score is defined as

$$
\begin{aligned}
pMM = \min(pMM(A \rightarrow B), pMM(B \rightarrow A))
\end{aligned}
$$

where

$$
\begin{aligned}
pMM(A \rightarrow B) = 1 - \frac{|A \cap B|}{\min(|A|, |B|)} - \frac{\alpha}{\min(|A|, |B|)} \cdot \sum_{a \in A - B} \frac{w \cdot \sum_{b \in A \cap B} P(a, b) + \sum_{b \in B - A} P(a, b)}{\max(P) \cdot (w \cdot |A \cup B| + |B - A|)}
\end{aligned}
$$

and

$$
\begin{aligned}
w = \frac{\min(|A|, |B|)}{|A| + |B|}
\end{aligned}
$$

$\alpha$ is a scaling factor between 0 and 1. The PPI matrix can be downloaded from the **Data Input** panel. More details can be found in the paper by Yoon et al. [@Yoon2019].

* **Kappa Score**: The **Kappa** distance is a set-based metric based on observed and expected agreement rates between two genesets. It is defined as

$$
\begin{aligned}
Kappa = 1 - \frac{O - E}{1 - E}
\end{aligned}
$$

where

$$
\begin{aligned}
O = \frac{|A \cap B| + |(A \cup B)^c|}{|U|} \\
E = \frac{|A| |B| + |A^c| |B^c|}{|U|^2}
\end{aligned}
$$

$U$ is the set of all unique genes in the data. In this application the Kappa distance is additionally normalized to the (0, 1) interval to make it comparable to the remaining distance metrics.

* **Jaccard Score**: The **Jaccard** distance uses the Jaccard coefficient, which is transformed into a distance metric by subtracting it from 1.
It is defined as

$$
\begin{aligned}
Jaccard = 1 - \frac{|A \cap B|}{|A \cup B|}
\end{aligned}
$$

* **Meet-Min Score**: The **Meet-Min** (MM) distance is derived from the overlap coefficient, a similarity measure which is defined as

$$
\begin{aligned}
OC = \frac{|A \cap B|}{\min(|A|, |B|)}
\end{aligned}
$$

To transform this measure of similarity into a measure of distance, the overlap coefficient is subtracted from 1, which yields the Meet-Min (MM) distance

$$
\begin{aligned}
MM = 1 - \frac{|A \cap B|}{\min(|A|, |B|)}
\end{aligned}
$$

As a solely set-based measurement, the Meet-Min distance only takes the composition of the genesets into account but not the underlying biological information inherent in the genesets.

* **Sorensen-Dice**: The **Sorensen-Dice** distance uses the Sorensen-Dice coefficient, which is transformed into a distance metric by subtracting it from 1. It is defined as

$$
\begin{aligned}
\text{Sorensen-Dice}(A, B) = 1 - \frac{2 \cdot |A \cap B|}{|A| + |B|}
\end{aligned}
$$

As a solely set-based measurement, the Sorensen-Dice distance only takes the composition of the genesets into account but not the underlying biological information inherent in the genesets.

* **GO distance**: The **GO distance** score measures the relationship between genesets that are represented by GO terms. The underlying similarity measures are implemented in the `r BiocStyle::Biocpkg("GOSemSim")` R package and fall into two main types: information content (IC)-based methods (e.g., Resnik, Lin, Schlicker, and Jiang) and graph-based methods (e.g., Wang). These methods compute similarity scores based on shared characteristics, such as the most informative common ancestor in IC-based methods or the hierarchical structure of the GO database in graph-based methods. To integrate these scores into distance-based analyses, the similarity scores are converted into distance scores by subtracting the similarity score from 1.
This transformation ensures compatibility with other distance metrics used in `r BiocStyle::Biocpkg("GeDi")`. While applicable only to GO terms, this approach is particularly useful in gene function analyses.

Each scoring method has its own advantages and drawbacks, so it is important to select one that suits your dataset characteristics and analysis goals. Upon choosing a score, the **Compute the distances between genesets** button appears on the right side. Clicking this button starts the scoring procedure, which may take some time, particularly for larger datasets. To monitor the progress of this operation, refer to the progress bar located in the lower right corner of the panel.

Once the scoring process concludes, you can use the **Geneset Distance Scores** box to explore a variety of visual representations of your data.

**Distance Scores Visualizations**

* **Distance Scores Heatmap**: The first visualization is a heatmap illustrating the distribution of distance scores. The heatmap is generated by clicking the **Calculate Distance Score Heatmap** button. Following computation, users can interact with the heatmap by hovering over it, revealing the involved genesets and their corresponding scores. Additionally, users can zoom in on specific areas of interest. To reset the zoomed view, a simple click outside the heatmap area suffices.

* **Distance Scores Dendrogram**: The second visualization is a dendrogram of the individual distance scores. Hierarchical clustering is used to generate the dendrogram, which groups the genesets with the highest similarity. Users can select different combination methods for the clustering using the drop-down menu located on the left side.

* **Distance Scores Graph**: The final visualization available is the network representation of distance scores.
In this representation, nodes/genesets with scores below a predefined threshold are connected by edges. By default, the threshold is set to 0.3, but users can adjust it via the slider located on the left. This interactive graph allows users to hover over or click on nodes to highlight connected nodes and obtain additional information about genesets upon selection. Furthermore, users can search for specific genesets using the input field on the left, with the selected geneset being subsequently highlighted in the graph.

The **Graph metrics** table at the bottom of this box contains various metrics pertaining to the graph, such as degree, betweenness, harmonic centrality, and clustering coefficient, as well as the input data. This table gives users additional insight into the underlying data and distance scores.

**Bookmarking from this panel:**

As users navigate through the distance scores of genesets in this section, they may encounter genesets and interactions that merit further investigation. To keep track of these genesets for later exploration, you can use the **Bookmark** button in the navbar. Upon clicking this button, the selected geneset is added to the list of bookmarked genesets within the **Report** panel. Additionally, informative messages displayed in the lower right corner guide users through the bookmarking process.

Once you've finished exploring the distance scores, you can proceed to the **Clustering Graph** panel.

## The Clustering Graph panel

This panel is dedicated to the computation of clusters among genesets based on their similarity, which is derived from the previously calculated distance scores. Similar to the preceding panel, it comprises two distinct boxes. Within these boxes, users can access functionalities to determine and visualize clusters of genesets that exhibit comparable characteristics or functions.
The computation of clusters involves grouping genesets that display similar patterns of distance scores, thereby indicating shared biological characteristics or functional relationships. This clustering process enables users to identify cohesive groups of genesets with related functionalities or involvement in similar biological processes.

```{r clustering-graph, fig.align = "center", fig.cap = "The Clustering Graph panel", echo = FALSE}
knitr::include_graphics("Clustering_Graph_Panel.png")
```

**Choosing a Clustering algorithm**

The upper box, labeled **Select the clustering method**, provides a selection of distinct clustering algorithms. Users can explore various options to find the most suitable algorithm for their analysis:

* **Louvain**: The Louvain algorithm, a prevalent tool in biological network analysis, seeks to divide graph nodes into clusters to optimize the modularity metric. This metric gauges the strength of connections within clusters relative to those between clusters. Consequently, nodes within the same cluster exhibit greater similarity to one another than to nodes outside the cluster. This clustering approach aims to enhance data interpretation by grouping similar genesets together. Users can adjust a slider in the bottom left corner of the box to set a similarity threshold, determining when genesets are considered similar based on distance scores.
* **Markov**: The Markov algorithm, commonly employed in biological network analysis, is designed to pinpoint densely interconnected regions within graphs. These regions frequently align with communities or clusters in the graph structure. Users can utilize a slider located in the bottom left corner of the box to specify a similarity threshold, determining when genesets are deemed similar based on distance scores.
* **Fuzzy clustering**: The Fuzzy Clustering algorithm partitions data points into clusters based on their similarity, while allowing data points to belong to multiple clusters with varying degrees of membership. It operates in distinct steps and requires the specification of three thresholds. Firstly, the **Similarity threshold** determines whether two genesets are similar enough to be potentially clustered together. Secondly, the **Membership threshold** dictates how many members of a potential cluster must have a close relationship, defined by a distance score less than or equal to the similarity threshold, for the cluster to persist. Lastly, the **Clustering threshold** determines whether two clusters will be merged: clusters are merged if their percentage of overlap meets or exceeds the clustering threshold. Users can adjust all thresholds using sliders provided in the interface.
* **PAM**: The PAM (Partitioning Around Medoids) clustering algorithm partitions nodes into k distinct clusters, where **k** is a user-defined parameter. Each cluster is represented by a medoid, i.e. one of the genesets themselves. The algorithm iteratively assigns each node to the nearest medoid based on the calculated distance scores, and then updates the medoids to minimize the total distance between the nodes and the medoid of their cluster. Users can specify the number of clusters, **k**, using a slider in the interface. Adjusting the value of k enables the exploration of different clustering granularities, providing flexibility in interpreting the data and identifying meaningful patterns.

Once you choose a method, you can start the cluster calculation via the **Cluster the Genesets** button on the right. Keep in mind that this step might take some time, especially for larger datasets. Look for the progress bar in the lower right corner for updates on the clustering status.
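To make the difference between the graph-based and the partitioning algorithms listed above more concrete, the following minimal sketch clusters a small, made-up distance matrix outside of the app. It is an illustration under stated assumptions and not the implementation used by `r BiocStyle::Biocpkg("GeDi")`: it assumes that the `r BiocStyle::CRANpkg("igraph")` and `r BiocStyle::CRANpkg("cluster")` packages are installed, and the objects `toy_scores`, `threshold` and `k` are hypothetical names chosen for this example.

```r
# Illustrative sketch only: toy data, clustered with igraph and cluster
# rather than with GeDi's own functions
library("igraph")
library("cluster")

# Hypothetical distance scores for six genesets forming two tight groups
gs_names <- paste0("gs", 1:6)
toy_scores <- matrix(0.9,
  nrow = 6, ncol = 6,
  dimnames = list(gs_names, gs_names)
)
toy_scores[1:3, 1:3] <- 0.1
toy_scores[4:6, 4:6] <- 0.2
diag(toy_scores) <- 0

# Graph-based route (as for Louvain): connect two genesets if their
# distance is at or below the similarity threshold
threshold <- 0.3
adjacency <- (toy_scores <= threshold) * 1
diag(adjacency) <- 0
toy_graph <- graph_from_adjacency_matrix(adjacency, mode = "undirected")

# Detect communities on the thresholded graph
louvain_membership <- setNames(
  as.integer(membership(cluster_louvain(toy_graph))),
  gs_names
)

# Partitioning route (PAM): use the distances directly with a fixed k
k <- 2
pam_fit <- pam(as.dist(toy_scores), k = k)
pam_membership <- pam_fit$clustering

# Compare the two cluster assignments side by side
table(louvain = louvain_membership, pam = pam_membership)
```

In this toy example the threshold plays the role of the similarity slider and `k` the role of the cluster number slider. Changing either value is a quick way to see how strongly the resulting clusters depend on these choices.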
Once the clusters are calculated, you can explore various visualizations of your data in the **Geneset Cluster Graphs** box.

**Cluster Visualizations**

* **Geneset Graph**: In the **Geneset Graph**, clusters are visualized as a graph, with individual genesets serving as nodes and edges connecting genesets within the same cluster. To highlight specific nodes, use the **Select by id** feature on the left, or highlight entire clusters by selecting the respective option under **Select by cluster**. Please note that only genesets belonging to at least one cluster are displayed in this graph. For additional insights, nodes can be colored based on specific parameters from your input data, accessible through the **Color the graph by** drop-down menu. The available options depend on the information provided with your data. While interacting with the network, nodes can be moved by clicking and dragging them to the desired location, which is particularly useful in complex or densely populated graphs.
* **Cluster-Geneset Bipartite Graph**: The **Cluster-Geneset Bipartite Graph** presents a bipartite representation of the clusters. In this visualization, nodes represent both clusters and genesets, with edges connecting cluster nodes to their corresponding geneset members. Hovering over nodes provides additional information: cluster nodes display the members of each cluster, while geneset nodes show the genes associated with each geneset.
* **Cluster Enrichment Terms Word Cloud**: The **Cluster Enrichment Terms Word Cloud** displays the most frequently occurring terms for each cluster. This visualization is particularly useful when your data includes brief descriptions of the genesets, in addition to the mandatory input data. With the **Select a cluster** drop-down menu, you can choose the cluster of interest.
Furthermore, hovering over the word cloud enables you to select individual terms and view how often each term appears in the descriptions of the genesets within that cluster.
* **Clustering graph summaries**: The cluster information is also summarized in a table in the **Clustering graph summaries** box. This table displays each geneset alongside the cluster to which it belongs. Additionally, the table features a search function for the quick retrieval of a geneset of interest.

**Bookmarking from this panel:** While exploring the **Clustering Graph** panel, users may encounter genesets and clusters that warrant further investigation. To preserve them for future exploration, users simply select a geneset or cluster in the Geneset Graph or the Cluster-Geneset Bipartite Graph and click the **Bookmark** button located in the Navbar. This adds the selected geneset or cluster to the list of bookmarked items within the **Report** panel. Helpful messages displayed in the lower right corner assist users throughout the bookmarking process.

After exploring the results in the **Clustering Graph** panel, users can proceed to the **Report** panel to have a look at the bookmarked genesets and clusters, or iterate through the individual panels of the app for a more in-depth exploration of the data.

## The Report panel

In this panel of the application, users can obtain a comprehensive overview of the items they have bookmarked for further exploration. On the left side of the interface, bookmarked genesets are listed, while bookmarked clusters are displayed on the right side. During an interactive exploration session, recalling specific details about each bookmarked item can sometimes be challenging.
Therefore, users are provided with convenient options to manage their bookmarked data. Below the interactive tables displaying bookmarked genesets and clusters, users can find buttons allowing them to download the content of each table individually. Additionally, the **Start the generation of the report** button is provided to generate a detailed report encompassing all selected elements of interest.

The report generation process utilizes a predefined template report included within the `r BiocStyle::Biocpkg("GeDi")` package. This template leverages the input elements and reactive values associated with the bookmarks, ensuring that the generated report contains comprehensive and relevant information. The resulting report serves as a valuable tool for creating a permanent and reproducible analysis output. Users can easily store or share this report for future reference or collaboration purposes.

```{r report-panel, fig.align = "center", fig.cap = "The Report panel", echo = FALSE}
knitr::include_graphics("Report_panel.png")
```

# Additional Information {#additionalinfo}

If you have questions about the package or the available functionality, please submit them on the Bioconductor [support site](https://support.bioconductor.org/) using the tag 'GeDi'.

Bug reports can be opened as issues in the `r BiocStyle::Biocpkg("GeDi")` [GitHub repository](https://github.com/AnnekathrinSilvia/GeDi/issues). Please note that the GitHub repository also hosts the development version of the package, where new functionality is continuously added. Be cautious, as you may be working with cutting-edge versions!

The authors welcome thoughtful suggestions for enhancements or new features, and even better, pull requests.

# Additional example data

In this section, we present additional examples demonstrating the versatility of `r BiocStyle::Biocpkg("GeDi")` in analyzing functional enrichment data from two widely used databases, KEGG and Reactome.
The step-by-step demonstrations below show how results based on these databases can be prepared for and explored in GeDi, whether you are investigating specific pathways or broader biological processes.

Specifically, we demonstrate how results containing identifiers from databases like KEGG [@Kanehisa2023] or Reactome [@Gillespie2022], e.g. generated using the `enrichKEGG` function from the `r BiocStyle::Biocpkg("clusterProfiler")` package or the `enrichPathway` function from the `r BiocStyle::Biocpkg("ReactomePA")` package, can be utilized as input for `r BiocStyle::Biocpkg("GeDi")`.

We will again use the data of the `r BiocStyle::Biocpkg("macrophage")` package, specifically the differentially expressed genes we have identified before. With this data, we demonstrate how to generate the results and prepare them for their use in `r BiocStyle::Biocpkg("GeDi")`.

However, before we can use the `enrichKEGG` function from the `r BiocStyle::Biocpkg("clusterProfiler")` package, we have to map the ENSEMBL IDs of the data to Entrez IDs. For this, we first use the `r BiocStyle::Biocpkg("biomaRt")` package to generate a mapping of ENSEMBL to Entrez IDs.
```{r withbiomart, eval = FALSE}
# Load the "biomaRt" package to access the BioMart database
library("biomaRt")

# Set up a connection to the ENSEMBL BioMart database for human genes
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

# Retrieve gene annotations using the BioMart database
anns <- getBM(
  attributes = c(
    "ensembl_gene_id",
    "external_gene_name",
    "entrezgene_id",
    "description"
  ),
  filters = "ensembl_gene_id",
  values = rownames(dds_macrophage),
  mart = mart
)

# Match the retrieved annotations to the genes in dds_macrophage
anns <- anns[match(rownames(dds_macrophage), anns[, 1]), ]
```

Next, we map the differentially expressed genes to get the right identifiers and run the `enrichKEGG` function. We set the organism to human and the p-value cutoff to 0.05.

```{r enrichKegg, eval = FALSE}
# Load the "clusterProfiler" package for functional enrichment analysis
library("clusterProfiler")

# Retrieve Entrez gene IDs from the annotations data frame based on matching
# Ensembl gene IDs from the DE results
genes <- anns$entrezgene_id[match(rownames(res_macrophage_IFNg_vs_naive), anns$ensembl_gene_id)]

# Perform KEGG pathway enrichment analysis using the retrieved gene IDs
res_enrich <- enrichKEGG(genes,
  organism = "hsa",
  pvalueCutoff = 0.05
)
```

We can now use the results of the enrichment in `r BiocStyle::Biocpkg("GeDi")`. For this, we directly start the app with the loaded data. If you have not computed the data following this workflow, you can load it beforehand from the example data shipped with this package.

```{r GeDi_Kegg, eval = FALSE}
# Load the "macrophage_KEGG_example" dataset from the "GeDi" package
data("macrophage_KEGG_example", package = "GeDi")

# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_KEGG_example"
# dataset
GeDi(genesets = macrophage_KEGG_example)
```

In a similar manner we can use the Reactome database for the functional annotation.
Here, we use the `r BiocStyle::Biocpkg("ReactomePA")` package and the differentially expressed genes.

```{r enrichReactome, eval = FALSE}
# Load the "ReactomePA" package for pathway enrichment analysis
library("ReactomePA")

# Perform pathway enrichment analysis using the "enrichPathway" function
reactome <- enrichPathway(genes,
  organism = "human",
  pvalueCutoff = 0.05,
  readable = TRUE
)
```

Now we can use the results in the same manner as for the KEGG pathway analysis.

```{r GeDi_Reactome, eval = FALSE}
# Load the "macrophage_Reactome_example" dataset from the "GeDi" package
data("macrophage_Reactome_example", package = "GeDi")

# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_Reactome_example"
# dataset
GeDi(genesets = macrophage_Reactome_example)
```

# FAQs {#faqs}

**Q: My configuration on two machines is somewhat different, so I am having difficulty in finding out what packages are different. Is there something to help on this?**

A: Yes, you can check out `r BiocStyle::Githubpkg("federicomarini/sessionDiffo")`, a small utility to compare two different `sessionInfo` outputs. This can help you pinpoint what packages might be causing the issue.

**Q: I am using a different service/software for generating the results of functional enrichment analysis. How do I plug this into `GeDi`?**

A: You can use nearly any result of a functional enrichment analysis in `r BiocStyle::Biocpkg("GeDi")`, as long as the results are transformed to fit the input requirements. Please check out the **Welcome** page to see the specification of the input requirements.

# Session Info {- .smaller}

```{r sessioninfo}
utils::sessionInfo()
```

# References {-}