Title: | Visualising Set Enrichment Analysis Results |
---|---|
Description: | This package enables the interpretation and analysis of results from a gene set enrichment analysis using network-based and text-mining approaches. Most enrichment analyses result in large lists of significant gene sets that are difficult to interpret. Tools in this package help build a similarity-based network of significant gene sets from a gene set enrichment analysis that can then be investigated for their biological function using text-mining approaches. |
Authors: | Dharmesh D. Bhuva [aut, cre], Ahmed Mohamed [ctb] |
Maintainer: | Dharmesh D. Bhuva <[email protected]> |
License: | GPL-3 |
Version: | 1.15.0 |
Built: | 2024-11-19 04:47:56 UTC |
Source: | https://github.com/bioc/vissE |
Custom theme
bhuvad_theme(rl = 1.1)
rl |
a numeric, scaling factor to apply to text sizes |
a ggplot2 theme
p1 = ggplot2::ggplot()
p1 + bhuvad_theme()
This function performs a network-based enrichment analysis of a list of genes. The list of genes is characterised based on its similarity with gene sets from the MSigDB, and a network of similar gene sets is returned.
characteriseGeneset( gs, thresh = 0.2, measure = c("ovlapcoef", "jaccard"), gscolcs = c("h", "c2", "c5"), org = c("auto", "hs", "mm") )
gs |
a GeneSet object, representing the list of genes that need to be characterised. |
thresh |
a numeric, specifying the threshold to discard pairs of gene sets. |
measure |
a character, specifying the similarity measure to use: |
gscolcs |
a character, listing the MSigDB collections to use as a
background (defaults to h, c2, and c5). Collection types can be retrieved
using |
org |
a character, specifying the organism to use. This can either be "auto" (default), "hs" or "mm". |
an igraph object, containing gene sets that are similar to the query set. The network contains relationships between results of the query too.
library(GSEABase)
data(hgsc)

# create a gene set using one of the Hallmark gene sets
mySet <- GeneSet(
  geneIds(hgsc[[2]]),
  setName = 'MySet',
  geneIdType = SymbolIdentifier()
)

# characterise the custom gene set
ig <- characteriseGeneset(mySet)
plotMsigNetwork(ig)
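The idea of characterising a query gene list by its similarity to background gene sets can be sketched in base R. This is a minimal illustration only, not the package's implementation: the gene sets, the helper `ovlapcoef`, and the rule of keeping values at or above `thresh` are all assumptions made for the example.

```r
# Hypothetical query gene list and background gene sets (toy data)
query <- c("HIF1A", "VEGFA", "LDHA", "PGK1")
background <- list(
  BG_HYPOXIA    = c("HIF1A", "VEGFA", "PGK1", "SLC2A1", "CA9"),
  BG_GLYCOLYSIS = c("LDHA", "PGK1", "ENO1", "HK2"),
  BG_UNRELATED  = c("CD3E", "CD4", "CD8A")
)

# Overlap coefficient: |A n B| / min(|A|, |B|)
ovlapcoef <- function(a, b) {
  length(intersect(a, b)) / min(length(a), length(b))
}

# Similarity of the query to every background gene set
sim <- vapply(background, ovlapcoef, numeric(1), b = query)

# Keep background sets whose similarity reaches the threshold
# (assumption: values equal to the threshold are retained)
thresh <- 0.2
similar <- names(sim)[sim >= thresh]
print(sim)
print(similar)
```

In this toy case the unrelated set shares no genes with the query and is discarded, while the other two sets would form the neighbourhood of the query in the resulting network.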
Computes an igraph object using information on gene sets and gene set overlaps computed using the computeMsigOverlap() function.
computeMsigNetwork(genesetOverlap, msigGsc)
genesetOverlap |
a data.frame, containing results of an overlap analysis
computed using the |
msigGsc |
a GeneSetCollection object, containing gene sets used to compute overlap. |
an igraph object
data(hgsc)
ovlap <- computeMsigOverlap(hgsc)
ig <- computeMsigNetwork(ovlap, hgsc)
Compute overlap between gene sets from a GeneSetCollection using the Jaccard index or the overlap coefficient. These values can then be used to compute a network of gene set overlaps.
computeMsigOverlap( msigGsc1, msigGsc2 = NULL, thresh = 0.25, measure = c("ari", "jaccard", "ovlapcoef") )
msigGsc1 |
a GeneSetCollection object. |
msigGsc2 |
a GeneSetCollection object or NULL if pairwise overlaps are to be computed. |
thresh |
a numeric, specifying the threshold to discard pairs of gene sets. |
measure |
a character, specifying the similarity measure to use: |
a data.frame, containing the overlap structure of gene sets represented as a network in the simple interaction format (SIF).
data(hgsc)
ovlap <- computeMsigOverlap(hgsc)
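For readers unfamiliar with the two measures named in the description, the following base R sketch computes the Jaccard index and the overlap coefficient for every pair of toy gene sets and stores the result as an edge list. It is an illustration under stated assumptions, not the package's code: the gene sets, the column names `gs1` and `gs2`, and the choice to keep pairs at or above the threshold are hypothetical.

```r
# Toy gene sets (hypothetical, not taken from MSigDB)
sets <- list(
  SET_A = c("TP53", "MYC", "EGFR", "KRAS"),
  SET_B = c("TP53", "MYC", "BRCA1"),
  SET_C = c("GAPDH", "ACTB")
)

# Jaccard index: |A n B| / |A u B|
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Overlap coefficient: |A n B| / min(|A|, |B|)
ovlapcoef <- function(a, b) length(intersect(a, b)) / min(length(a), length(b))

# All unordered pairs of gene sets, one pair per row
pairs <- t(combn(names(sets), 2))

# Apply a similarity function to every pair
pair_sim <- function(f) {
  mapply(function(x, y) f(sets[[x]], sets[[y]]),
         pairs[, 1], pairs[, 2], USE.NAMES = FALSE)
}

# Edge list with one row per pair (column names are assumptions)
sif <- data.frame(
  gs1 = pairs[, 1],
  gs2 = pairs[, 2],
  jaccard = pair_sim(jaccard),
  ovlapcoef = pair_sim(ovlapcoef),
  stringsAsFactors = FALSE
)

# Discard pairs below the threshold (assumption: ties are kept)
thresh <- 0.25
sif_kept <- sif[sif$jaccard >= thresh, ]
print(sif_kept)
```

The overlap coefficient is never smaller than the Jaccard index for the same pair, so the same threshold retains more edges under the former.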
Compute word frequencies for a single MSigDB collection
computeMsigWordFreq( msigGsc, weight = NULL, measure = c("tfidf", "tf"), version = msigdb::getMsigdbVersions(), org = c("auto", "hs", "mm"), rmwords = getMsigExclusionList(), idf = NULL )
msigGsc |
a GeneSetCollection object, containing gene sets from the
MSigDB. The |
weight |
a named numeric vector, containing weights to apply to each gene-set. This can be -log10(FDR), -log10(p-value) or an enrichment score (ideally unsigned). |
measure |
a character, specifying how frequencies should be computed. "tf" uses term frequencies and "tfidf" (default) applies inverse document frequency weights to term frequencies. |
version |
a character, specifying the version of msigdb to use (see
|
org |
a character, specifying the organism to use. This can either be "auto" (default), "hs" or "mm". |
rmwords |
a character vector, containing an exclusion list of words to discard from the analysis. |
idf |
a list of named numeric vectors, specifying inverse document frequencies to use to penalise terms from gene-set names and short descriptions. This should be a vector of length 2 with names "Name" and "Short". Numeric vectors should contain weights and names should represent the term. Precomputed versions can be retrieved using the |
a list, containing two data.frames summarising the results of the frequency analysis on gene set names and short descriptions.
data(hgsc)
freq <- computeMsigWordFreq(hgsc, measure = 'tfidf')
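The difference between the "tf" and "tfidf" measures can be illustrated with a small base R sketch. It assumes gene-set names are split on underscores, excluded words are dropped, and term frequencies are multiplied by inverse document frequencies; the names, the exclusion list, and the `idf` values below are made up for the example, and the exact weighting used by the package may differ.

```r
# Hypothetical gene-set names in MSigDB style
gs_names <- c(
  "HALLMARK_HYPOXIA",
  "HALLMARK_GLYCOLYSIS",
  "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"
)

# Toy exclusion list (assumption, not the package's list)
rmwords <- c("hallmark", "kegg")

# Split names into lower-case words, one vector per gene set
words <- strsplit(tolower(gs_names), "_", fixed = TRUE)

# Drop excluded words; setdiff() also removes within-set duplicates
words <- lapply(words, setdiff, y = rmwords)

# Term frequency: number of gene sets in which each word occurs
tf <- table(unlist(words))

# Hypothetical inverse document frequencies for the remaining words
idf <- c(hypoxia = 2.0, glycolysis = 1.5, gluconeogenesis = 3.0)

# tf-idf: term frequency weighted by inverse document frequency
tfidf <- as.numeric(tf[names(idf)]) * idf
print(tfidf)
```

Here "glycolysis" occurs in two names, yet its lower idf weight means it does not outrank the rarer "gluconeogenesis", which is the penalising effect the `idf` argument is meant to provide.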
This function identifies gene-set clusters from a gene-set overlap network produced using vissE. Various graph clustering algorithms from the igraph package can be used for clustering. The gene-set clusters identified are then sorted based on their size and a given statistic of interest (the absolute value of the statistic is maximised per cluster).
findMsigClusters( ig, genesetStat = NULL, minSize = 2, alg = igraph::cluster_walktrap, algparams = list() )
ig |
an igraph object, containing a network of gene set overlaps computed
using |
genesetStat |
a named numeric, containing statistics for each gene-set that are to be used in cluster prioritisation. If NULL, clusters are prioritised based on their size (number of gene-sets in them). |
minSize |
a numeric, stating the minimum size a cluster can be (default is 2). |
alg |
a function, from the |
algparams |
a list, specifying additional parameters that are to be passed to the graph clustering algorithm. |
Gene-set clusters are identified using graph clustering and are prioritised based on a combination of cluster size and, optionally, a statistic of interest (e.g., enrichment scores). A product-of-ranks approach is used to prioritise clusters when gene-set statistics are available. In this approach, clusters are ranked based on their size (largest to smallest) and on the median absolute statistic of the gene-sets within them (largest to smallest). The product of these ranks is computed and clusters are ranked based on this product-of-ranks statistic (smallest to largest).
When prioritising using cluster size and gene-set statistics, if statistics for some gene-sets in the network are missing, only the size is used in cluster prioritisation.
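The product-of-ranks prioritisation described above can be sketched in base R as follows. This is an illustration of the described approach rather than the package's code: the clusters and statistics are hypothetical, and the handling of tied ranks (the default of `rank()`) is an assumption.

```r
# Hypothetical clusters of gene-set names
clusters <- list(
  c1 = c("GS1", "GS2", "GS3", "GS4"),
  c2 = c("GS5", "GS6"),
  c3 = c("GS7", "GS8", "GS9")
)

# Hypothetical gene-set statistics (e.g. enrichment scores)
genesetStat <- c(
  GS1 = 0.5, GS2 = -0.4, GS3 = 0.6, GS4 = 0.3,
  GS5 = -2.5, GS6 = 2.1,
  GS7 = 3.2, GS8 = -3.0, GS9 = 2.8
)

# Cluster size and median absolute statistic per cluster
size <- lengths(clusters)
med_abs <- vapply(
  clusters,
  function(gs) median(abs(genesetStat[gs])),
  numeric(1)
)

# Rank both from largest to smallest (rank 1 = largest)
rank_size <- rank(-size)
rank_stat <- rank(-med_abs)

# Product of ranks; smaller products get higher priority
prod_rank <- rank_size * rank_stat
prioritised <- clusters[order(prod_rank)]
print(names(prioritised))
```

In this toy example the mid-sized cluster with the strongest statistics comes first, ahead of the largest cluster, which has weak statistics.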
a list, containing gene-sets that belong to each cluster. Items in the list are organised based on prioritisation.
data(hgsc)
ovlap <- computeMsigOverlap(hgsc, thresh = 0.25)
ig <- computeMsigNetwork(ovlap, hgsc)
findMsigClusters(ig)
List of words to discard when text mining MSigDB gene set names and short descriptions.
getMsigExclusionList(custom = c())
custom |
a character vector, containing a list of words to add to the existing exclusion list. |
a character vector, containing words to be excluded from the text mining analysis.
getMsigExclusionList('remove')
The molecular signatures database (MSigDB) is a collection of over 25000 gene expression signatures. Signatures in v7.2 are divided into 9 categories. The Hallmarks collection contains gene expression signatures representing molecular processes that are hallmarks in cancer development and progression.
hgsc
A GeneSetCollection object with 50 GeneSet objects representing the 50 Hallmark gene expression signatures.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545-15550.
Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739-1740.
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., & Tamayo, P. (2015). The molecular signatures database hallmark gene set collection. Cell systems, 1(6), 417-425.
This function plots gene statistics against gene frequencies for any given cluster of gene sets. The plot can be used to identify genes that are over-represented in a cluster of gene-sets (identified based on gene-set overlaps) and have a strong statistic (e.g. log fold-change or p-value).
plotGeneStats( geneStat, msigGsc, groups, statName = "Gene-level statistic", topN = 5 )
geneStat |
a named numeric, containing the statistic to be displayed. The vector must be named with either gene Symbols or Entrez IDs depending on annotations in msigGsc. |
msigGsc |
a GeneSetCollection object, containing gene sets from the
MSigDB. The |
groups |
a named list, of character vectors or numeric indices specifying node groupings. Each element of the list represents a group and contains a character vector with node names. |
statName |
a character, specifying the name of the statistic. |
topN |
a numeric, specifying the number of genes to label. The top genes are those with the largest count and statistic. |
a ggplot object, plotting the gene-level statistic against gene frequencies in the cluster of gene sets.
library(GSEABase)
data(hgsc)
groups <- list('g1' = names(hgsc)[1:25], 'g2' = names(hgsc)[26:50])

# create statistics
allgenes = unique(unlist(geneIds(hgsc)))
gstats = rnorm(length(allgenes))
names(gstats) = allgenes

# plot
plotGeneStats(gstats, hgsc, groups)
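The quantity on the frequency axis of this plot, the number of gene sets in a cluster that contain each gene, can be sketched in base R. The gene sets and statistics below are hypothetical, and ordering genes by count and then by absolute statistic is only one plausible reading of "largest count and statistic"; the package may combine the two differently.

```r
# Hypothetical cluster of gene sets
cluster_sets <- list(
  GS_A = c("TP53", "MYC", "EGFR"),
  GS_B = c("TP53", "MYC", "KRAS"),
  GS_C = c("TP53", "BRCA1")
)

# Hypothetical gene-level statistics (e.g. log fold-changes)
geneStat <- c(TP53 = 1.8, MYC = -2.2, EGFR = 0.4, KRAS = 0.9, BRCA1 = -0.1)

# Count the number of gene sets in which each gene occurs
freq <- table(unlist(lapply(cluster_sets, unique)))

# Combine counts and statistics into one table
df <- data.frame(
  gene = names(freq),
  count = as.integer(freq),
  stat = unname(geneStat[names(freq)]),
  stringsAsFactors = FALSE
)

# Pick the top genes: highest count first, then largest |statistic|
# (assumed ordering, used here only for illustration)
topN <- 2
top <- head(df[order(-df$count, -abs(df$stat)), ], topN)
print(top)
```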
Plots a network of gene set overlaps, with overlaps computed using computeMsigOverlap() and a graph created using computeMsigNetwork().
plotMsigNetwork( ig, markGroups = NULL, genesetStat = NULL, nodeSF = 1, edgeSF = 1, lytFunc = "graphopt", lytParams = list(), rmUnmarkedGroups = FALSE, maxGrp = 12 )
ig |
an igraph object, containing a network of gene set overlaps
computed using |
markGroups |
a named list, of character vectors. Each element of the list represents a group and contains a character vector with node names. Up to 12 groups can be visualised in the plot. |
genesetStat |
a named numeric, statistic to project onto the nodes. These could be p-values, log fold-changes or gene set score from a singscore-based analysis. |
nodeSF |
a numeric, indicating the scaling factor to apply to node sizes. |
edgeSF |
a numeric, indicating the scaling factor to apply to edge widths. |
lytFunc |
a character, specifying the layout to use (see
|
lytParams |
a named list, containing additional parameters needed for
the layout (see |
rmUnmarkedGroups |
a logical, indicating whether unmarked groups should be removed from the network (TRUE) or retained (FALSE - default). |
maxGrp |
a numeric, specifying the maximum number of groups to plot. |
a ggplot2 object
data(hgsc)
ovlap <- computeMsigOverlap(hgsc, thresh = 0.15)
ig <- computeMsigNetwork(ovlap, hgsc)
groups <- list(
  'g1' = c("HALLMARK_HYPOXIA", "HALLMARK_GLYCOLYSIS"),
  'g2' = c("HALLMARK_INTERFERON_GAMMA_RESPONSE")
)
plotMsigNetwork(ig, markGroups = groups)
This function plots the protein-protein interaction (PPI) network for a gene-set cluster identified using vissE. The international molecular exchange (IMEx) PPI is used to obtain PPIs for genes present in a gene-set cluster.
plotMsigPPI( ppidf, msigGsc, groups, geneStat = NULL, statName = "Gene-level statistic", threshConfidence = 0, threshFrequency = 0.25, threshStatistic = 0, threshUseAbsolute = TRUE, topN = 5, nodeSF = 1, edgeSF = 1, lytFunc = "graphopt", lytParams = list() )
ppidf |
a data.frame, containing a protein-protein interaction from the
IMEx database. This can be retrieved from the |
msigGsc |
a GeneSetCollection object, containing gene sets from the
MSigDB. The |
groups |
a named list, of character vectors or numeric indices specifying node groupings. Each element of the list represents a group and contains a character vector with node names. |
geneStat |
a named numeric, containing the statistic to be displayed. The vector must be named with either gene Symbols or Entrez IDs depending on annotations in msigGsc. |
statName |
a character, specifying the name of the statistic. |
threshConfidence |
a numeric, specifying the confidence threshold to apply to determine high confidence interactions. This should be a value between 0 and 1 (default is 0). |
threshFrequency |
a numeric, specifying the frequency threshold to apply to determine more frequent genes in the gene-set cluster. The frequency of a gene is computed as the proportion of gene-sets to which the gene belongs. This should be a value between 0 and 1 (default is 0.25). |
threshStatistic |
a numeric, specifying the threshold to apply to gene-level statistics (e.g. a log fold-change). This should be a value between 0 and 1 (default is 0). |
threshUseAbsolute |
a logical, indicating whether the |
topN |
a numeric, specifying the number of genes to label. The top genes are those with the largest count and statistic. |
nodeSF |
a numeric, indicating the scaling factor to apply to node sizes. |
edgeSF |
a numeric, indicating the scaling factor to apply to edge widths. |
lytFunc |
a character, specifying the layout to use (see
|
lytParams |
a named list, containing additional parameters needed for
the layout (see |
a ggplot object with the protein-protein interaction networks plot for each gene-set cluster.
data(hgsc)
grps = list(
  'early' = 'HALLMARK_ESTROGEN_RESPONSE_EARLY',
  'late' = 'HALLMARK_ESTROGEN_RESPONSE_LATE'
)
ppi = msigdb::getIMEX(org = 'hs', inferred = TRUE)
plotMsigPPI(ppi, hgsc, grps)
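How the frequency, statistic, and confidence thresholds might narrow a PPI network can be sketched in base R. Everything here is an assumption made for illustration: the gene sets, the statistics, the PPI table and its column names (`from`, `to`, `confidence`), the threshold values, and the reading of `threshUseAbsolute` as "compare the absolute value of the statistic". The sketch does not reproduce the package's filtering logic.

```r
# Hypothetical cluster of gene sets
cluster_sets <- list(
  GS_A = c("ESR1", "GREB1", "TFF1", "MYC"),
  GS_B = c("ESR1", "GREB1", "PGR"),
  GS_C = c("ESR1", "TFF1", "CCND1"),
  GS_D = c("ESR1", "KRT19")
)

# Hypothetical gene-level statistics
geneStat <- c(ESR1 = 2.0, GREB1 = -1.5, TFF1 = 0.2, MYC = 1.1,
              PGR = 0.8, CCND1 = -0.6, KRT19 = 0.3)

# Hypothetical PPI edge list (column names are assumptions)
ppi <- data.frame(
  from = c("ESR1", "ESR1", "GREB1", "MYC"),
  to = c("GREB1", "TFF1", "PGR", "ESR1"),
  confidence = c(0.9, 0.3, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# Illustrative thresholds (not the function defaults)
threshFrequency <- 0.5
threshStatistic <- 1
threshUseAbsolute <- TRUE
threshConfidence <- 0.5

# Gene frequency: proportion of gene sets containing each gene
gene_freq <- table(unlist(lapply(cluster_sets, unique))) / length(cluster_sets)
frequent <- names(gene_freq)[gene_freq >= threshFrequency]

# Apply the statistic threshold, optionally on absolute values
stat <- geneStat[frequent]
if (threshUseAbsolute) stat <- abs(stat)
kept_genes <- frequent[stat >= threshStatistic]

# Keep confident interactions between retained genes
kept_edges <- ppi[
  ppi$from %in% kept_genes &
    ppi$to %in% kept_genes &
    ppi$confidence >= threshConfidence,
]
print(kept_edges)
```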
Given a gene set collection, this function computes the frequency of words in gene set names from the Molecular Signatures Database (MSigDB) collection (names are split on "_"). Word frequencies are also computed using the short descriptions attached to each gene set object.
plotMsigWordcloud( msigGsc, groups, weight = NULL, measure = c("tfidf", "tf"), version = msigdb::getMsigdbVersions(), org = c("auto", "hs", "mm"), rmwords = getMsigExclusionList(), type = c("Name", "Short"), idf = NULL )
msigGsc |
a GeneSetCollection object, containing gene sets from the
MSigDB. The |
groups |
a named list, of character vectors or numeric indices specifying node groupings. Each element of the list represents a group and contains a character vector with node names. |
weight |
a named numeric vector, containing weights to apply to each gene-set. This can be -log10(FDR), -log10(p-value) or an enrichment score (ideally unsigned). |
measure |
a character, specifying how frequencies should be computed. "tf" uses term frequencies and "tfidf" (default) applies inverse document frequency weights to term frequencies. |
version |
a character, specifying the version of msigdb to use (see
|
org |
a character, specifying the organism to use. This can either be "auto" (default), "hs" or "mm". |
rmwords |
a character vector, containing an exclusion list of words to discard from the analysis. |
type |
a character, specifying the source of text mining. Either gene
set names ( |
idf |
a list of named numeric vectors, specifying inverse document frequencies to use to penalise terms from gene-set names and short descriptions. This should be a vector of length 2 with names "Name" and "Short". Numeric vectors should contain weights and names should represent the term. Precomputed versions can be retrieved using the |
a ggplot object.
data("hgsc")
groups <- list('g1' = names(hgsc)[1:25], 'g2' = names(hgsc)[26:50])
plotMsigWordcloud(hgsc, groups, rmwords = getMsigExclusionList())