Title: | Combining GSEA-based pathway enrichment with multi omics data integration |
---|---|
Description: | Extracted features from pathways derived from 8 different databases (KEGG, Reactome, Biocarta, etc.) can be used on transcriptomic, proteomic, and/or metabolomic level to calculate a combined GSEA-based enrichment score. |
Authors: | Sebastian Canzler [aut, cre] , Jörg Hackermüller [aut] |
Maintainer: | Sebastian Canzler <[email protected]> |
License: | GPL-3 |
Version: | 1.17.1 |
Built: | 2024-11-22 03:12:07 UTC |
Source: | https://github.com/bioc/multiGSEA |
Retrieve the path to the cache directory for the multiGSEA package. Create the cache directory if need be.
archiveDir()
archiveDir()
String containing the path to the cache directory.
The function retrieves the path to a file that is cached in the archive directory.
archivePath(filename)
archivePath(filename)
filename |
Name of the file. |
String containing the path to the file.
This function applies the Stouffer method, the Edgington method or the Fisher\'s combined probability test to combine p-values of independent tests that are based on the same null hypothesis. The Stouffer method can also be applied in a weighted fashion.
combinePvalues(df, method = "stouffer", col_pattern = "pval", weights = NULL)
combinePvalues(df, method = "stouffer", col_pattern = "pval", weights = NULL)
df |
Data frame where rows represent a certain pathway or gene set and columns represent p-values derived from independent tests, e.g., different omics layer. |
method |
String that specifies the method to combine multiple p-values. Default: "stouffer" Options: "stouffer", "fisher", "edgington" |
col_pattern |
String of the pattern that specifies the columns to be combined. Default: "pval", Options: "pval", "padj" (legacy) |
weights |
List of weights that will be used in a weighted Stouffer method. |
Vector of length nrow(df)
with combined p-values.
df <- cbind(runif(5), runif(5), runif(5)) colnames(df) <- c("trans.pval", "prot.pval", "meta.pval") # run the unweighted summation of z values combinePvalues(df) # run the weighted variant combinePvalues(df, weights = c(10, 5, 1)) # run the Fisher's combined probability test combinePvalues(df, method = "fisher") # run the Edgington's method combinePvalues(df, method = "edgington")
df <- cbind(runif(5), runif(5), runif(5)) colnames(df) <- c("trans.pval", "prot.pval", "meta.pval") # run the unweighted summation of z values combinePvalues(df) # run the weighted variant combinePvalues(df, weights = c(10, 5, 1)) # run the Fisher's combined probability test combinePvalues(df, method = "fisher") # run the Edgington's method combinePvalues(df, method = "edgington")
This function reshapes the output from multiGSEA to get a single data frame with columns for p-values and adjusted p-values for each omics layer. Each row of the data frame represents one pathway.
extractPvalues(enrichmentScores, pathwayNames)
extractPvalues(enrichmentScores, pathwayNames)
enrichmentScores |
Nested List of enrichment scores, calculated by multiGSEA function. |
pathwayNames |
List containing Pathway names. |
Data frame where rows are pathways and columns are (adjusted) p-values for each omics layer.
# Download pathway definition and extract features pathways <- getMultiOmicsFeatures(dbs = c("kegg"), layer = c("transcriptome", "proteome")) # load omics data and calculate ranks data(transcriptome) data(proteome) ranks <- initOmicsDataStructure(c("transcriptome", "proteome")) ranks$transcriptome <- rankFeatures(transcriptome$logFC, transcriptome$pValue) names(ranks$transcriptome) <- transcriptome$Symbol ranks$proteome <- rankFeatures(proteome$logFC, proteome$pValue) names(ranks$proteome) <- proteome$Symbol # run the enrichment es <- multiGSEA(pathways, ranks) extractPvalues( enrichmentScores = es, pathwayNames = names(pathways[[1]]) )
# Download pathway definition and extract features pathways <- getMultiOmicsFeatures(dbs = c("kegg"), layer = c("transcriptome", "proteome")) # load omics data and calculate ranks data(transcriptome) data(proteome) ranks <- initOmicsDataStructure(c("transcriptome", "proteome")) ranks$transcriptome <- rankFeatures(transcriptome$logFC, transcriptome$pValue) names(ranks$transcriptome) <- transcriptome$Symbol ranks$proteome <- rankFeatures(proteome$logFC, proteome$pValue) names(ranks$proteome) <- proteome$Symbol # run the enrichment es <- multiGSEA(pathways, ranks) extractPvalues( enrichmentScores = es, pathwayNames = names(pathways[[1]]) )
Function to extract the features (nodes) from a given list of pathways. These
pathways have to be compiled with the
pathways
function. Features can only be
extracted for \'proteins\' or \'metabolites\'. Features will by default be
mapped to gene symbols.
getFeatures( pathway, which = "proteins", org = "hsapiens", returntype = "SYMBOL" )
getFeatures( pathway, which = "proteins", org = "hsapiens", returntype = "SYMBOL" )
pathway |
A pathway created with |
which |
Mode to extract the features, either \'proteins\' or \'metabolites\'. |
org |
String specifying the organism, which is necessary for featureID mapping. Default: human |
returntype |
String that specifies the returning ID type. Default: SYMBOL Options (genes/proteins): SYMBOL, ENTREZID, UNIPROT, ENSEMBL, REFSEQ Options (metabolites): HMDB, CAS, DTXCID, DTXSID, SID, CID, ChEBI, KEGG, Drugbank |
Feature list with gene symbols (genes/proteins) or CHEBI IDs (metabolites)
pathways <- graphite::pathways("hsapiens", "kegg")[[1]] getFeatures(pathways) pathways <- graphite::pathways("mmusculus", "kegg")[[1]] getFeatures(pathways, which = "metabolites", org = "mmusculus", returntype = "HMDB") pathways <- graphite::pathways("mmusculus", "kegg")[[1]] getFeatures(pathways, which = "proteins", org = "mmusculus", returntype = "SYMBOL")
pathways <- graphite::pathways("hsapiens", "kegg")[[1]] getFeatures(pathways) pathways <- graphite::pathways("mmusculus", "kegg")[[1]] getFeatures(pathways, which = "metabolites", org = "mmusculus", returntype = "HMDB") pathways <- graphite::pathways("mmusculus", "kegg")[[1]] getFeatures(pathways, which = "proteins", org = "mmusculus", returntype = "SYMBOL")
Function to retrieve the gene/protein identifier mapping. Ongoing from genes/proteins retrieved from pathway definitions, which often include two or more ID formats or a format that is not present in your omics measurement, this function maps those IDs to a given format. Depending on the organism, additional packages have to be installed.
getGeneMapping(features, keytype, org = "hsapiens", returntype = "SYMBOL")
getGeneMapping(features, keytype, org = "hsapiens", returntype = "SYMBOL")
features |
List of identifiers to be mapped. |
keytype |
String specifying the ID type, e.g., "ENTREZID" or "UNIPROT". |
org |
String that defines the organism. Default: hsapiens
Options: see |
returntype |
String that specifies the returning ID type. Default: SYMBOL, Options: SYMBOL, ENTREZID, UNIPROT, ENSEMBL, REFSEQ |
List containing mapped gene/protein IDs.
features <- graphite::nodes(graphite::pathways("hsapiens", "kegg")[[1]]) features <- gsub("ENTREZID:", "", features) keytype <- "ENTREZID" getGeneMapping(features, keytype) getGeneMapping(features, keytype, returntype = "UNIPROT") features <- graphite::nodes(graphite::pathways("rnorvegicus", "reactome")[[1]]) features <- gsub("UNIPROT:", "", features) getGeneMapping(features, keytype = "UNIPROT", org = "rnorvegicus") getGeneMapping(features, keytype = "UNIPROT", org = "rnorvegicus", returntype = "ENSEMBL" )
features <- graphite::nodes(graphite::pathways("hsapiens", "kegg")[[1]]) features <- gsub("ENTREZID:", "", features) keytype <- "ENTREZID" getGeneMapping(features, keytype) getGeneMapping(features, keytype, returntype = "UNIPROT") features <- graphite::nodes(graphite::pathways("rnorvegicus", "reactome")[[1]]) features <- gsub("UNIPROT:", "", features) getGeneMapping(features, keytype = "UNIPROT", org = "rnorvegicus") getGeneMapping(features, keytype = "UNIPROT", org = "rnorvegicus", returntype = "ENSEMBL" )
Check by means of the given organism name if the required 'AnnotationDbi' package is installed. Select the ID mapping table based on the organism name and return it.
getIDMappingDatabase(organism)
getIDMappingDatabase(organism)
organism |
String that defines the organism. |
AnnotationDbi database for ID mapping.
Feature mappings will be used from hard disk in case they have been mapped before and 'useLocal' is not set to be FALSE. In other cases, a feature extraction will be done and the results are stored for a following occasion.
getMappedFeatures( pathways, returnID = "SYMBOL", organism = "hsapiens", which = "proteins", useLocal = TRUE )
getMappedFeatures( pathways, returnID = "SYMBOL", organism = "hsapiens", which = "proteins", useLocal = TRUE )
pathways |
List of pathway definitions. |
returnID |
String specifying the returned ID format. |
organism |
String defining the organism of analysis. |
which |
Mode to extract the features, either \'proteins\' or \'metabolites\'. |
useLocal |
Boolean specifying whether or not to use the local preprocessed mapping. |
List of mapped features for an omics layer.
This helper function extracts all used ID formats in all pathways and returns a nested list for each pathway database.
getMetaboliteIDformats(pathways)
getMetaboliteIDformats(pathways)
pathways |
List of pathway databases and their pathway definition. |
List of metabolite ID formats.
Function to retrieve the metabolite identifier mapping. Ongoing from metabolites retrieved from pathway definitions, which often include two or more ID formats, this function maps those IDs to a given format. The complete mapping table based on Comptox Dashboard, PubChem, HMDB, and ChEBI is provided in the AnnotationHub package metaboliteIDmapping.
getMetaboliteMapping(features, keytype, returntype = "HMDB")
getMetaboliteMapping(features, keytype, returntype = "HMDB")
features |
List of identifiers to be mapped. |
keytype |
String specifying the ID type, e.g., "ChEBI" or "KEGGCOMP". |
returntype |
String that specifies the returning ID type. Default: HMDB Options: HMDB, CAS, DTXCID, DTXSID, SID, CID, ChEBI, KEGG, Drugbank |
List containing mapped gene/protein IDs.
features <- graphite::nodes(graphite::pathways("hsapiens", "kegg")[[1]], which = "metabolites") features <- gsub("KEGGCOMP:", "", features) keytype <- "KEGG" getMetaboliteMapping(features, keytype) getMetaboliteMapping(features, keytype = "KEGG", returntype = "CID")
features <- graphite::nodes(graphite::pathways("hsapiens", "kegg")[[1]], which = "metabolites") features <- gsub("KEGGCOMP:", "", features) keytype <- "KEGG" getMetaboliteMapping(features, keytype) getMetaboliteMapping(features, keytype = "KEGG", returntype = "CID")
The functions makes use of the graphite R package to collect pathways from user specified databases. Depending on the omics layer specified, the function extracts either annotated genes/proteins (for transcriptome, proteome layer) or metabolites (for metabolite layer). The data structure that is returned is mandatory to calculate the multi-omics pathway enrichment.
getMultiOmicsFeatures( dbs = c("all"), layer = c("all"), returnTranscriptome = "SYMBOL", returnProteome = "SYMBOL", returnMetabolome = "HMDB", organism = "hsapiens", useLocal = TRUE )
getMultiOmicsFeatures( dbs = c("all"), layer = c("all"), returnTranscriptome = "SYMBOL", returnProteome = "SYMBOL", returnMetabolome = "HMDB", organism = "hsapiens", useLocal = TRUE )
dbs |
List of databases that should be queried for pathways. Default: all available databases |
layer |
List of omics layer that should be addressed. Default: all three layer (transcriptome, proteome, metabolome) |
returnTranscriptome |
String specifying the returned gene ID format. Default: SYMBOL Options: SYMBOL, ENTREZID, UNIPROT, ENSEMBL, REFSEQ |
returnProteome |
String specifying the returned protein ID format. Default: SYMBOL Options: SYMBOL, ENTREZID, UNIPROT, ENSEMBL, REFSEQ |
returnMetabolome |
String specifying the returned metabolite ID format. Default: HMDB Options: HMDB, CAS, DTXCID, DTXSID, SID, CID, ChEBI, KEGG, Drugbank |
organism |
String specifying the organism of interest. This has direct
influence on the available pathway databases. Default: "hsapiens"
Options: see |
useLocal |
Boolean to use local pathway/feature descriptions. In case useLocal is set to FALSE, pathway definitions and feature extraction will be recalculated. This could take several minutes depending on the database used. Pathbank, for example, contains nearly 50000 pathway definition that have to be re-mapped. useLocal has no effect when pathway definitions are retrieved for the first time. Default: TRUE |
Nested list with extracted and mapped pathway features.
getMultiOmicsFeatures( dbs = c("kegg"), layer = c("transcriptome", "proteome"), organism = "hsapiens" ) getMultiOmicsFeatures( dbs = c("kegg", "reactome"), layer = c("transcriptome", "metabolome"), organism = "mmusculus" ) getMultiOmicsFeatures( dbs = c("reactome"), layer = c("proteome"), organism = "rnorvegicus", returnProteome = "ENTREZID" )
getMultiOmicsFeatures( dbs = c("kegg"), layer = c("transcriptome", "proteome"), organism = "hsapiens" ) getMultiOmicsFeatures( dbs = c("kegg", "reactome"), layer = c("transcriptome", "metabolome"), organism = "mmusculus" ) getMultiOmicsFeatures( dbs = c("reactome"), layer = c("proteome"), organism = "rnorvegicus", returnProteome = "ENTREZID" )
Get a list of organisms that are covered in our workflow through a supporting 'AnnotationDBi' package. Without such a package we would not be able to map transcript and protein identifier between different formats. All the organisms that are listed here have at lest kegg and or reactome pathway annotation that can be queried by 'graphite'.
getOrganisms()
getOrganisms()
List of supported organisms
getOrganisms()
getOrganisms()
This function creates a data structure of nested but empty lists. One list for each omics layer. By default all three supported omics layer are used to create a data structures with three empty sublists: transcriptome, proteome, and metabolome.
initOmicsDataStructure(layer = c("transcriptome", "proteome", "metabolome"))
initOmicsDataStructure(layer = c("transcriptome", "proteome", "metabolome"))
layer |
List specifying the omics layer which should be created |
List with length(layer) empty sublists
initOmicsDataStructure() initOmicsDataStructure(c("transcriptome", "proteome")) initOmicsDataStructure(c("Transcriptome", "Metabolome"))
initOmicsDataStructure() initOmicsDataStructure(c("transcriptome", "proteome")) initOmicsDataStructure(c("Transcriptome", "Metabolome"))
Use the readRDS function to load the given file which should be in RDS format.
loadLocal(filename)
loadLocal(filename)
filename |
Path to the file to be read. |
Content of file.
This helper function becomes necessary since there are sometimes multiple ID formats used in a single pathway definition.
mapIDType(features, keytype = "CHEBI", maptype = "ChEBI", returntype = "HMDB")
mapIDType(features, keytype = "CHEBI", maptype = "ChEBI", returntype = "HMDB")
features |
List of metabolite feature IDs of the pathway. |
keytype |
String specifying the ID format in pathway definition. |
maptype |
String specifying the corresponding ID format in multiGSEA. |
returntype |
String identifying the ID type that should be mapped. |
List of mapped metabolite IDs.
Processed metabolomics data set that will be used throughout the vignette provided by the 'multiGSEA' package. The raw data was originally published by [Quiros _et al._](http://doi.org/10.1083/jcb.201702058) and can be accessed within the online supplementary material.
data(metabolome)
data(metabolome)
A tibble with 4 variables and 4881 measured proteome features:
HMDB identifier of measured metabolites.
Log2-transformed fold change between treatment and control.
P-value associated with the fold change.
Adjusted p-value associated with the fold change.
data(metabolome)
data(metabolome)
This function calculates GSEA-based enrichments scores for multiple omics
layer at once. Input pathways or gene sets have to be prepared in advance by
means of the function initOmicsDataStructure
. The function uses pre-
ranked lists for each omics layer to calculate the enrichment score. The
ranking can be calculated by means of the function
rankFeatures.
multiGSEA(pathways, ranks, eps = 0)
multiGSEA(pathways, ranks, eps = 0)
pathways |
Nested list containing all pathway features for the respective omics layer. |
ranks |
Nested list containing the measured and pre-ranked features for each omics layer. |
eps |
This parameter sets the boundary for calculating the p value. |
Nested list containing the enrichment scores for each given pathway and omics layer.
# Download pathway definition and extract features pathways <- getMultiOmicsFeatures(dbs = c("kegg"), layer = c("transcriptome", "proteome")) # load omics data and calculate ranks data(transcriptome) data(proteome) ranks <- initOmicsDataStructure(c("transcriptome", "proteome")) ranks$transcriptome <- rankFeatures(transcriptome$logFC, transcriptome$pValue) names(ranks$transcriptome) <- transcriptome$Symbol ranks$proteome <- rankFeatures(proteome$logFC, proteome$pValue) names(ranks$proteome) <- proteome$Symbol ## run the enrichment multiGSEA(pathways, ranks)
# Download pathway definition and extract features pathways <- getMultiOmicsFeatures(dbs = c("kegg"), layer = c("transcriptome", "proteome")) # load omics data and calculate ranks data(transcriptome) data(proteome) ranks <- initOmicsDataStructure(c("transcriptome", "proteome")) ranks$transcriptome <- rankFeatures(transcriptome$logFC, transcriptome$pValue) names(ranks$transcriptome) <- transcriptome$Symbol ranks$proteome <- rankFeatures(proteome$logFC, proteome$pValue) names(ranks$proteome) <- proteome$Symbol ## run the enrichment multiGSEA(pathways, ranks)
Processed proteomics data set that will be used throughout the vignette provided by the 'multiGSEA' package. The raw data was originally published by [Quiros _et al._](http://doi.org/10.1083/jcb.201702058) and deposited at [ProteomeXchange](http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD006293).
data(proteome)
data(proteome)
A tibble with 4 variables and 8275 measured proteome features:
HGNC symbol of measured proteins.
Log2-transformed fold change between treatment and control.
P-value associated with the fold change.
Adjusted p-value associated with the fold change.
data(proteome)
data(proteome)
Rank features based on the direction of their fold change and their magnitude implicated through their assigned p-value.
rankFeatures(logFC, pvalues, base = 10)
rankFeatures(logFC, pvalues, base = 10)
logFC |
Vector containing the log-transformed fold changes of features. |
pvalues |
Vector containing the p-values associated with those logFCs. |
base |
Integer specifying the base of the logarithm. Default: 10 |
Vector of pre-ranked features, still unsorted
logFC <- rnorm(10) pvalues <- runif(10) rankFeatures(logFC, pvalues)
logFC <- rnorm(10) pvalues <- runif(10) rankFeatures(logFC, pvalues)
It might happen that there are duplicated strings in a list. With this function we will rename those duplicated entries in a way that we simply add the number of occurrences to the string. I.e., when the string foo occurs three times in a list, it will be renamed to foo_1, foo_2, and foo_3, respectively.
rename_duplicates(names)
rename_duplicates(names)
names |
List of strings where duplicates should be renamed |
List where duplicates are renamed.
l <- c("foo", "bar", "foo", "bars") rename_duplicates(l)
l <- c("foo", "bar", "foo", "bars") rename_duplicates(l)
Processed transcriptomics data set that will be used throughout the vignette provided by the 'multiGSEA' package. The raw data was originally published by [Quiros _et al._](http://doi.org/10.1083/jcb.201702058) and deposited at [NCBI Geo](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84631).
data(transcriptome)
data(transcriptome)
A tibble with 4 variables and 15174 measured transcriptome features:
HGNC symbol of measured transcripts.
Log2-transformed fold change between treatment and control.
P-value associated with the fold change.
Adjusted p-value associated with the fold change.
data(transcriptome)
data(transcriptome)