Title: | Generally Applicable Gene-set Enrichment for Pathway Analysis |
---|---|
Description: | GAGE is a published method for gene set (enrichment or GSEA) or pathway analysis. GAGE is generally applicable independent of microarray or RNA-Seq data attributes including sample sizes, experimental designs, assay platforms, and other types of heterogeneity, and consistently achieves superior performance over other frequently used methods. In gage package, we provide functions for basic GAGE analysis, result processing and presentation. We have also built pipeline routines for of multiple GAGE analyses in a batch, comparison between parallel analyses, and combined analysis of heterogeneous data from different sources/studies. In addition, we provide demo microarray data and commonly used gene set data based on KEGG pathways and GO terms. These funtions and data are also useful for gene set analysis using other methods. |
Authors: | Weijun Luo |
Maintainer: | Weijun Luo <[email protected]> |
License: | GPL (>=2.0) |
Version: | 2.57.0 |
Built: | 2024-11-29 08:14:45 UTC |
Source: | https://github.com/bioc/gage |
These functions convert Entrez Gene IDs to official gene symbols for human genes, or vise versa.
eg2sym(eg) sym2eg(sym)
eg2sym(eg) sym2eg(sym)
eg |
character vector for Entrez Gene IDs (human genes only). |
sym |
character vector for official gene symbols (human genes only). |
Currently, only conversion for human genes are supported. Notice that some gene symbols are not official, hence not recognized and NA will be returned in such cases.
A character vector giving the converted official gene symbols or Entrez IDs.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
egSymb
mapping data between Entrez Gene IDs and official
symbols;
readList
read in gene set list
#genes in gse16873 was label by Entrez IDs data(gse16873) head(rownames(gse16873)) #may convert the gene IDs to official symbols gse16873.sym<-gse16873 data(egSymb) rownames(gse16873.sym)<-eg2sym(rownames(gse16873.sym)) head(rownames(gse16873.sym)) #convert kegg.gs correspondingly data(kegg.gs) kegg.gs.sym<-lapply(kegg.gs, eg2sym) lapply(kegg.gs.sym[1:3],head) #GAGE analysis with the converted data cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) gse16873.kegg.p2 <- gage(gse16873.sym, gsets = kegg.gs.sym, ref = hn, samp = dcis)
#genes in gse16873 was label by Entrez IDs data(gse16873) head(rownames(gse16873)) #may convert the gene IDs to official symbols gse16873.sym<-gse16873 data(egSymb) rownames(gse16873.sym)<-eg2sym(rownames(gse16873.sym)) head(rownames(gse16873.sym)) #convert kegg.gs correspondingly data(kegg.gs) kegg.gs.sym<-lapply(kegg.gs, eg2sym) lapply(kegg.gs.sym[1:3],head) #GAGE analysis with the converted data cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) gse16873.kegg.p2 <- gage(gse16873.sym, gsets = kegg.gs.sym, ref = hn, samp = dcis)
A two column matrix listing the Entrez IDs and official symbols for all currently known human genes.
data(egSymb)
data(egSymb)
The format is: chr [1:40784, 1:2] "1" "10" "100" "1000" ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:2] "eg" "sym"
This mapping matrix is commonly used together with functions
eg2sym
and sym2eg
. Check the examples below.
This mapping data matrix were compiled using the gene data from NCBI Entrez Gene database.
Similar information can also be derived from Bioconductor package org.Hs.eg.db. Please check the package for more information.
Entrez Gene <URL: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene>
#genes in gse16873 was label by Entrez IDs data(gse16873) head(rownames(gse16873)) #may convert the gene IDs to official symbols gse16873.sym<-gse16873 data(egSymb) rownames(gse16873.sym)<-eg2sym(rownames(gse16873.sym)) head(rownames(gse16873.sym)) #convert kegg.gs correspondingly data(kegg.gs) kegg.gs.sym<-lapply(kegg.gs, eg2sym) lapply(kegg.gs.sym[1:3],head) #GAGE analysis with the converted data cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) gse16873.kegg.p2 <- gage(gse16873.sym, gsets = kegg.gs.sym, ref = hn, samp = dcis)
#genes in gse16873 was label by Entrez IDs data(gse16873) head(rownames(gse16873)) #may convert the gene IDs to official symbols gse16873.sym<-gse16873 data(egSymb) rownames(gse16873.sym)<-eg2sym(rownames(gse16873.sym)) head(rownames(gse16873.sym)) #convert kegg.gs correspondingly data(kegg.gs) kegg.gs.sym<-lapply(kegg.gs, eg2sym) lapply(kegg.gs.sym[1:3],head) #GAGE analysis with the converted data cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) gse16873.kegg.p2 <- gage(gse16873.sym, gsets = kegg.gs.sym, ref = hn, samp = dcis)
This function extract a non-redundant signcant gene set list, groups of
redundant gene sets, and related data from gage
results. Redundant gene sets are those overlap heavily in their
effective member gene lists or core genes.
esset.grp(setp, exprs, gsets, ref = NULL, samp = NULL, test4up = TRUE, same.dir = TRUE, compare = "paired", use.fold = TRUE, cutoff = 0.01, use.q = FALSE, pc = 10^-10, output = TRUE, outname = "esset.grp", make.plot = FALSE, pdf.size = c(7, 7), core.counts = FALSE, get.essets = TRUE, bins = 10, bsize = 1, cex = 0.5, layoutType = "circo", name.str = c(10, 100), ...)
esset.grp(setp, exprs, gsets, ref = NULL, samp = NULL, test4up = TRUE, same.dir = TRUE, compare = "paired", use.fold = TRUE, cutoff = 0.01, use.q = FALSE, pc = 10^-10, output = TRUE, outname = "esset.grp", make.plot = FALSE, pdf.size = c(7, 7), core.counts = FALSE, get.essets = TRUE, bins = 10, bsize = 1, cex = 0.5, layoutType = "circo", name.str = c(10, 100), ...)
setp |
a numeric matrix, the result p-value matrix returned by |
exprs |
an expression matrix or matrix-like data structure, with genes as rows and samples as columns. |
gsets |
a named list, each element contains a gene set that is a character
vector of gene IDs or symbols. For example, type |
ref |
a numeric vector of column numbers for the reference condition or phenotype (i.e. the control group) in the exprs data matrix. Default ref = NULL, all columns are considered as target experiments. |
samp |
a numeric vector of column numbers for the target condition or phenotype (i.e. the experiment group) in the exprs data matrix. Default samp = NULL, all columns other than ref are considered as target experiments. |
test4up |
boolean, whether the input |
same.dir |
boolean, whether the input |
compare |
character, which comparison scheme to be used: 'paired', 'unpaired', '1ongroup', 'as.group'. 'paired' is the default, ref and samp are of equal length and one-on-one paired by the original experimental design; 'as.group', group-on-group comparison between ref and samp; 'unpaired' (used to be '1on1'), one-on-one comparison between all possible ref and samp combinations, although the original experimental design may not be one-on-one paired; '1ongroup', comparison between one samp column at a time vs the average of all ref columns. |
use.fold |
Boolean, whether the input |
cutoff |
numeric, p- or q-value cutoff, between 0 and 1. Default 0.01 (for p-value). When q-value is used, recommended cutoff value is 0.1. |
use.q |
boolean, whether to use q-value or not as the pre-selection of a signficant gene set list. Default to be FALSE, i.e. use the p-value instead. |
pc |
numeric, cutoff p-value for the overlap between gene sets to be
called 'redundant', default to |
output |
boolean, whether output the non-redundant gene set list as tab-delimited text file? Default to be TRUE. |
outname |
character, the prefix used to label the output file names when output = TRUE. |
make.plot |
boolean, whether to generate the network graph to visualize the redundancy (overlap in core genes) between significant gene sets. Currently the only feasible option is FALSE. |
pdf.size |
numeric vector of length 2, spcifies the PDF file size for network graph outpout. Currently unsupported. |
core.counts |
Currently unsupported. |
get.essets |
Currently unsupported. |
bins |
Currently unsupported. |
bsize |
Currently unsupported. |
cex |
Currently unsupported. |
layoutType |
Currently unsupported. |
name.str |
numeric vector of length 2, specifies the substring range of the gene set name to show in the network graph. Currently unsupported. |
... |
extra arguments to be passed into internal function
|
Redundant gene sets are defined to be those overlap heavily in their effective
member gene lists or core genes. Core genes are those member genes
that really contribute to the signficance of the gene set in GAGE
analysis in the interesting direction(s). Argument pc
set the cutoff for
the overlap to be called "redundant". The redundancy between
gene sets is then represented by a undirected graph/network. Groups of
redundant gene sets are then derived as the connected component in the
network graph.
The selection criterion for gene sets here is p-value, instead of the commonly used q-value. This is because for extracting a non-redundant list of signficant gene sets, p-value is relative stable, but q-value changes when the total number of gene sets being considered changes. Of course, q-value is also a sensible selection criterion, when one take this step as a further refinement on the list of signficant gene sets.
The value returned by pairData
is a list of 7 elements:
essentialSets |
character vector, the non-redundant signficant gene set list. |
setGroups |
list, each element is a character vector of a group of redundant gene sets. |
allSets |
character vector, the full list of signficant gene sets. |
setGroups |
list, each element is a character vector of a connected component in the redundancy graph representation of the gene set. |
overlapCounts |
numeric matrix, the overlap core gene counts between the signficant gene sets. |
overlapPvals |
numeric matrix, the significance (in p-values) of the overlap core gene counts between the signficant gene sets. |
coreGeneSets |
list, each element is a character vector of the core genes in a significant gene set. |
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gage
the main function for GAGE analysis;
sigGeneSet
significant gene set from GAGE analysis;
essGene
essential member genes in a gene set;
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) #kegg test for 1-directional changes gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #kegg test for 2-directional changes gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = FALSE) gse16873.kegg.esg.up <- esset.grp(gse16873.kegg.p$greater, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = TRUE, output = TRUE, outname = "gse16873.kegg.up", make.plot = FALSE) gse16873.kegg.esg.dn <- esset.grp(gse16873.kegg.p$less, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = FALSE, output = TRUE, outname = "gse16873.kegg.dn", make.plot = FALSE) gse16873.kegg.esg.2d <- esset.grp(gse16873.kegg.2d.p$greater, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = TRUE, output = TRUE, outname = "gse16873.kegg.2d", make.plot = FALSE) names(gse16873.kegg.esg.up) head(gse16873.kegg.esg.up$essentialSets, 4) head(gse16873.kegg.esg.up$setGroups, 4) head(gse16873.kegg.esg.up$coreGeneSets, 4)
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) #kegg test for 1-directional changes gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #kegg test for 2-directional changes gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = FALSE) gse16873.kegg.esg.up <- esset.grp(gse16873.kegg.p$greater, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = TRUE, output = TRUE, outname = "gse16873.kegg.up", make.plot = FALSE) gse16873.kegg.esg.dn <- esset.grp(gse16873.kegg.p$less, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = FALSE, output = TRUE, outname = "gse16873.kegg.dn", make.plot = FALSE) gse16873.kegg.esg.2d <- esset.grp(gse16873.kegg.2d.p$greater, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = TRUE, output = TRUE, outname = "gse16873.kegg.2d", make.plot = FALSE) names(gse16873.kegg.esg.up) head(gse16873.kegg.esg.up$essentialSets, 4) head(gse16873.kegg.esg.up$setGroups, 4) head(gse16873.kegg.esg.up$coreGeneSets, 4)
This function extracts data for essential member genes in a gene set. Essential genes are genes that have changes over noise level.
essGene(gs, exprs, ref = NULL, samp = NULL, gsets = NULL, compare = "paired", use.fold = TRUE, rank.abs = FALSE, use.chi = FALSE, chi.p = 0.05, ...)
essGene(gs, exprs, ref = NULL, samp = NULL, gsets = NULL, compare = "paired", use.fold = TRUE, rank.abs = FALSE, use.chi = FALSE, chi.p = 0.05, ...)
gs |
character, either the name of an interesting gene set in a gene set
collection passed by |
exprs |
an expression matrix or matrix-like data structure, with genes as rows and samples as columns. |
ref |
a numeric vector of column numbers for the reference condition or phenotype (i.e. the control group) in the exprs data matrix. Default ref = NULL, all columns are considered as target experiments. |
samp |
a numeric vector of column numbers for the target condition or phenotype (i.e. the experiment group) in the exprs data matrix. Default samp = NULL, all columns other than ref are considered as target experiments. |
gsets |
a named list, each element contains a gene set that is a character
vector of gene IDs or symbols. For example, type |
compare |
character, which comparison scheme to be used: 'paired', 'unpaired', '1ongroup', 'as.group'. 'paired' is the default, ref and samp are of equal length and one-on-one paired by the original experimental design; 'as.group', group-on-group comparison between ref and samp; 'unpaired' (used to be '1on1'), one-on-one comparison between all possible ref and samp combinations, although the original experimental design may not be one-on-one paired; '1ongroup', comparison between one samp column at a time vs the average of all ref columns. |
use.fold |
Boolean, whether the input |
rank.abs |
boolean, whether to sort the essential gene data based on absoluate changes. Default to be FALSE. |
use.chi |
boolean, whether to use chi-square test to select the essential genes. Default to be FALSE, use the mean plus standard deviation of all gene changes instead. Check details for more information. |
chi.p |
numeric value between 0 and 1, cutoff p-value for the chi-square test to select the essential genes. Default to 0.05. |
... |
other arguments to be passed into the inside |
There are two different criteria for essential gene selection. One uses a chi-square test to determin whether the change of a gene is more than noise. A second considers any changes beyond 1 standard deviation from mean of all genes as real.
Note that essential genes are different from core genes considered in
esset.grp
function. Essential genes may change in a different
direction than the overall change of a gene set. But core genes need to
change in the in the interesting direction(s) of the gene set test.
A expression data matrix extracted for the essential genes, with
similar structure as exprs
.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gage
the main function for GAGE analysis;
geneData
output and visualization of expression data
for selected genes;
esset.grp
non-redundant signcant gene set list;
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) #kegg test for 1-directional changes data(kegg.gs) gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) rownames(gse16873.kegg.p$greater)[1:3] gs=unique(unlist(kegg.gs[rownames(gse16873.kegg.p$greater)[1:3]])) essData=essGene(gs, gse16873, ref =hn, samp =dcis) head(essData) ref1=1:6 samp1=7:12 #generated text file for data table, pdf files for heatmap and scatterplot for (gs in rownames(gse16873.kegg.p$greater)[1:3]) { outname = gsub(" |:|/", "_", substr(gs, 10, 100)) geneData(genes = kegg.gs[[gs]], exprs = essData, ref = ref1, samp = samp1, outname = outname, txt = TRUE, heatmap = TRUE, Colv = FALSE, Rowv = FALSE, dendrogram = "none", limit = 3, scatterplot = TRUE) }
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) #kegg test for 1-directional changes data(kegg.gs) gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) rownames(gse16873.kegg.p$greater)[1:3] gs=unique(unlist(kegg.gs[rownames(gse16873.kegg.p$greater)[1:3]])) essData=essGene(gs, gse16873, ref =hn, samp =dcis) head(essData) ref1=1:6 samp1=7:12 #generated text file for data table, pdf files for heatmap and scatterplot for (gs in rownames(gse16873.kegg.p$greater)[1:3]) { outname = gsub(" |:|/", "_", substr(gs, 10, 100)) geneData(genes = kegg.gs[[gs]], exprs = essData, ref = ref1, samp = samp1, outname = outname, txt = TRUE, heatmap = TRUE, Colv = FALSE, Rowv = FALSE, dendrogram = "none", limit = 3, scatterplot = TRUE) }
Run GAGE analysis to infer gene sets (or pathways, functional groups etc) that are signficantly perturbed relative to all genes considered. GAGE is generally applicable to essentially all microarray dta independent of data attributes including sample size, experimental layout, study design, and all types of heterogeneity in data generation.
gage is the main function; gagePrep is the functions for the initial data preparation, especially sample pairing; gageSum carries out the final meta-test summarization.
gage(exprs, gsets, ref = NULL, samp = NULL, set.size = c(10, 500), same.dir = TRUE, compare = "paired", rank.test = FALSE, use.fold = TRUE, FDR.adj = TRUE, weights = NULL, full.table = FALSE, saaPrep = gagePrep, saaTest = gs.tTest, saaSum = gageSum, use.stouffer=TRUE, ...) gagePrep(exprs, ref = NULL, samp = NULL, same.dir = TRUE, compare = "paired", rank.test = FALSE, use.fold = TRUE, weights = NULL, full.table = FALSE, ...) gageSum(rawRes, ref = NULL, test4up = TRUE, same.dir = TRUE, compare = "paired", use.fold = TRUE, weights = NULL, full.table = FALSE, use.stouffer=TRUE, ...)
gage(exprs, gsets, ref = NULL, samp = NULL, set.size = c(10, 500), same.dir = TRUE, compare = "paired", rank.test = FALSE, use.fold = TRUE, FDR.adj = TRUE, weights = NULL, full.table = FALSE, saaPrep = gagePrep, saaTest = gs.tTest, saaSum = gageSum, use.stouffer=TRUE, ...) gagePrep(exprs, ref = NULL, samp = NULL, same.dir = TRUE, compare = "paired", rank.test = FALSE, use.fold = TRUE, weights = NULL, full.table = FALSE, ...) gageSum(rawRes, ref = NULL, test4up = TRUE, same.dir = TRUE, compare = "paired", use.fold = TRUE, weights = NULL, full.table = FALSE, use.stouffer=TRUE, ...)
exprs |
an expression matrix or matrix-like data structure, with genes as rows and samples as columns. |
gsets |
a named list, each element contains a gene set that is a character
vector of gene IDs or symbols. For example, type |
ref |
a numeric vector of column numbers for the reference condition or phenotype (i.e. the control group) in the exprs data matrix. Default ref = NULL, all columns are considered as target experiments. |
samp |
a numeric vector of column numbers for the target condition or phenotype (i.e. the experiment group) in the exprs data matrix. Default samp = NULL, all columns other than ref are considered as target experiments. |
set.size |
gene set size (number of genes) range to be considered for enrichment test. Tests for too small or too big gene sets are not robust statistically or informative biologically. Default to be set.size = c(10, 500). |
same.dir |
boolean, whether to test for changes in a gene set toward a single direction (all genes up or down regulated) or changes towards both directions simultaneously. For experimentally derived gene sets, GO term groups, etc, coregulation is commonly the case, hence same.dir = TRUE (default); In KEGG, BioCarta pathways, genes frequently are not coregulated, hence it could be informative to let same.dir = FALSE. Although same.dir = TRUE could also be interesting for pathways. |
compare |
character, which comparison scheme to be used: 'paired', 'unpaired', '1ongroup', 'as.group'. 'paired' is the default, ref and samp are of equal length and one-on-one paired by the original experimental design; 'as.group', group-on-group comparison between ref and samp; 'unpaired' (used to be '1on1'), one-on-one comparison between all possible ref and samp combinations, although the original experimental design may not be one-on-one paired; '1ongroup', comparison between one samp column at a time vs the average of all ref columns. For PAGE-like analysis, the default is compare='as.group', which is the only option provided in the original PAGE method. All other comparison schemas are set here for direct comparison to gage. |
rank.test |
rank.test: Boolean, whether do the optional rank based two-sample t-test (equivalent to the non-parametric Wilcoxon Mann-Whitney test) instead of parametric two-sample t-test. Default rank.test = FALSE. This argument should be used with respect to argument saaTest. |
use.fold |
Boolean, whether to use fold changes or t-test statistics as per gene statistics. Default use.fold= TRUE. |
FDR.adj |
Boolean, whether to do adjust for multiple testing as to control FDR (False dicovery rate). Default to be TRUE. |
weights |
a numeric vector to specify the weights assigned to pairs of ref-samp. This is needed for data with both technical replicates and biological replicates as to count for the different contributions from the two types of replicates. This argument is also useful in manually paring ref-samp for unpaired data, as in pairData function. function. Default to be NULL. |
full.table |
This option is obsolete. Boolean, whether to output the full table of all individual p-values from the pairwise comparisons of ref and samp. Default to be FALSE. |
saaPrep |
function used for data preparation for single array based analysis, including sanity check, sample pairing, per gene statistics calculation etc. Default to be gagePrep, i.e. the default data preparation routine for gage analysis. |
saaTest |
function used for gene set tests for single array based analysis. Default to be gs.tTest, which features a two-sample t-test for differential expression of gene sets. Other options includes: gs.zTest, using one-sample z-test as in PAGE, or gs.KSTest, using the non-parametric Kolmogorov-Smirnov tests as in GSEA. The two non-default options should only be used when rank.test = FALSE. |
saaSum |
function used for summarization of the results from single array analysis (i.e. pairwise comparison between ref and samp). This function should include a meta-test for a global p-value or summary statistis and a FDR adjustment for multi-testing issue. Default to be gageSum, i.e. the default data summarization routine for gage analysis. |
rawRes |
a named list, the raw results of gene set tests. Check the help
information of gene set test functions like |
test4up |
boolean, whether to summarize the p-value results for up-regulation
test (p.results) or not (ps.results for down-regulation).
This argument is only needed when
the argument same.dir=TRUE in the main |
use.stouffer |
Boolean, whether to use Stouffer's method when summarizing individual p-values into a global p-value. Default to be TRUE. This default setting is recommended as to avoid the "dual significance", i.e. a gene set is significant for both up-regulation and down-regulation tests simultaneously. Dual signficance occurs sometimes for data with large sample size, due to extremely small p-values in a few pair-wise comparison. More robust p-value summarization using Stouffer's method is a important new feature added to GAGE since version 2.2.0 (Bioconductor 2.8). This new argument is set as to provide a option to the original summarization based on Gamma distribution (FALSE). |
... |
other arguments to be passed into the optional functions for saaPrep, saaTest and saaSum. |
We proposed a single array analysis (i.e. the one-on-one comparison) approach with GAGE. Here we made single array analysis a general workflow for gene set analysis. Single array analysis has 4 major steps: Step 1 sample pairing, Step 2 per gene tests, Step 3 gene set tests and Step 4 meta-test summarization. Correspondingly, this new main function, gage, is divided into 3 relatively independent modules. Module 1 input preparation covers step 1-2 of single array analysis. Module 2 corresponds to step 3 gene set test, and module 3 to step 4 meta-test summarization. These 3 modules become 3 argument functions to gage, saaPrep, saaTest and saaSum. The modulization made gage open to customization or plug-in routines at each steps and fully realize the general applicability of single array analysis. More examples will be included in a second vignette to demonstrate the customization with these modules.
some important updates has been made to gage package since version 2.2.0
(Bioconductor 2.8):
First, more robust p-value summarization using Stouffer's method
through argument use.stouffer=TRUE. The original p-value
summarization, i.e. negative log sum following a Gamma distribution as
the Null hypothesis, may produce less stable global p-values for large
or heterogenous datasets. In other words, the global p-value could be
heavily affected by a small subset of extremely small individual
p-values from pair-wise comparisons. Such sensitive global p-value
leads to the "dual signficance" phenomenon. Dual-signficant means a gene set is called
significant simultaneously in both 1-direction tests (up- and
down-regulated). "Dual
signficance" could be informative in revealing the sub-types or
sub-classes in big clinical or disease studies, but may not be
desirable in other cases.
Second, output of gage function now includes the gene set test
statistics from pair-wise comparisons for all proper gene sets. The
output is always a named list now, with either 3 elements
("greater", "less", "stats") for one-directional test or 2 elements
("greater", "stats") for two-directional test.
Third, the individual p-value (and test statistics)from dependent pair-wise
comparisions, i.e. comparisions between the same experiment vs
different controls, are now summarized into a single value. In other
words, the column number of individual p-values or statistics is
always the same as the sample number in the experiment (or disease)
group. This change made the argument value compare="1ongroup"
and argument full.table less useful. It also became easier to check the
perturbations at gene-set level for individual samples.
Fourth, whole gene-set level changes (either p-values or statistics)
can now be visualized using heatmaps due to the third change above.
Correspondingly, functions sigGeneSet
and gagePipe
have
been revised to plot heatmaps for whole gene sets.
The result returned by gage function is a named list, with either 3 elements ("greater", "less", "stats") for one-directional test (same.dir = TRUE) or 2 elements ("greater", "stats") for two-directional test (same.dir = FALSE). Elements "greater" and "less" are two data matrices of the same structure, mainly the p-values, element "stats" contains the test statistics. Each data matrix here has gene sets as rows sorted by global p- or q-values. Test signficance or statistics columns include:
p.geomean |
geometric mean of the individual p-values from multiple single array based gene set tests |
stat.mean |
mean of the individual statistics from multiple single array based gene set tests. Normally, its absoluate value measures the magnitude of gene-set level changes, and its sign indicates direction of the changes. When saaTest=gs.KSTest, stat.mean is always positive. |
p.val |
gloal p-value or summary of the individual p-values from multiple single array based gene set tests. This is the default p-value being used. |
q.val |
FDR q-value adjustment of the global p-value using the Benjamini & Hochberg procedure implemented in multtest package. This is the default q-value being used. |
set.size |
the effective gene set size, i.e. the number of genes included in the gene set test |
other columns |
columns of the individual p-values or statistics, each measures the gene set perturbation in a single experiment (vs its control or all controls, depends on the "compare argument value) |
The result returned by gagePrep
is a data matrix derived from
exprs
, but ready for column-wise gene est tests. In the
matrix, genes are rows, and columns are the per gene test
statistics from the ref-samp pairwise comparison.
The result returned by gageSum
is almost identical to the results
of gage
function, it is also a named list but has only 2
elements, "p.glob" and "results", with one round of test results.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gs.tTest
, gs.zTest
, and
gs.KSTest
functions used for gene set test;
gagePipe
and heter.gage
function used for
multiple GAGE analysis in a batch or combined GAGE analysis on
heterogeneous data
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) data(go.gs) #kegg test for 1-directional changes gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #go.gs with the first 1000 entries as a fast example. gse16873.go.p <- gage(gse16873, gsets = go.gs, ref = hn, samp = dcis) str(gse16873.kegg.p) head(gse16873.kegg.p$greater) head(gse16873.kegg.p$less) head(gse16873.kegg.p$stats) #kegg test for 2-directional changes gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = FALSE) head(gse16873.kegg.2d.p$greater) head(gse16873.kegg.2d.p$stats) ###alternative ways to do GAGE analysis### #with unpaired samples gse16873.kegg.unpaired.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, compare = "unpaired") #other options to tweak includes: #saaTest, use.fold, rank.test, etc. Check arguments section above for #details and the vignette for more examples.
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) data(go.gs) #kegg test for 1-directional changes gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #go.gs with the first 1000 entries as a fast example. gse16873.go.p <- gage(gse16873, gsets = go.gs, ref = hn, samp = dcis) str(gse16873.kegg.p) head(gse16873.kegg.p$greater) head(gse16873.kegg.p$less) head(gse16873.kegg.p$stats) #kegg test for 2-directional changes gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = FALSE) head(gse16873.kegg.2d.p$greater) head(gse16873.kegg.2d.p$stats) ###alternative ways to do GAGE analysis### #with unpaired samples gse16873.kegg.unpaired.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, compare = "unpaired") #other options to tweak includes: #saaTest, use.fold, rank.test, etc. Check arguments section above for #details and the vignette for more examples.
This function is used to compare the results after running multiple
rounds of GAGE analysis. It is frequently used after batch analysis
using gagePipe
, but may also be used after multiple runs of
gage
manually.
gageComp(sampnames, dataname, gsname = c("kegg.gs", "go.gs"), use.cols = c("stat.mean", "q.val"), q.cutoff = 0.1, do.plot = TRUE)
gageComp(sampnames, dataname, gsname = c("kegg.gs", "go.gs"), use.cols = c("stat.mean", "q.val"), q.cutoff = 0.1, do.plot = TRUE)
sampnames |
character vector, the names of the sample groups, on which the GAGE
analysis has been done and to be compared. This same argument is
used in |
dataname |
character, the name of the data on which the GAGE analysis has been
done. This same argument is used in |
gsname |
character, the name(s) of the gene set collection(s) to be
considered in the comparison. In other words, this argument
specifies GAGE analysis results with what type(s) of gene sets
are to be compared on. Default to be |
use.cols |
character, what columns in the |
q.cutoff |
numeric, q-value cutoff between 0 and 1 for signficant gene sets
selection. Default to be 0.1. The same argument is used in
|
do.plot |
boolean, whether to plot the venn diagram for the comparison results. Default to be TRUE. |
gageComp
works with the results of gagePipe
run by
default. Try to load the .RData file named after dataname
first. It there is no such file, it assumes that the gage
result objects have been loaded and exist in the global environment.
For the GAGE analysis results with each gene set collection specified
in gsname
, gagePipe
compares the signficant gene set
lists between the sample groups specified in sampnames
. For
each gene set collection, three comparisons will be done, on the
2-direction perturbed, up-regulated, and down-regulated gene sets.
The comparison results are output as tab-delimited text files. Venn digrams are only plot for comparison between 2-3 parties. But the text file outputs are not limited by the number of parties under comparison. The venn diagram is generated by calling a revised function based on the VennDigram function from limma package.
The function returns invisible 1 when successfully executed.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gagePipe
pipeline for multiple GAGE analysis in a batch;
gage
the main function for GAGE analysis
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) library(gageData) data(gse16873.2) cn2=colnames(gse16873.2) hn2=grep('HN',cn2, ignore.case =TRUE) dcis2=grep('DCIS',cn2, ignore.case =TRUE) #multiple GAGE analysis in a batch with the combined data gse16873=cbind(gse16873, gse16873.2) dataname='gse16873' #output data prefix sampnames=c('dcis.1', 'dcis.2') refList=list(hn, hn2+12) sampList=list(dcis, dcis2+12) gagePipe(gse16873, gsname = "kegg.gs", dataname = "gse16873", sampnames = sampnames, ref.list = refList, samp.list = sampList, comp.list = "paired") #follow up comparison between the analyses load('gse16873.gage.RData') #list gage result objects objects(pat = "[.]p$") gageComp(sampnames, dataname, gsname = "kegg.gs", do.plot = TRUE)
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) library(gageData) data(gse16873.2) cn2=colnames(gse16873.2) hn2=grep('HN',cn2, ignore.case =TRUE) dcis2=grep('DCIS',cn2, ignore.case =TRUE) #multiple GAGE analysis in a batch with the combined data gse16873=cbind(gse16873, gse16873.2) dataname='gse16873' #output data prefix sampnames=c('dcis.1', 'dcis.2') refList=list(hn, hn2+12) sampList=list(dcis, dcis2+12) gagePipe(gse16873, gsname = "kegg.gs", dataname = "gse16873", sampnames = sampnames, ref.list = refList, samp.list = sampList, comp.list = "paired") #follow up comparison between the analyses load('gse16873.gage.RData') #list gage result objects objects(pat = "[.]p$") gageComp(sampnames, dataname, gsname = "kegg.gs", do.plot = TRUE)
Function gagePipe runs mutliple rounds of GAGE in a batch without interference, and outputs signcant gene set lists in text format, heatmaps in pdf format, and save the results in RData format.
gagePipe(arraydata, dataname = "arraydata", trim.at=TRUE, sampnames, gsdata = NULL, gsname = c("kegg.gs", "go.gs"), ref.list, samp.list, weight.list = NULL, comp.list = "paired", q.cutoff = 0.1, heatmap=TRUE, pdf.size = c(7, 7), p.limit=c(0.5, 5.5), stat.limit=5, ... )
gagePipe(arraydata, dataname = "arraydata", trim.at=TRUE, sampnames, gsdata = NULL, gsname = c("kegg.gs", "go.gs"), ref.list, samp.list, weight.list = NULL, comp.list = "paired", q.cutoff = 0.1, heatmap=TRUE, pdf.size = c(7, 7), p.limit=c(0.5, 5.5), stat.limit=5, ... )
arraydata |
corresponds to |
dataname |
character, the name of the data to be analyzed. This name will be included as the prefix of the output file names. Default to be "arraydata". |
trim.at |
boolean, whether to trim the suffix "_at" from the probe set IDs or row names of the microarray data. Default to be TRUE. |
sampnames |
character vector, the names of the sample groups, on which the GAGE analysis is done. Each sample groups corresponds to one element of samp.list and the matching element of ref.list. These names will be included in the output file names or object names. |
gsdata |
character, the full path to the gene set data file in .RData format if
the data has not been loaded. Default to be NULL, i.e. the gene set
data has been loaded.
Make sure that the same gene ID system is used for both |
gsname |
character, the name(s) of the gene set collections to be
used. Default to be |
ref.list |
a list of |
samp.list |
a list of |
weight.list |
a list or a vector of |
comp.list |
a list or a vector of |
q.cutoff |
numeric, q-value cutoff between 0 and 1 for signficant gene sets selection. Default to be 0.1. |
heatmap |
boolean, whether to plot heatmap for the selected gene data as a PDF file. Default to be FALSE. |
pdf.size |
a numeric vector to specify the the width and height of PDF graphics region in inches. Default to be c(7, 7). |
stat.limit |
numeric vector of length 1 or 2 to specify the value range of gene set statistics to visualize using the heatmap. Statistics beyong will be reset to equal the proximal limit. Default to 5, i.e. plot all gene set statistics within (-5, 5) range. May also be NULL, i.e. plot all statistics without limit. This argument allows optimal differentiation between most gene set statistic values when extremely positive/negative values exsit and squeeze the normal-value region. |
p.limit |
numeric vector of length 1 or 2 to specify the value range of gene set -log10(p-values) to visualize using the heatmap. Values beyong will be reset to equal the proximal limit. Default to c(0.5,5.5), i.e. plot all -log10(p-values) within this range. This argument is similar to argument stat.limit. |
... |
other arguments to be passed into |
gagePipe
carries two rounds of GAGE analysis on each sample
groups for each
gene set collection specified in gsnames
: one test for
1-direction changes (up- or down-regualted gene sets), one test for
2-direction changes (two-way perturbed gene sets). Correspondingly,
the gage
result p-value matrices for the signficant gene sets are written
into two tab-delimited text files, named after the dataname
and sampnames
. Note that the text file for 1-direction changes
tests combines results for both up- and down-regulated gene
sets. By default, heatmaps in pdf format are also produced to show the gene set
perturbations using either -log10(p-value) or statistics. Meanwhile,
the full gage
analysis result objects
(named lists of p-value or statistics matrices) are saved into a .RData file. The
result objects are name after the sampnames
and
gsnames
.
The function returns invisible 1 when successfully executed.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gage
the main function for GAGE analysis;
heter.gage
GAGE analysis for heterogeneous data
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) library(gageData) data(gse16873.2) cn2=colnames(gse16873.2) hn2=grep('HN',cn2, ignore.case =TRUE) dcis2=grep('DCIS',cn2, ignore.case =TRUE) #multiple GAGE analysis in a batch with the combined data gse16873=cbind(gse16873, gse16873.2) dataname='gse16873' #output data prefix sampnames=c('dcis.1', 'dcis.2') refList=list(hn, hn2+12) sampList=list(dcis, dcis2+12) gagePipe(gse16873, gsname = "kegg.gs", dataname = "gse16873", sampnames = sampnames, ref.list = refList, samp.list = sampList, comp.list = "paired") #follow up comparison between the analyses load('gse16873.gage.RData') #list gage result objects objects(pat = "[.]p$") gageComp(sampnames, dataname, gsname = "kegg.gs", do.plot = TRUE)
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) library(gageData) data(gse16873.2) cn2=colnames(gse16873.2) hn2=grep('HN',cn2, ignore.case =TRUE) dcis2=grep('DCIS',cn2, ignore.case =TRUE) #multiple GAGE analysis in a batch with the combined data gse16873=cbind(gse16873, gse16873.2) dataname='gse16873' #output data prefix sampnames=c('dcis.1', 'dcis.2') refList=list(hn, hn2+12) sampList=list(dcis, dcis2+12) gagePipe(gse16873, gsname = "kegg.gs", dataname = "gse16873", sampnames = sampnames, ref.list = refList, samp.list = sampList, comp.list = "paired") #follow up comparison between the analyses load('gse16873.gage.RData') #list gage result objects objects(pat = "[.]p$") gageComp(sampnames, dataname, gsname = "kegg.gs", do.plot = TRUE)
This function outputs and visualizes the expression data for seleted genes. Potential output files include: a tab-delimited text file, a heatmap in PDF format, and a scatter plot in PDF format.
geneData(genes, exprs, ref = NULL, samp = NULL, outname = "array", txt = TRUE, heatmap = FALSE, scatterplot = FALSE, samp.mean = FALSE, pdf.size = c(7, 7), cols = NULL, scale = "row", limit = NULL, label.groups = TRUE, ...)
geneData(genes, exprs, ref = NULL, samp = NULL, outname = "array", txt = TRUE, heatmap = FALSE, scatterplot = FALSE, samp.mean = FALSE, pdf.size = c(7, 7), cols = NULL, scale = "row", limit = NULL, label.groups = TRUE, ...)
genes |
character, either a vector of interesting genes IDs or a 2-column
matrix, where the first column specifies gene IDs used in |
exprs |
an expression matrix or matrix-like data structure, with genes as rows and samples as columns. |
ref |
a numeric vector of column numbers for the reference condition or phenotype (i.e. the control group) in the exprs data matrix. Default ref = NULL, all columns are considered as target experiments. |
samp |
a numeric vector of column numbers for the target condition or phenotype (i.e. the experiment group) in the exprs data matrix. Default samp = NULL, all columns other than ref are considered as target experiments. |
outname |
a character string, to be used as the prefix of the output data files. Default to be "array". |
txt |
boolean, whether to output the selected gene data as a tab-delimited text file. Default to be TRUE. |
heatmap |
boolean, whether to plot heatmap for the selected gene data as a PDF file. Default to be FALSE. |
scatterplot |
boolean, whether to make scatter plot for the selected gene data as a PDF file. Default to be FALSE. |
samp.mean |
boolean, whether to take the mean of gene data over the ref and samp group when making the scatter plot. Default to be FALSE, i.e. make scatter plots for the first two ref-samp pairs and label them differently on the same graph panel. |
pdf.size |
a numeric vector to specify the the width and height of PDF graphics region in inches. Default to be c(7, 7). |
cols |
a character vector to specify colors used for the heatmap image blocks. Default to be NULL, i.e. to generate a green-red spectrum based on the gene data automatically. |
scale |
character indicating if the values should be centered and scaled in either the row direction or the column direction, or none for the heatmap. The default is "row", other options include "column" and "none". |
limit |
numeric value to specify the maximal absolute value of gene data to visualize using the heatmap. Gene data beyong will be reset to equal this value. Default to NULL, i.e. plot all gene data values. This argument allows optimal differentiation between most gene data values when extremely positive/negative values exsit and squeeze the normal-value region. Recommend limit = 3 when the gene data is scaled by row. |
label.groups |
boolean, whether to label the two sample groups, i.e. ref and samp, differently using side color bars along the heatmap area. Default to be TRUE. |
... |
other arguments to be passed into the inside |
This function integrated three most common presentation methods for gene expression data: tab-delimited text file, heatmap and scatter plot. Heatmap is ideal for visualizing relative changes with gene-wise standardized (or row-scaled) data. The heatmap is generated by calling a improved version of the heatmap.2 function from gplots package. Scatter plot is ideal for visualizing the modest or small but consistent changes over a gene set between two states under comparison.
Although geneData
is designed to be a standard-alone function,
it is frequently used in tandem with essGene
function to
present the changes of the essential genes in signficant gene sets.
The function returns invisible 1 when successfully executed.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
essGene
extract the essential member genes in a gene
set;
gage
the main function for GAGE analysis;
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) #kegg test for 1-directional changes data(kegg.gs) gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) rownames(gse16873.kegg.p$greater)[1:3] gs=unique(unlist(kegg.gs[rownames(gse16873.kegg.p$greater)[1:3]])) essData=essGene(gs, gse16873, ref =hn, samp =dcis) head(essData) ref1=1:6 samp1=7:12 #generated text file for data table, pdf files for heatmap and scatterplot for (gs in rownames(gse16873.kegg.p$greater)[1:3]) { outname = gsub(" |:|/", "_", substr(gs, 10, 100)) geneData(genes = kegg.gs[[gs]], exprs = essData, ref = ref1, samp = samp1, outname = outname, txt = TRUE, heatmap = TRUE, Colv = FALSE, Rowv = FALSE, dendrogram = "none", limit = 3, scatterplot = TRUE) }
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) #kegg test for 1-directional changes data(kegg.gs) gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) rownames(gse16873.kegg.p$greater)[1:3] gs=unique(unlist(kegg.gs[rownames(gse16873.kegg.p$greater)[1:3]])) essData=essGene(gs, gse16873, ref =hn, samp =dcis) head(essData) ref1=1:6 samp1=7:12 #generated text file for data table, pdf files for heatmap and scatterplot for (gs in rownames(gse16873.kegg.p$greater)[1:3]) { outname = gsub(" |:|/", "_", substr(gs, 10, 100)) geneData(genes = kegg.gs[[gs]], exprs = essData, ref = ref1, samp = samp1, outname = outname, txt = TRUE, heatmap = TRUE, Colv = FALSE, Rowv = FALSE, dendrogram = "none", limit = 3, scatterplot = TRUE) }
Generate up-to-date GO gene sets for specified species, which has either Bioconductor or user supplied gene annotation package.
go.gsets(species = "human", pkg.name=NULL, id.type = "eg", keep.evidence=FALSE)
go.gsets(species = "human", pkg.name=NULL, id.type = "eg", keep.evidence=FALSE)
species |
character, common name of the target species. Note that scientific name is not used here. Default species="human". For a list of other species names, type in: data(bods); bods |
pkg.name |
character, the gene annotation package name for the target species. It is either one of the Bioconductor OrgDb packages or a user supplied package with similar data structures created using AnnotationForge package. Default species=NULL, the right annotation package will be located for the specified speices in Bioconductor automatically. Otherwise, the user need to prepare and supply a usable annotation package in the same format. |
id.type |
character, target ID type for the get sets, case insensitive. Default gene.idtype="eg", i.e. Entrez Gene. Entrez Gene is the primary GO gene ID for many common model organisms, like human, mouse, rat etc. For other species may use "orf" (open reading frame) and other specific gene IDs, please type in: data(bods); bods to check the details. |
keep.evidence |
boolean, whether to keep the GO evidence codes for genes assigned to each GO term. The evidence codes describe what type of evidence was present in that reference to make the annotation. Default keep.evidence = FALSE, which removes all the evidence codes and redundant gene entries due to multiple evidence codes in a gene set. |
The updated GO gene sets are derived using Bioconductor or user supplied gene annotation packages. This way, we can create gene set data for GO analysis for 19 common species annotated in Bioconductor and more others by the users. Note that we have generated GO gene set for 4 species, human, mouse, rat and yeast, and provided the data in package gageData.
A named list with the following elements:
go.sets |
GO gene sets, a named list. Each element is a character vector of member gene IDs for a single GO. The number of elements of this list is the total number of GO terms defined for the specified species. |
go.subs |
go.subs is a named list of three elements: "BP", "CC" and "MF", corresponding to biological process, cellular component and molecular function subtrees. It may be more desirable to conduct separated GO enrichment test on each of these 3 subtrees as shown in the example code. |
go.mains |
go.mains is a named vector of three elements: "BP", "CC" and "MF", corresponding to the root node of biological process, cellular component and molecular function subtree, i.e. their corresponding indecies in go.sets. |
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
go.gs
for precompiled GO and other common gene set
data collection
#GAGE analysis use the updated GO definitions, instead of #go.gs. #the following lines are not run due to long excution time. #The data generation steop may take 10-30 seconds. You may want #save the gene set data, i.e. go.hs, for future uses. #go.hs=go.gsets(species="human") #data(gse16873) #hn=(1:6)*2-1 #dcis=(1:6)*2 #go.bp=go.hs$go.sets[go.hs$go.subs$BP] #gse16873.bp.p <- gage(gse16873, gsets = go.bp, # ref = hn, samp = dcis) #Yeast and few othr species gene Id is different from Entre Gene data(bods) bods #not run #go.sc=go.gsets("Yeast") #lapply(go.sc$go.sets[1:3], head, 3)
#GAGE analysis use the updated GO definitions, instead of #go.gs. #the following lines are not run due to long excution time. #The data generation steop may take 10-30 seconds. You may want #save the gene set data, i.e. go.hs, for future uses. #go.hs=go.gsets(species="human") #data(gse16873) #hn=(1:6)*2-1 #dcis=(1:6)*2 #go.bp=go.hs$go.sets[go.hs$go.subs$BP] #gse16873.bp.p <- gage(gse16873, gsets = go.bp, # ref = hn, samp = dcis) #Yeast and few othr species gene Id is different from Entre Gene data(bods) bods #not run #go.sc=go.gsets("Yeast") #lapply(go.sc$go.sets[1:3], head, 3)
These functions test for perturbation of gene sets relative to all genes
in the microarray data. They are the testing module for gage
and
single array analysis workflow.
They use different statistical tests: gs.tTest uses two-sample t-test, gs.zTest uses one-sample z-test, gs,KSTest uses Kolmogorov-Smirnov test.
gs.tTest(exprs, gsets, set.size = c(10, 500), same.dir = TRUE, ...) gs.zTest(exprs, gsets, set.size = c(10, 500), same.dir = TRUE, ...) gs.KSTest(exprs, gsets, set.size = c(10, 500), same.dir = TRUE, ...)
gs.tTest(exprs, gsets, set.size = c(10, 500), same.dir = TRUE, ...) gs.zTest(exprs, gsets, set.size = c(10, 500), same.dir = TRUE, ...) gs.KSTest(exprs, gsets, set.size = c(10, 500), same.dir = TRUE, ...)
exprs |
an expression matrix or matrix-like data structure, with genes as rows and samples as columns. |
gsets |
a named list, each element contains a gene set that is a character
vector of gene IDs or symbols. For example, type head(kegg.gs). A
gene set can also be a "smc" object defined in PGSEA package.
Make sure that the same gene ID system is used for both |
set.size |
gene set size (number of genes) range to be considered for enrichment test. Tests for too small or too big gene sets are not robust statistically or informative biologically. Default to be set.size = c(10, 500). |
same.dir |
whether to test for changes in a gene set toward a single direction (all genes up or down regulated) or changes towards both directions simultaneously. For experimentally derived gene sets, GO term groups, etc, coregulation is commonly the case, hence same.dir = TRUE (default); In KEGG, BioCarta pathways, genes frequently are not coregulated, hence it could be informative to let same.dir = FALSE. Although same.dir = TRUE could also be interesting for pathways. |
... |
other arguments to be passed into the secondary functions, not used currently. |
These functions are the gene set test module for gage
and
single array analysis workflow. When used in gage
function, the
function names are optional values for saaTest
argument. Check
help information for gage
for details.
These functions may also used independently without calling gage
function.
As the raw results of gene set tests, a list of 5 components is returned:
results |
matrix of test statistics, gene sets are rows, samp-ref pairs are columns |
p.results |
matrix of p-values for up-regulation (greater than) tests, gene sets are rows, samp-ref pairs are columns |
ps.results |
matrix of p-values for down-regulation (less than) tests, gene sets are rows, samp-ref pairs are columns |
mstat |
vector of test statistics mean for individual gene sets. Normally, its absoluate value measures the magnitude of gene-set level changes, and its sign indicates direction of the changes. For gs.KSTest, mstat is always positive. |
setsizes |
vector of effective set size (number of genes) individual gene sets |
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gage
the main function for GAGE analysis
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) #kegg test exprs.gage = gagePrep(gse16873, ref = hn, samp = dcis) str(exprs.gage) rawRes = gs.tTest(exprs.gage, gsets = kegg.gs) str(rawRes) head(rawRes$results) head(rawRes$p.results)
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) #kegg test exprs.gage = gagePrep(gse16873, ref = hn, samp = dcis) str(exprs.gage) rawRes = gs.tTest(exprs.gage, gsets = kegg.gs) str(rawRes) head(rawRes$results) head(rawRes$p.results)
GSE16873 is a breast cancer study (Emery et al, 2009) downloaded from Gene Expression Omnibus (GEO). GSE16873 covers twelve patient cases, each with HN (histologically normal), ADH (ductal hyperplasia), and DCIS (ductal carcinoma in situ) RMA samples. Due to the size limit of this package, we split this GSE16873 into two halves, each including 6 patients with their HN and DCIS but not ADH tissue types. This gage package only includes the first half dataset for 6 patients as this example dataset gse16873. Most of our demo analyses are done on the first half dataset, except for the advanced analysis where we use both halves datasets with all 12 patients. Details section below gives more information.
data(gse16873)
data(gse16873)
Raw data for these two half datasets were processed separately using two different methods, FARMS and RMA, respectively to generate the non-biological data heterogeneity. The first half dataset is named as gse16873, the second half dataset named gse16873.2. We also have the full dataset, gse16873.full, which includes all HN, ADH and DCIS samples of all 12 patients, processed together using FARMS. The second half dataset plus the full dataset and the original BMP6 dataset used in GAGE paper and earlier versions of gage is supplied with another package, gageData.
GEO Dataset GSE16873: <URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16873>
Emery LA, Tripathi A, King C, Kavanah M, Mendez J, Stone MD, de las Morenas A, Sebastiani P, Rosenberg CL: Early dysregulation of cell adhesion and extracellular matrix pathways in breast cancer progression. Am J Pathol 2009, 175:1292-302.
data(gse16873) #check the heterogenity of the two half datasets boxplot(data.frame(gse16873)) #column/smaple names cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) adh=grep('ADH',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) print(hn) print(dcis) data(kegg.gs) lapply(kegg.gs[1:3],head) head(rownames(gse16873)) gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis)
data(gse16873) #check the heterogenity of the two half datasets boxplot(data.frame(gse16873)) #column/smaple names cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) adh=grep('ADH',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) print(hn) print(dcis) data(kegg.gs) lapply(kegg.gs[1:3],head) head(rownames(gse16873)) gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis)
heter.gage
is a wrapper function of gage
for heterogeneous
data. pairData
prepares the heterogeneous data and related
arguments for GAGE analysis.
heter.gage(exprs, gsets, ref.list, samp.list, comp.list = "paired", use.fold = TRUE, ...) pairData(exprs, ref.list, samp.list, comp.list = "paired", use.fold = TRUE, ...)
heter.gage(exprs, gsets, ref.list, samp.list, comp.list = "paired", use.fold = TRUE, ...) pairData(exprs, ref.list, samp.list, comp.list = "paired", use.fold = TRUE, ...)
exprs |
an expression matrix or matrix-like data structure, with genes as rows and samples as columns. |
gsets |
a named list, each element contains a gene set that is a character
vector of gene IDs or symbols. For example, type head(kegg.gs). A
gene set can also be a "smc" object defined in PGSEA package.
Make sure that the same gene ID system is used for both |
ref.list |
a list of |
samp.list |
a list of |
comp.list |
a list or a vector of |
use.fold |
Boolean, whether to use fold changes or t-test statistics as per gene statistics. Default use.fold= TRUE. |
... |
other arguments to be passed into |
comp.list can be a list or vector of mixture values of 'paired' and
'unpaired' matching the experiment layouts of the heterogeneous data. In
such cases, each ref-samp pairs and corresponding columns in the result
data matrix after calling pairData
are assigned different weights
when calling gage
in the next step.
The inclusion of '1ongroup' and 'as.group' in comp.list would make
weight assignment very complicated especially when the sample sizes are
different for the individual experiments of the heterogeneous data.
The output of pairData
is a list of 2 elements:
exprs |
a data matrix derived from the input expression data matrix
|
weights |
weights assigned to columns of the output data matrix
|
The result returned by heter.gage
function is the same as
result of gage
, i.e. either a single data matrix
(same.dir = FALSE, test for two-directional changes) or
a named list of two data matrix (same.dir = TRUE, test for single-direction
changes) for the results of up- ($greater) and down- ($less) regulated
gene sets. Check help information for gage
for details.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gage
the main function for GAGE analysis;
gagePipe
pipeline for multiple GAGE analysis in a batch
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) library(gageData) data(gse16873.2) cn2=colnames(gse16873.2) hn2=grep('HN',cn2, ignore.case =TRUE) dcis2=grep('DCIS',cn2, ignore.case =TRUE) #combined the two half dataset gse16873=cbind(gse16873, gse16873.2) refList=list(hn, hn2+12) sampList=list(dcis, dcis2+12) #quick look at the heterogeneity of the combined data summary(gse16873[,hn[c(1:2,7:8)]]) #if graphic devices open: #boxplot(data.frame(gse16873)) gse16873.kegg.heter.p <- heter.gage(gse16873, gsets = kegg.gs, ref.list = refList, samp.list = sampList) gse16873.kegg.heter.2d.p <- heter.gage(gse16873, gsets = kegg.gs, ref.list = refList, samp.list = sampList, same.dir = FALSE) str(gse16873.kegg.heter.p) head(gse16873.kegg.heter.p$greater[, 1:5])
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) library(gageData) data(gse16873.2) cn2=colnames(gse16873.2) hn2=grep('HN',cn2, ignore.case =TRUE) dcis2=grep('DCIS',cn2, ignore.case =TRUE) #combined the two half dataset gse16873=cbind(gse16873, gse16873.2) refList=list(hn, hn2+12) sampList=list(dcis, dcis2+12) #quick look at the heterogeneity of the combined data summary(gse16873[,hn[c(1:2,7:8)]]) #if graphic devices open: #boxplot(data.frame(gse16873)) gse16873.kegg.heter.p <- heter.gage(gse16873, gsets = kegg.gs, ref.list = refList, samp.list = sampList) gse16873.kegg.heter.2d.p <- heter.gage(gse16873, gsets = kegg.gs, ref.list = refList, samp.list = sampList, same.dir = FALSE) str(gse16873.kegg.heter.p) head(gse16873.kegg.heter.p$greater[, 1:5])
The gene set data collections derived from KEGG, GO and BioCarta databases.
data(kegg.gs) data(kegg.gs.dise) data(go.gs) data(carta.gs)
data(kegg.gs) data(kegg.gs.dise) data(go.gs) data(carta.gs)
kegg.gs is a named list of 177 elements. Each element is a character
vector of member gene Entrez IDs for a single KEGG pathway. Type
head(kegg.gs, 3)
for the first 3 gene sets or pathways. Note
that kegg.gs has been updated since gage version 2.9.1. From then,
kegg.gs only include the subset of canonical signaling and metabolic
pathways from KEGG pathway database, and kegg.gs.dise is the subset of
disease pathways. And it is recommended to do KEGG pathway analysis
with either kegg.gs or kegg.gs.dise seperately (rather than combined
altogether) for better defined results. In the near feature, we will
also generate subsets of go.gs for refined analysis. Note that kegg.gs
in gageData package still keeps all KEGG pathways, where kegg.gs.sigmet
and kegg.gs.dise are two subsets of kegg.gs.
go.gs is a named list of 1000 elements in this package. It is
a truncated list in this package. The ful list of go.gs has
~10000 elements and is provided with an experimental data package
gageData. Each element is a character
vector of member gene Entrez IDs for a single Gene Ontology term. Type
head(go.gs, 3)
for the first 3 gene sets or GO terms.
carta.gs is a named list of 259 elements. Each element is a character
vector of member gene Entrez IDs for a single BioCarta pathway. Type
head(carta.gs, 3)
for the first 3 gene sets or pathways.
khier is a matrix of 442 rows and 3 columns. This records the category and subcategory assignments of all KEGG pathways. The data comes from KEGG BRITE Database. This data is used by kegg.gsets function to subset the KEGG pathways for more specific pathway analysis.
korg is a character matrix of ~3000 rows and 6 columns. First 3 columns are KEGG species code, scientific name and common name, followed columns on gene ID types used for each species: entrez.gnodes ("1" or "0", whether EntrezGene is the default gene ID) and representative KEGG gene ID and EntrezGene ID. This data comes from pathview package, and is used by kegg.gsets function internally.
bods is a character matrix of 19 rows and 4 columns on the mapping between gene annotation package names in Bioconductor, common name and KEGG code of most common research species. This data comes from pathview package, and is used by go.gsets function internally.
These gene set data were compiled using Entrez Gene IDs, gene set names and mapping information from multiple Bioconductor packages, including: org.Hs.eg.db, kegg.db, go.db and cMAP. Please check the corresponding packages for more information.
We only provide gene set data for human with this package. For other species, please check the experiment data package list of Bioconductor website or use the Bioconductor package GSEABase to build the users' own gene set collections.
Data from multiple Bioconductor packages, including: org.Hs.eg.db, kegg.db, go.db and cMAP.
Entrez Gene <URL: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene> KEGG pathways <URL: ftp://ftp.genome.ad.jp/pub/kegg/pathways> Gene Ontology <URL: http://www.geneontology.org/> cMAP <URL: http://cmap.nci.nih.gov/PW>
#load expression and gene set data data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) data(go.gs) #make sure the gene IDs are the same for expression data and gene set #data lapply(kegg.gs[1:3],head) lapply(go.gs[1:3],head) head(rownames(gse16873)) #GAGE analysis gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) gse16873.go.p <- gage(gse16873, gsets = go.gs, ref = hn, samp = dcis)
#load expression and gene set data data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) data(go.gs) #make sure the gene IDs are the same for expression data and gene set #data lapply(kegg.gs[1:3],head) lapply(go.gs[1:3],head) head(rownames(gse16873)) #GAGE analysis gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) gse16873.go.p <- gage(gse16873, gsets = go.gs, ref = hn, samp = dcis)
Generate up-to-date KEGG pathway gene sets for any specified KEGG species.
kegg.gsets(species = "hsa", id.type = "kegg", check.new=FALSE)
kegg.gsets(species = "hsa", id.type = "kegg", check.new=FALSE)
species |
character, either the kegg code, scientific name or the common name of the target species. This applies to both pathway and gene.data or cpd.data. When KEGG ortholog pathway is considered, species="ko". Default species="hsa", it is equivalent to use either "Homo sapiens" (scientific name) or "human" (common name). |
id.type |
character, desired ID type for the get sets, case insensitive. Default idtype="kegg", i.e. the primary KEGG gene ID. The other valid option is "entrez", i.e. Entrez Gene. Entrez Gene is the primary KEGG gene ID for many common model organisms, like human, mouse, rat etc, hence these two options have the same effect. For other species, primary KEGG gene ID is not Entrez Gene. |
check.new |
logical, whether to check for any newly added reference pathways online. Default to FALSE. Because such online checking takes time and new reference pathways are rarely added, it is usually not suggested to set this argument TRUE, unless this is really needed. |
The latest KEGG pathway gene sets are derived by connecting to the database in real time. This way, we can create high quality gene set data for pathway analysis for over 2400 KEGG species.
Note that we have generated GO gene set for 4 species, human, mouse, rat, yeast as well as KEGG Ortholog, and provided the data in package gageData.
A named list with the following elements:
kg.sets |
KEGG gene sets, a named list. Each element is a character vector of member gene IDs for a single KEGG pathway. The number of elements of this list is the total number of KEGG pathways defined for the specified species. |
sigmet.idx |
integer indice, which elements in kg.sets are signaling or metabolism pathways. |
sig.idx |
integer indice, which elements in kg.sets are signaling pathways. |
met.idx |
integer indice, which elements in kg.sets are metabolism pathways. |
dise.idx |
integer indice, which elements in kg.sets are disease pathways. |
The *.idx elements here are all used to subset kg.sets for more specific type pathway aanlysis.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
kegg.gs
for precompiled KEGG and other common gene set
data collection
#GAGE analysis use the latest KEGG pathway definitions, instead of #kegg.gs kg.hsa=kegg.gsets() data(gse16873) hn=(1:6)*2-1 dcis=(1:6)*2 kegg.sigmet=kg.hsa$kg.sets[kg.hsa$sigmet.idx] gse16873.kegg.p <- gage(gse16873, gsets = kegg.sigmet, ref = hn, samp = dcis) #E coli KEGG Id is different from Entre Gene kg.eco=kegg.gsets("eco") kg.eco.eg=kegg.gsets("eco", id.type="entrez") head(kg.eco$kg.sets,2) head(kg.eco.eg$kg.sets,2)
#GAGE analysis use the latest KEGG pathway definitions, instead of #kegg.gs kg.hsa=kegg.gsets() data(gse16873) hn=(1:6)*2-1 dcis=(1:6)*2 kegg.sigmet=kg.hsa$kg.sets[kg.hsa$sigmet.idx] gse16873.kegg.p <- gage(gse16873, gsets = kegg.sigmet, ref = hn, samp = dcis) #E coli KEGG Id is different from Entre Gene kg.eco=kegg.gsets("eco") kg.eco.eg=kegg.gsets("eco", id.type="entrez") head(kg.eco$kg.sets,2) head(kg.eco.eg$kg.sets,2)
This is a wrapper function of read.delim
for reading in
expression data matrix in tab-delimited format.
readExpData(file = "arrayData.txt", ...)
readExpData(file = "arrayData.txt", ...)
file |
character string, the full path name to the expression data file in tab-delimited format. Rows are genes, columns are array samples. |
... |
other arguments to be passed into |
readExpData
is a wrapper function of read.delim
. Please
check help information of read.delim
for more details.
A data.frame (matrix-like) of gene expression data. Rows are genes, columns are array samples.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
readList
read in gene set list
filename=system.file("extdata/gse16873.demo", package = "gage") demo.data=readExpData(filename, row.names=1) head(demo.data)
filename=system.file("extdata/gse16873.demo", package = "gage") demo.data=readExpData(filename, row.names=1) head(demo.data)
This function reads in gene set data in GMT (.gmt) format as a named list. GMT is defined originally by GSEA program. The code may be slightly revised for reading in gene set data in other tab-delimited formats too.
readList(file)
readList(file)
file |
character string, the full path name to the gene set data file in GMT format. |
A named list, each element is a character vector giving the gene IDs of a gene set.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
readExpData
read in gene expression data
#an example GMT gene set data derived from MSigDB data filename=system.file("extdata/c2.demo.gmt", package = "gage") demo.gs=readList(filename) demo.gs[1:3] #to use these gene sets with gse16873, need to convert the gene symbols #to Entrez IDs first data(egSymb) demo.gs.sym<-lapply(demo.gs, sym2eg) demo.gs.sym[1:3]
#an example GMT gene set data derived from MSigDB data filename=system.file("extdata/c2.demo.gmt", package = "gage") demo.gs=readList(filename) demo.gs[1:3] #to use these gene sets with gse16873, need to convert the gene symbols #to Entrez IDs first data(egSymb) demo.gs.sym<-lapply(demo.gs, sym2eg) demo.gs.sym[1:3]
This function sorts and counts signcant gene sets based on q- or p-value cutoff.
sigGeneSet(setp, cutoff = 0.1, dualSig = (0:2)[2], qpval = c("q.val", "p.val")[1],heatmap=TRUE, outname="array", pdf.size = c(7,7), p.limit=c(0.5, 5.5), stat.limit=5, ...)
sigGeneSet(setp, cutoff = 0.1, dualSig = (0:2)[2], qpval = c("q.val", "p.val")[1],heatmap=TRUE, outname="array", pdf.size = c(7,7), p.limit=c(0.5, 5.5), stat.limit=5, ...)
setp |
the result object returned by |
cutoff |
numeric, q- or p-value cutoff, between 0 and 1. Default 0.1 (for q-value). When p-value is used, recommended cutoff value is 0.001 for data with more than 2 replicates per condition or 0.01 for les sample sizes. |
dualSig |
integer, switch argument controlling how dual-signficant gene sets should be treated. This argument is only useful when Stouffer method is not used in gage function (use.stouffer=FALSE), hence makes no difference normally. 0: discard such gene sets from the final significant gene set list; 1: keep such gene sets in the more significant direction and remove them from the less significant direction; 2: keep such gene sets in the lists for both directions. default to 1. Dual-signficant means a gene set is called significant simultaneously in both 1-direction tests (up- and down-regulated). Check the details for more information. |
qpval |
character, specifies the column name used for gene set selection, i.e. what type of q- or p-value to use in gene set selection. Default to be "q.val" (q-value using BH procedure). "p.val" is the unadjusted global p-value and may be used as selection criterion sometimes. |
heatmap |
boolean, whether to plot heatmap for the selected gene data as a PDF file. Default to be FALSE. |
outname |
a character string, to be used as the prefix of the output data files. Default to be "array". |
pdf.size |
a numeric vector to specify the the width and height of PDF graphics region in inches. Default to be c(7, 7). |
stat.limit |
numeric vector of length 1 or 2 to specify the value range of gene set statistics to visualize using the heatmap. Statistics beyong will be reset to equal the proximal limit. Default to 5, i.e. plot all gene set statistics within (-5, 5) range. May also be NULL, i.e. plot all statistics without limit. This argument allows optimal differentiation between most gene set statistic values when extremely positive/negative values exsit and squeeze the normal-value region. |
p.limit |
numeric vector of length 1 or 2 to specify the value range of gene set -log10(p-values) to visualize using the heatmap. Values beyong will be reset to equal the proximal limit. Default to c(0.5,5.5), i.e. plot all -log10(p-values) within this range. This argument is similar to argument stat.limit. |
... |
other arguments to be passed into the inside |
By default, heatmaps are produced to show the gene set perturbations using either -log10(p-value) or statistics.
Since gage package version 2.2.0, Stouffer's method is used as the default procedure for more robust p-value summarization. With the original p-value summarization, i.e. negative log sum following a Gamma distribution as the Null hypothesis, the global p-value could be heavily affected by a small subset of extremely small individual p-values from pair-wise comparisons. Such sensitive global p-value leads to the "dual signficance" phenomenon. In other words, Gene sets are signficantly up-regulated in a subset of experiments, but down-regulated in another subset. Note that dual-signficant gene sets are not the same as gene sets called signficant in 2-directional tests, although they are related.
sigGeneSet
function returns a named list of the same structure
as gage
result. Check gage
help information for
details.
Weijun Luo <[email protected]>
Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161
gage
the main function for GAGE analysis;
esset.grp
non-redundant signcant gene set list;
essGene
essential member genes in a gene set;
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) #kegg test for 1-directional changes gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #kegg test for 2-directional changes gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = FALSE) gse16873.kegg.sig<-sigGeneSet(gse16873.kegg.p, outname="gse16873.kegg") str(gse16873.kegg.sig) gse16873.kegg.2d.sig<-sigGeneSet(gse16873.kegg.2d.p, outname="gse16873.kegg") str(gse16873.kegg.2d.sig) #also check the heatmaps in pdf files named "*.heatmap.pdf".
data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) #kegg test for 1-directional changes gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #kegg test for 2-directional changes gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = FALSE) gse16873.kegg.sig<-sigGeneSet(gse16873.kegg.p, outname="gse16873.kegg") str(gse16873.kegg.sig) gse16873.kegg.2d.sig<-sigGeneSet(gse16873.kegg.2d.p, outname="gse16873.kegg") str(gse16873.kegg.2d.sig) #also check the heatmaps in pdf files named "*.heatmap.pdf".