Title: | Annotation-Driven Clustering |
---|---|
Description: | This package implements clustering of microarray gene expression profiles according to functional annotations. For each term genes are annotated to, splits into two subclasses are computed and a significance of the supporting gene set is determined. |
Authors: | Claudio Lottaz, Joern Toedling |
Maintainer: | Claudio Lottaz <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.77.0 |
Built: | 2024-11-29 03:25:13 UTC |
Source: | https://github.com/bioc/adSplit |
This function searches for annotation-driven splits of patients in microarray data, where a split is a partitioning of patients into two groups. To find candidate splits it refers to GO terms and KEGG pathways. DLD-scores are used to judge the quality of a split. In addition, a significance measure can be computed by simulating a random distribution of scores.
adSplit(mydata, annotation.ids, chip.name, min.probes = 20, max.probes = NULL, B = NULL, min.group.size = 5, ngenes = 50, ignore.genes = 5)
mydata |
either an expression set as defined by the package
|
annotation.ids |
a vector of GO or KEGG identifiers in the form "GO:..." or "KEGG:..." respectively. The prefix "KEGG:" is removed from the KEGG-identifiers before accessing the chip's "...PATH2PROBES" hash. |
chip.name |
the name of the chip by which the expression set is
measured. |
min.probes |
annotation identifiers with fewer than this number of associated genes are skipped. |
max.probes |
annotation identifiers with more than this number of associated genes are skipped. The default is ten percent of the genes on the chip. |
B |
the number of random gene set samplings to be performed to compute empirical p-values. |
min.group.size |
filter criterion to avoid splits suggesting tiny groups. Splits where one of the two suggested groups is smaller than this number are removed from the split set. |
ngenes |
number of genes used to compute DLD scores. |
ignore.genes |
number of best scoring genes to be ignored when computing DLD scores. |
This function applies the same splitting procedure to all annotation identifiers provided. First, the genes associated with one identifier are determined and extracted from the expression data. Then the diana2means function is applied to the restricted data, and the different splits generated are collected into a single splitSet object.
Valid annotation identifiers are vectors of identifiers of the form KEGG:nnnnn and GO:nnnnnn. In addition, the keywords "KEGG", "GO" and "all" are allowed, representing all terms in the corresponding ontology.
If B is set to an integer, that number of samplings is used to generate a null distribution of DLD-scores. This distribution is used to compute an empirical p-value for each split. If more than one valid split is found, multiple testing is corrected for by applying Benjamini-Hochberg's correction from the multtest package.
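The following base-R sketch illustrates the idea of empirical p-values followed by a Benjamini-Hochberg adjustment. It is not adSplit's internal code: the scores are simulated, the term names are hypothetical, `p.adjust` stands in for the multtest routine, and it assumes that larger DLD-scores indicate better splits.

```r
# Toy illustration with simulated numbers; not adSplit's internal code.
set.seed(1)

# Hypothetical null distribution of DLD-scores from B = 1000 random gene sets.
B <- 1000
null_scores <- rnorm(B, mean = 5, sd = 2)

# Hypothetical observed DLD-scores for three annotation-driven splits.
observed <- c(termA = 12.0, termB = 6.5, termC = 4.0)

# Empirical p-value: fraction of null scores at least as large as the
# observed score (assumes larger DLD-scores indicate better splits).
emp_p <- sapply(observed, function(s) mean(null_scores >= s))

# Benjamini-Hochberg adjustment using base R's p.adjust().
q_values <- p.adjust(emp_p, method = "BH")

print(cbind(score = observed, pvalue = emp_p, qvalue = q_values))
```

A larger B gives a finer resolution of the empirical p-values, at the cost of more random samplings.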
Returns an object of class splitSet with the following list elements:
cuts |
a matrix of split attributions. One row per annotation identifier (GO term or KEGG pathway) for which a split has been generated. One column per object in the dataset. |
score |
one score per generated split. |
pvalue |
one empirical p-value per generated split, or |
qvalue |
one q-value computed according to Benjamini-Hochberg's correction for multiple testing per generated split, or |
Claudio Lottaz, Joern Toedling
diana2means, randomDiana2means, image.splitSet
# prepare data
library(golubEsets)
data(Golub_Merge)
# generate annotation-driven splits for apoptosis and signal transduction
x <- adSplit(Golub_Merge, "GO:0006915", "hu6800")
x <- adSplit(Golub_Merge, c("GO:0007165","GO:0006915"), "hu6800", max.probes=7000)
# generate a split for alanine, aspartate and glutamate metabolism including
# an empirical p-value
x <- adSplit(Golub_Merge, "KEGG:00250", "hu6800", B=100)
# generate splits for all KEGG pathways.
x <- adSplit(Golub_Merge, "KEGG", "hu6800")
image(x)
Split a set of data points into two coherent groups using the k-means algorithm. Instead of random initialization, divisive hierarchical clustering is used to determine initial groups and the corresponding centroids.
diana2means(mydata, mingroupsize = 5, ngenes = 50, ignore.genes = 5, return.cut = FALSE)
mydata |
either an expression set as defined by the package
|
mingroupsize |
report only splits where both groups are larger than this size. |
ngenes |
number of genes used to compute cluster quality DLD-score. |
ignore.genes |
number of best scoring genes to be ignored when computing DLD-scores. |
return.cut |
logical, whether to return the attributions of samples to groups. |
This function uses divisive hierarchical clustering (diana) to generate a first split of the data. Each column of the data matrix is considered to represent a data element. From the tentative groups thus generated, centroids are derived and used to initialize the k-means clustering algorithm.
For the split optimized by k-means the DLD-score is determined using the ngenes and ignore.genes arguments.
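The initialization described above can be sketched as follows. This is a minimal illustration on simulated data, not the package's implementation: it assumes `diana` from the cluster package and `kmeans` from stats, the variable names are hypothetical, and the DLD-score computation is left out because its formula is not given here.

```r
library(cluster)  # provides diana()

set.seed(42)
# Toy data: 40 genes (rows) x 12 samples (columns), two shifted sample groups.
x <- matrix(rnorm(40 * 12), nrow = 40)
x[, 7:12] <- x[, 7:12] + 2

# Samples are the objects to cluster, so work on the transposed matrix.
samples <- t(x)

# Step 1: divisive hierarchical clustering, cut into two tentative groups.
dv <- diana(samples)
init <- cutree(as.hclust(dv), k = 2)

# Step 2: centroids of the tentative groups, one row per group.
centers <- rbind(colMeans(samples[init == 1, , drop = FALSE]),
                 colMeans(samples[init == 2, , drop = FALSE]))

# Step 3: k-means started from those centroids instead of random ones.
km <- kmeans(samples, centers = centers)
grp <- km$cluster - 1  # recode the two groups as 0/1
print(table(grp))
```

Starting k-means from the diana centroids makes the result deterministic for a given data matrix, which random initialization would not guarantee.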
If the logical return.cut is set to FALSE (the default), a single number representing the DLD-score for the generated split is returned. Otherwise an object of class split containing the following elements is returned:
cut |
one number, either 0 or 1, per column in the original data, specifying the split attribution. |
score |
the DLD-score achieved by the split. |
Joern Toedling, Claudio Lottaz
# get golub data
library(vsn)
library(golubEsets)
data(Golub_Merge)
# use 10% most variable genes
e <- exprs(Golub_Merge)
vars <- apply(e, 1, var)
e <- e[vars > quantile(vars, 0.9), ]
# use diana2means to get splits and scores
diana2means(e)
diana2means(e, return.cut=TRUE)
This function draws a given number of probe-sets randomly, such that probe-sets referring to the same gene are either included or excluded as a whole.
drawRandomPS(nps, EID2PSenv, allEIDs)
nps |
number of probe-sets to be drawn. |
EID2PSenv |
a hash mapping EntrezGene identifiers to probe-set identifiers. |
allEIDs |
vector of all EntrezGene identifiers represented on a chip. |
A named vector of probe-set identifiers. The names correspond to the EntrezGene identifiers.
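One way to draw probe-sets gene-wise is sketched below in base R. The helper `draw_gene_wise` and the toy mapping are hypothetical and only illustrate the "whole gene in or out" constraint; how drawRandomPS itself handles the exact target count is not specified here, and this sketch may return slightly more than `nps` probe-sets.

```r
# Hypothetical helper; a sketch of gene-wise sampling, not adSplit's code.
draw_gene_wise <- function(nps, eid2ps, all_eids) {
  drawn <- character(0)
  # Visit genes in random order and add each gene's probe-sets as a block.
  for (eid in sample(all_eids)) {
    if (length(drawn) >= nps) break
    ps <- get(eid, envir = eid2ps)
    names(ps) <- rep(eid, length(ps))  # name each probe-set by its gene
    drawn <- c(drawn, ps)
  }
  drawn
}

# Toy mapping from EntrezGene identifiers to probe-set identifiers.
eid2ps <- new.env()
assign("1001", c("ps_a", "ps_b"), envir = eid2ps)
assign("1002", "ps_c", envir = eid2ps)
assign("1003", c("ps_d", "ps_e", "ps_f"), envir = eid2ps)

set.seed(7)
res <- draw_gene_wise(3, eid2ps, ls(eid2ps))
print(res)
```

Sampling genes rather than individual probe-sets keeps redundant probe-sets of one gene from being counted as independent evidence in the random gene sets.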
Claudio Lottaz
# draw ten random probe-sets from hu6800
library(hu6800.db)
EID2PSenv <- makeEID2PROBESenv(hu6800ENTREZID)
drawRandomPS(10, EID2PSenv, ls(EID2PSenv))
This is a data object precomputed by adSplit for illustration.
data(golubKEGGSplits)
Annotation-driven split set holds 70 splits on 72 elements, scores range is: 3.382672 17.31385, empirical p-values range is: 0.005 0.955, q-value range is: 0.1633333 0.955.
This object is generated by the following call:
golubKEGGSplits <- adSplit(golubNorm, "KEGG", "hu6800", B=1000)
where golubNorm is a normalized version of Golub_Merge from the golubEsets package.
data(golubKEGGSplits)
Draws a histogram of empirical p-values and shows the corresponding q-values corrected for multiple testing.
## S3 method for class 'splitSet'
hist(x, main = "Distribution of p-Values", xlab = "p-values", col = "grey", xlim = c(0, 1), ...)
x |
object of type |
main |
main title of the histogram. |
xlab |
legend for the x-axis. |
col |
color for the histogram bars. |
xlim |
limits for the x-axis (p-values). |
... |
further parameters passed on to the default |
This function draws a regular histogram of empirical p-values observed in the splitSet at hand. The corresponding q-values, corrected by the method suggested by Benjamini-Hochberg, are plotted into the same graph. The scale for the q-values is shown at the left hand side of the plot.
Claudio Lottaz
data(golubKEGGSplits)
hist(golubKEGGSplits, col="red")
Draws an image of all splits, one per row, of a splitSet object. Each column corresponds to a patient.
## S3 method for class 'splitSet'
image(x, filter.fdr = 1, main = "", max.label.length = 50, full.names = TRUE,
      xlab = NULL, sample.labels = FALSE, col = c("yellow", "red"), invert = FALSE,
      outfile = NULL, res = 72, pointsize = 7, ...)
x |
the object of class |
filter.fdr |
worst acceptable false discovery rate for the shown set of splits. All splits with q-values above this level are dropped from the image. |
main |
a title for the image. |
max.label.length |
Maximal length of the annotations shown to the right of the image. Longer annotations are truncated. |
full.names |
Show full names for annotations instead of their identifiers only. |
xlab |
additional annotation on the x-axis. |
sample.labels |
whether names of samples are to be shown on the x-axis. |
col |
two strings encoding the colors to be used to illustrate to which group a sample is attributed. |
invert |
whether to draw in white on black background. |
outfile |
the filename on which to draw the image in postscript
format. The default is |
res |
resolution for bitmap output on postscript. |
pointsize |
size of font. |
... |
further arguments passed to |
The set of splits given is illustrated as an image. Each row corresponds to an annotation, each column to a patient. In position (x,y), the association of patient x to a group with respect to annotation y is coded as colors (yellow and red by default). The image is ordered by hierarchical clustering such that similar patients and similar splits are brought closer together.
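The ordering step can be illustrated with a small base-R sketch. The split matrix is simulated, and the choice of Manhattan distance with `hclust` is an assumption made for this example; the distance and linkage actually used by image.splitSet are not stated here. The sketch also ignores that the 0/1 labels of a split are arbitrary.

```r
set.seed(3)
# Toy split matrix: 5 annotations (rows) x 8 patients (columns), entries 0/1.
cuts <- matrix(rbinom(5 * 8, 1, 0.5), nrow = 5,
               dimnames = list(paste0("term", 1:5), paste0("pat", 1:8)))

# Order rows and columns by hierarchical clustering so that similar
# splits and similar patients end up next to each other.
row_ord <- hclust(dist(cuts, method = "manhattan"))$order
col_ord <- hclust(dist(t(cuts), method = "manhattan"))$order
ordered <- cuts[row_ord, col_ord]

# image() maps the rows of z to the x-axis, so transpose to put
# patients on x and annotations on y.
image(x = seq_len(ncol(ordered)), y = seq_len(nrow(ordered)), z = t(ordered),
      col = c("yellow", "red"), axes = FALSE,
      xlab = "patients", ylab = "annotations")
```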
Always returns NULL.
Claudio Lottaz
data(golubKEGGSplits)
image(golubKEGGSplits, filter.fdr=0.5)
Make hash containing probe-sets per EntrezGene identifier.
makeEID2PROBESenv(EIDenv)
EIDenv |
an environment containing one entry per probe-set holding all corresponding EntrezGene identifiers. |
An environment containing one entry per EntrezGene identifier holding all corresponding probe-sets.
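Inverting such a mapping can be sketched with plain R environments. The helper `invert_mapping` and the toy identifiers are hypothetical, and real annotation objects such as hu6800ENTREZID are not plain environments, so this only illustrates the direction of the transformation.

```r
# Hypothetical helper; sketch of inverting a probe-set -> gene mapping.
invert_mapping <- function(ps2eid) {
  eid2ps <- new.env()
  for (ps in ls(ps2eid)) {
    for (eid in get(ps, envir = ps2eid)) {
      if (is.na(eid)) next  # skip probe-sets without a gene annotation
      old <- if (exists(eid, envir = eid2ps, inherits = FALSE)) {
        get(eid, envir = eid2ps)
      } else {
        character(0)
      }
      assign(eid, c(old, ps), envir = eid2ps)
    }
  }
  eid2ps
}

# Toy input: one entry per probe-set holding its EntrezGene identifiers.
ps2eid <- new.env()
assign("ps_a", "1001", envir = ps2eid)
assign("ps_b", "1001", envir = ps2eid)
assign("ps_c", c("1002", "1003"), envir = ps2eid)
assign("ps_d", NA_character_, envir = ps2eid)

eid2ps <- invert_mapping(ps2eid)
print(mget(ls(eid2ps), envir = eid2ps))
```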
Joern Toedling, Claudio Lottaz
library(hu6800.db)
makeEID2PROBESenv(hu6800ENTREZID)
Draws a number of random sets of probe-sets of the required size and applies diana2means to each to compute DLD scores.
randomDiana2means(nprobes, data, chip, ndraws = 10000, ngenes = 50, ignore.genes = 5)
nprobes |
the size of gene sets. |
data |
a matrix of expression data, rows correspond to genes, columns to samples. |
chip |
the name of the used chip. |
ndraws |
the number of DLD scores computed. |
ngenes |
the number of genes used to compute DLD scores (passed
to |
ignore.genes |
the number of best scoring genes to be ignored
when computing DLD scores (passed to |
This function uses drawRandomPS to draw ndraws gene sets. On these it applies diana2means to determine a null-distribution of DLD-scores.
A vector of DLD-scores.
Joern Toedling, Claudio Lottaz
# prepare data
library(vsn)
library(golubEsets)
data(Golub_Merge)
# generate DLD scores
scores <- randomDiana2means(20, exprs(Golub_Merge), "hu6800", ndraws = 500)