Title: | Infers biological signatures from gene expression data |
---|---|
Description: | Pigengene package provides an efficient way to infer biological signatures from gene expression profiles. The signatures are independent from the underlying platform, e.g., the input can be microarray or RNA Seq data. It can even infer the signatures using data from one platform, and evaluate them on the other. Pigengene identifies the modules (clusters) of highly coexpressed genes using coexpression network analysis, summarizes the biological information of each module in an eigengene, learns a Bayesian network that models the probabilistic dependencies between modules, and builds a decision tree based on the expression of eigengenes. |
Authors: | Habil Zare, Amir Foroushani, Rupesh Agrahari, Meghan Short, Isha Mehta, Neda Emami, and Sogand Sajedi |
Maintainer: | Habil Zare <[email protected]> |
License: | GPL (>=2) |
Version: | 1.33.0 |
Built: | 2024-12-18 04:00:43 UTC |
Source: | https://github.com/bioc/Pigengene |
Pigengene identifies gene modules (clusters), computes an eigengene for each module, and uses these biological signatures as features for classification. The resulting biological signatures are very robust with respect to the profiling platform. For instance, if Pigenegene computes a biological signature using a microarray dataset, it can infer the same signature in an RNA Seq dataset such that it is directly comparable across the two datasets.
Package: | Pigengene |
Type: | Package |
Version: | 0.99.0 |
Date: | 2016-04-25 |
License: | GPL (>= 2) |
The main function is one.step.pigengene
which requires a gene
expression profile and the corresponding conditions (types).
Individual functions are provided to facilitate running the pipeline in a
customized way. Also, the inferred biological signatures (computed eigengenes)
are useful for other supervised or unsupervised analyses.
In most functions of this package, eigenegenes are computed or used as robust biological signatures. Briefly, each eigengene is a weighted average of the expression of all genes in a module (cluster), where the weights are adjusted in a way that the explained variance is maximized.
Amir Foroushani, Habil Zare, and Rupesh Agrahari
Maintainer: Habil Zare <[email protected]>
Foroushani, Amir, et al. "Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia: an introduction to the Pigengene package and its applications." BMC medical genomics 10.1 (2017): 1-15.
Pigengene-package
,
one.step.pigengene
, compute.pigengene
,
project.eigen
,
WGCNA::blockwiseModules
data(aml) data(mds) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p1 <- one.step.pigengene(Data=d1,saveDir='pigengene', bnNum=10, verbose=1, seed=1, Labels=Labels, toCompact=FALSE, doHeat=FALSE) plot(p1$c5treeRes$c5Trees[["34"]]) ## See pigengene for results.
data(aml) data(mds) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p1 <- one.step.pigengene(Data=d1,saveDir='pigengene', bnNum=10, verbose=1, seed=1, Labels=Labels, toCompact=FALSE, doHeat=FALSE) plot(p1$c5treeRes$c5Trees[["34"]]) ## See pigengene for results.
Gene expression profile of 202 acute myeloid leukemia (AML) cases from Mills et al. study. The profile was compared with the profile of 164 myelodysplastic syndromes (MDS) cases and only the 1000 most differentially expressed genes are included.
data("aml")
data("aml")
A numeric matrix
The columns and rows are named according to the genes Entrez, and patient IDs, respectively. The original data was produced using Affymetrix Human Genome U133 Plus 2.0 Miccoaray. Mills et al. study is part of the MILE Study (Microarray Innovations In LEukemia) program, and aimed at prediction of AML transformation in MDS.
It is a 202*1000
numeric matrix.
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15061
Mills, Ken I., et al. (2009). Microarray-based classifiers and prognosis models identify subgroups with distinct clinical outcomes and high risk of AML transformation of myelodysplastic syndrome. Blood 114.5: 1063-1072.
Pigengene-package
,
one.step.pigengene
,
mds
, pigengene
library(pheatmap) data(aml) pheatmap(aml[,1:20],show_rownames=FALSE)
library(pheatmap) data(aml) pheatmap(aml[,1:20],show_rownames=FALSE)
Takes as input gamma
and epsilon
values and a filter graph, which is represented by
an adjacency matrix named filt
. Applies the filter on the data in either of the two ways:
a) with normalization of the filter by degrees in the graph,
b) without normalization.
apply.filter(gamma, filt, Data, doNormalize=FALSE)
apply.filter(gamma, filt, Data, doNormalize=FALSE)
gamma |
This value is in the [0,1] range and determines the weight of the filter data. Setting to 0 will result in not filtering at all. |
filt |
It is a binary matrix computed by the |
Data |
A matrix or data frame (or list of matrices or data frames) containing the expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. For example, for RNA-Seq data, log(RPKM+1) can be used. |
doNormalize |
If |
filtered |
A filtered matrix computed using the gamma*sData
formula, where sData is the scaled |
Habil Zare and Neda Emami.
make.filter
,
determine.modules
data(aml) data(mds) d1 <- rbind(aml,mds)[, 1:200] Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p0 <- one.step.pigengene(Data=d1, saveDir=".", verbose=1, seed=1, Labels=Labels, naTolerance=0.5, RsquaredCut=0.8, doNetOnly=TRUE) ##Making the filter made <- make.filter(network=p0$Network, epsilon=0.7, outPath=".") ##Applying the filter f1 <- apply.filter(gamma=0.5, filt=made$filt, Data=d1)
data(aml) data(mds) d1 <- rbind(aml,mds)[, 1:200] Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p0 <- one.step.pigengene(Data=d1, saveDir=".", verbose=1, seed=1, Labels=Labels, naTolerance=0.5, RsquaredCut=0.8, doNetOnly=TRUE) ##Making the filter made <- make.filter(network=p0$Network, epsilon=0.7, outPath=".") ##Applying the filter f1 <- apply.filter(gamma=0.5, filt=made$filt, Data=d1)
Oversamples data by repeating rows such that each condition has roughly the same number of samples.
balance(Data, Labels, amplification = 5, verbose = 0, naTolerance=0.05)
balance(Data, Labels, amplification = 5, verbose = 0, naTolerance=0.05)
Data |
A matrix or data frame containing the expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
Labels |
A (preferably named) vector containing the Labels (condition types) for
|
amplification |
An integer that controls the number of repeats for each condition. The number of all samples roughly will be multiplied by this factor after oversampling. |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
naTolerance |
Upper threshold on the fraction of entries per gene that
can be missing. Genes with a larger fraction of missing
entries are ignored. For genes with smaller fraction of NA
entries, the missing values are imputed from their average
expression in the other samples.
See |
A list of:
balanced |
The matrix of oversampled data |
Reptimes |
A vector of integers named by conditions reporting the number of repeats for each condition. |
origSampleInds |
The indices of rows in |
Habil Zare
Pigengene-package
,
one.step.pigengene
, wgcna.one.step
,
compute.pigengene
data(aml) data(mds) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) b1 <- balance(Data=d1, Labels=Labels) d2 <- b1$balanced
data(aml) data(mds) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) b1 <- balance(Data=d1, Labels=Labels) d2 <- b1$balanced
The WGCNA
package assumes that in the coexpression network
the genes are connected with a power-law distribution. Therefore, it need a
soft-thresholding power for network construction, which is estimated
by this auxiliary function.
calculate.beta(saveFile = NULL, RsquaredCut = 0.8, Data, doThreads=FALSE, verbose = 0)
calculate.beta(saveFile = NULL, RsquaredCut = 0.8, Data, doThreads=FALSE, verbose = 0)
saveFile |
The file to save the results in. Set to |
RsquaredCut |
A threshold in the range [0,1] used to estimate the power. A higher value
can increase power. For technical use only. See |
Data |
A matrix or data frame containing the expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
doThreads |
Boolean. Allows WGCNA to run a little faster using multi-threading but might not work on all systems. |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
A list of:
sft |
The full output of |
power |
The estimated power (beta) value |
powers |
The numeric vector of all tried powers |
RsquaredCut |
The value of input argument |
Langfelder P and Horvath S, WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008, 9:559
pickSoftThreshold
,
blockwiseModules
,
one.step.pigengene
,
wgcna.one.step
data(aml) p1 <- calculate.beta(Data=aml[,1:200])
data(aml) p1 <- calculate.beta(Data=aml[,1:200])
NA
s from a data matrixChecks Data
for NA
values.
check.nas(Data, naTolerance=0.05, na.rm=TRUE)
check.nas(Data, naTolerance=0.05, na.rm=TRUE)
Data |
A matrix or data frame containing the expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
naTolerance |
A number in the 0-1 range. If the frequency of |
na.rm |
If |
A list of:
cleaned |
The cleaned data with no |
tooNaGenes |
A character vector of those genes (i.e., column
names of |
replacedNaNum |
The number of |
Habil Zare
check.pigengene.input
, Pigengene-package
data(aml) dim(aml) aml[1:410]<-NA c1 <- check.nas(Data=aml) dim(c1$cleaned) c1$tooNaGenes rm(aml)
data(aml) dim(aml) aml[1:410]<-NA c1 <- check.nas(Data=aml) dim(c1$cleaned) c1$tooNaGenes rm(aml)
Checks Data
and Labels
for NA
values, row and column
names, etc.
check.pigengene.input(Data, Labels, na.rm = FALSE, naTolerance=0.05)
check.pigengene.input(Data, Labels, na.rm = FALSE, naTolerance=0.05)
Data |
A matrix or data frame containing the expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
Labels |
A (preferably named) vector containing the Labels (condition
types) for |
na.rm |
If |
naTolerance |
Upper threshold on the fraction of entries per gene that
can be missing. Genes with a larger fraction of missing
entries are ignored. For genes with smaller fraction of NA
entries, the missing values are imputed from their average
expression in the other samples.
See |
A list of:
Data |
The checked |
Labels |
The checked vector of |
Habil Zare
check.nas
, one.step.pigengene
, Pigengene-package
data(aml) Labels <- c(rep("AML",nrow(aml))) names(Labels) <- rownames(aml) c1 <- check.pigengene.input(Data=aml, Labels=Labels,na.rm=TRUE) Data <- c1$Data Labels <- c1$Labels
data(aml) Labels <- c(rep("AML",nrow(aml))) names(Labels) <- rownames(aml) c1 <- check.pigengene.input(Data=aml, Labels=Labels,na.rm=TRUE) Data <- c1$Data Labels <- c1$Labels
Takes as input two or more adjacency matrices, and the corresponding contributions. Computes a combined network (weighted graph) in which the weight on an edge between two nodes is an average of the weights on the same edge in the input networks.
combine.networks(nets, contributions, outPath, midfix="", powerVector=1:20, verbose=1, RsquaredCut=0.75, minModuleSize=5, doRemoveTOM=TRUE, datExpr, doReturNetworks=FALSE, doSave=FALSE, doIdentifyModule=TRUE)
combine.networks(nets, contributions, outPath, midfix="", powerVector=1:20, verbose=1, RsquaredCut=0.75, minModuleSize=5, doRemoveTOM=TRUE, datExpr, doReturNetworks=FALSE, doSave=FALSE, doIdentifyModule=TRUE)
nets |
A list of adjacency matrices (networks), which can be generated using e.g.,
the |
contributions |
A numeric vector with the same length as nets. In computing the average weight on each edge in the combined network, first the edge weights from individual networks are multiplied by their corresponding contributions, then the result will be divided by the sum of weights of all networks containing this edge. |
outPath |
A string to the path where plots and results will be saved. |
midfix |
An optional string used in the output file names. |
powerVector |
A numeric vector of power values that are tried to
find the best one.
See |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
RsquaredCut |
A threshold in the range [0,1] used to estimate the power. A higher value
can increase power. For technical use only. See |
minModuleSize |
The value that controls the minimum number of genes per module.
See |
doRemoveTOM |
A boolean determining the big TOM file must remove or not. |
datExpr |
The expression matrix that
|
doReturNetworks |
A boolean value to determine whether to return |
doSave |
A boolean value to determine whether the whole output of this
function (typically 1-2 GBs) should be saved as |
doIdentifyModule |
A boolean value to determine whether modules should be
identified. Set it to |
A list with following components
call |
The command that created the results |
midfix |
The input argument |
Network |
The adjacency matrix of the combined network |
denominators |
A matrix, each cell of which is the sum of weights of all networks contributing to the edge corresponding to that cell |
power |
The power (beta) value used for the combined network |
fits |
The fit indices calculated for the combined network |
net |
The output of |
modules |
The output of |
combinedNetworkFile |
The path to the saved file containing |
If the networks have different node sets, the combined network will be computed on the union of nodes.
WGCNA::blockwiseModules
,
WGCNA::TOMsimilarity
, and
WGCNA::pickSoftThreshold.fromSimilarity
data(aml) data(mds) nets <- list() ## Make the coexpression networks: nets[["aml"]] <- abs(stats::cor(aml[,1:200])) nets[["mds"]] <- abs(stats::cor(mds[,1:200])) ## Combine them: combined <- combine.networks(nets=nets, contributions=c(nrow(aml), nrow(mds)), outPath=".", datExpr=rbind(aml, mds)[,1:200]) print(table(combined$modules))
data(aml) data(mds) nets <- list() ## Make the coexpression networks: nets[["aml"]] <- abs(stats::cor(aml[,1:200])) nets[["mds"]] <- abs(stats::cor(mds[,1:200])) ## Combine them: combined <- combine.networks(nets=nets, contributions=c(nrow(aml), nrow(mds)), outPath=".", datExpr=rbind(aml, mds)[,1:200]) print(table(combined$modules))
In a greedy way, this function removes the genes with smaller weight one-by-one, while assessing the accuracy of the predictions of the resulting trees.
compact.tree(c5Tree, pigengene, Data=pigengene$Data, Labels=pigengene$Labels, testD=NULL, testL=NULL, saveDir=".", verbose=0)
compact.tree(c5Tree, pigengene, Data=pigengene$Data, Labels=pigengene$Labels, testD=NULL, testL=NULL, saveDir=".", verbose=0)
c5Tree |
A decision tree of class |
pigengene |
A object of |
Data |
A matrix or data frame containing the expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
Labels |
Labels (condition types) for the (training) expression data. It is a
named vector of characters. |
testD |
The test expression data, for example, from an independent dataset. Optional. |
testL |
Labels (condition types) for the (test) expression data. Optional. |
saveDir |
Where to save the plots of the tree(s) |
verbose |
Integer level of verbosity. 0 means silent and higher values produce more details of computation. |
A list with following elements is invisibly returned:
call |
The call that created the results |
predTrain |
Prediction using projected data without compacting |
predTrainCompact |
Prediction after compacting |
genes |
A character vector of all genes in the full tree before compacting |
genesCompacted |
A character vector of all genes in the compacted tree |
trainErrors |
A matrix reporting errors on the train data. The rows are named according to the number of removed genes. Each column reports the number of misclassified samples in one condition (type) except the last column that reports the total. |
testErrors |
A matrix reporting errors on the test data similar to |
queue |
A numeric vector named by all genes contributing to the full tree before compacting. The numeric values are weights increasingly ordered by absolute value. |
pos |
The number of removed genes |
txtFile |
Confusion matrices and other details on compacting are reported in this text file |
Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia, Foroushani A, Agrahari R, Docking R, Karsan A, and Zare H. In preparation.
Gene shaving as a method for identifying distinct sets of genes with similar expression patterns, Hastie, Trevor, et al. Genome Biol 1.2 (2000): 1-0003.
Pigengene-package
, compute.pigengene
,
make.decision.tree
, C5.0
,
Pigengene-package
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=14:15, doHeat=FALSE,verbose=3, toCompact=FALSE) c1 <- compact.tree(c5Tree=trees$c5Trees[["15"]], pigengene=pigengene, saveDir="compacted", verbose=1)
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=14:15, doHeat=FALSE,verbose=3, toCompact=FALSE) c1 <- compact.tree(c5Tree=trees$c5Trees[["15"]], pigengene=pigengene, saveDir="compacted", verbose=1)
This function takes as input the expression data and module assignments, and
computes an eigengene for each module using PCA. If you already have a
Pigengene object, you can use the project.eigen
function to infer
the values of your eigengenes in a new expression dataset.
compute.pigengene(Data, Labels, modules, saveFile = "pigengene.RData", selectedModules = "All", amplification = 5, doPlot = TRUE, verbose = 0, dOrderByW = TRUE, naTolerance=0.05, doWgcna=FALSE, doMinimize=FALSE)
compute.pigengene(Data, Labels, modules, saveFile = "pigengene.RData", selectedModules = "All", amplification = 5, doPlot = TRUE, verbose = 0, dOrderByW = TRUE, naTolerance=0.05, doWgcna=FALSE, doMinimize=FALSE)
Data |
A matrix or data frame containing the training expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
Labels |
A (preferably named) vector containing the Labels (condition types) for the training Data.
Names must agree with rows of |
modules |
A numeric vector, named by |
saveFile |
The file to save the results. |
selectedModules |
A numeric vector determining which modules to use, or set to "All" (default) to include every module. |
amplification |
An integer that controls the number of repeats for each condition.
The number of all samples roughly will be multiplied by this factor
after oversampling. See |
doPlot |
Boolean determining whether heatmaps of expression of eigengenes
should be ploted and saved. Set it to |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
dOrderByW |
If |
naTolerance |
Upper threshold on the fraction of entries per gene that
can be missing. Genes with a larger fraction of missing
entries are ignored. For genes with smaller fraction of NA
entries, the missing values are imputed from their average
expression in the other samples.
See |
doWgcna |
If |
doMinimize |
If |
Rows of Data
are oversampled using balance
so that
each condition has roughly the same number of samples.
For each module, an eigengene is computed using PCA.
An object of pigengene-class
.
Habil Zare and Amir Foroushani
Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia, Foroushani A, Agrahari R, Docking R, Karsan A, and Zare H. In preparation.
Pigengene-package
,
one.step.pigengene
, wgcna.one.step
,
project.eigen
,
make.decision.tree
, moduleEigengenes
## Data: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) modules33 <- eigengenes33$modules[colnames(d1)] ## Computing: pigengene <- compute.pigengene(Data=d1, Labels=Labels, modules=modules33, saveFile="pigengene.RData", doPlot=TRUE, verbose=3) class(pigengene) plot(pigengene, fontsize=12) ## If you need the pigengene object only to compute eigengenes ## in a new dataset, you can make is much smaller. pigengeneM <- compute.pigengene(Data=d1, Labels=Labels, modules=modules33, saveFile="pigengene.RData", doPlot=TRUE, verbose=1, doMinimize=TRUE) object.size(pigengene)/10^6 ## MB object.size(pigengeneM)/10^6 ## MB
## Data: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) modules33 <- eigengenes33$modules[colnames(d1)] ## Computing: pigengene <- compute.pigengene(Data=d1, Labels=Labels, modules=modules33, saveFile="pigengene.RData", doPlot=TRUE, verbose=3) class(pigengene) plot(pigengene, fontsize=12) ## If you need the pigengene object only to compute eigengenes ## in a new dataset, you can make is much smaller. pigengeneM <- compute.pigengene(Data=d1, Labels=Labels, modules=modules33, saveFile="pigengene.RData", doPlot=TRUE, verbose=1, doMinimize=TRUE) object.size(pigengene)/10^6 ## MB object.size(pigengeneM)/10^6 ## MB
This function computes the distance correlation between every pair of columns of the input data matrix.
dcor.matrix(Data)
dcor.matrix(Data)
Data |
A matrix containing the data |
Using for loops, all pairs of columns are passed to link[energy]{dcor}
function from link[energy]{energy-package}
.
A numeric square matrix. The number of rows and columns is equal to the number
of columns of Data
and they are named accordingly.
This function uses for loops, which are not efficient for an input matrix with too many columns.
Habil Zare
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007), Measuring and Testing Dependence by Correlation of Distances, _Annals of Statistics_, Vol. 35 No. 6, pp. 2769-2794.
<URL: http://dx.doi.org/10.1214/009053607000000505>
Szekely, G.J. and Rizzo, M.L. (2009), Brownian Distance Covariance, _Annals of Applied Statistics_, Vol. 3, No. 4, 1236-1265.
<URL: http://dx.doi.org/10.1214/09-AOAS312>
Szekely, G.J. and Rizzo, M.L. (2009), Rejoinder: Brownian Distance Covariance, _Annals of Applied Statistics_, Vol. 3, No. 4, 1303-1308.
link[energy]{dcor}
## Data: data(aml) dcor1 <- dcor.matrix(Data=aml[,1:5]) dcor1 ## Comparison with Pearson: cor1 <- abs(stats::cor(aml[,1:5])) ## With 202 samples, distance and Pearson correlations do not differ much: dcor1-cor1 dcor2 <- dcor.matrix(Data=aml[1:20,1:5]) cor2 <- abs(stats::cor(aml[1:20,1:5])) ## Distance correlation is more robust if fewer samples are available: dcor2-cor2 plot(dcor2-cor1,cor1-cor2,xlim=c(-0.5,0.5),ylim=c(-0.5,0.5))
## Data: data(aml) dcor1 <- dcor.matrix(Data=aml[,1:5]) dcor1 ## Comparison with Pearson: cor1 <- abs(stats::cor(aml[,1:5])) ## With 202 samples, distance and Pearson correlations do not differ much: dcor1-cor1 dcor2 <- dcor.matrix(Data=aml[1:20,1:5]) cor2 <- abs(stats::cor(aml[1:20,1:5])) ## Distance correlation is more robust if fewer samples are available: dcor2-cor2 plot(dcor2-cor1,cor1-cor2,xlim=c(-0.5,0.5),ylim=c(-0.5,0.5))
Takes as input a network
(i.e., weighted graph) and identifies modules
(i.e., clusters of similar genes) using WGCNA::blockwiseModules
.
It also produces a plot showing the number of genes in each module.
determine.modules(network, outPath, midfix="", powerVector=1:20, verbose=1, RsquaredCut=0.75, minModuleSize=5, doRemoveTOM=FALSE, datExpr, doSave=FALSE)
determine.modules(network, outPath, midfix="", powerVector=1:20, verbose=1, RsquaredCut=0.75, minModuleSize=5, doRemoveTOM=FALSE, datExpr, doSave=FALSE)
network |
An adjacency matrix of the network that is built using |
outPath |
A string to the path where plots and results will be saved. |
midfix |
An optional string used in the output file names. |
powerVector |
A numeric vector of integer values that are tried to find the best power.
See |
verbose |
The integer level of verbosity, where 0 means silent and higher values produce more details. |
RsquaredCut |
A threshold in the range [0,1] used to estimate the power. A higher value
can increase power. For technical use only. See |
minModuleSize |
The value that controls the minimum number of genes per module.
See |
doRemoveTOM |
A boolean determining whether the big TOM file must remove or not. |
datExpr |
The expression matrix that |
doSave |
A boolean value to determine whether the whole output of this
function (typically 1-2 GBs) should be saved as |
A list with the following components:
call |
The call that created the results. |
midfix |
The |
power |
The integer value of the estimated power computed by
|
fits |
The |
modules |
A vector that representing the identified modules. Its length is equal to the number
of nodes in the network, named by node names (i.e., row names of |
net |
The full output of the |
Neda Emami and Habil Zare.
apply.filter
,
combine.networks
,
make.filter
data(aml) ##Making the coexpression network network <- abs(stats::cor(aml[,1:200])) ##Identifying modules identifiedMod <- determine.modules(network=network, outPath=".", datExpr=aml[,1:200]) print(table(identifiedMod$modules))
data(aml) ##Making the coexpression network network <- abs(stats::cor(aml[,1:200])) ##Identifying modules identifiedMod <- determine.modules(network=network, outPath=".", datExpr=aml[,1:200]) print(table(identifiedMod$modules))
Draws the BN using appropriate colors and font size.
draw.bn(BN, plotFile = NULL, inputType = "ENTREZIDat", edgeColor = "blue", DiseaseCol = "darkgreen", DiseaseFill = "red", DiseaseChildFill = "pink", nodeCol = "darkgreen", nodeFill = "yellow", moduleNamesFile = NULL, mainText = NULL, nodeFontSize = 14 * 1.1, verbose = 0)
draw.bn(BN, plotFile = NULL, inputType = "ENTREZIDat", edgeColor = "blue", DiseaseCol = "darkgreen", DiseaseFill = "red", DiseaseChildFill = "pink", nodeCol = "darkgreen", nodeFill = "yellow", moduleNamesFile = NULL, mainText = NULL, nodeFontSize = 14 * 1.1, verbose = 0)
BN |
An object of |
plotFile |
If not |
inputType |
The type of gene IDs in |
edgeColor |
The color of edges |
DiseaseCol |
The color of the border of the Disease node |
DiseaseFill |
The color of the area inside the Disease node |
DiseaseChildFill |
The color of the area inside the children of the Disease node |
nodeCol |
The color of the border of the usual nodes excluding Disease and its children |
nodeFill |
The color of the area inside the usual nodes |
moduleNamesFile |
An optional csv file including the information to rename the nodes name. See coderename.node. |
mainText |
The main text shown at the top of the plot |
nodeFontSize |
Adjusts the size of nodes |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
A list with following components:
call |
The call that created the results |
BN |
An echo of input |
renamedBN |
An object of |
gr |
The full output of |
Habil Zare
bnlearn-package
, Pigengene-package
,
learn.bn
, graph-class
## See lear.bn function.
## See lear.bn function.
This list contains partial eigengenes computed from
AML and MDS gene expression profiles provided by Mills et al.
These data are included to illustrate how to use Pigengene-package
and also to facilitate reproducing the results presented in the corresponding
paper.
data(eigengenes33)
data(eigengenes33)
A list
The top 9166 differentially expressed genes were identified and their expressions
in AML were used for identifying 33 modules. The first column, ME0, corresponds
to module 0 (outliers) and is usually ignored. The eigengene for each module was
obtained using compute.pigengene
function. Oversampling
was performed with amplification=5
to adjust for unbalanced sample-size.
It is a list of 3 objects:
aml
A 202 by 34 matrix.
Each column reports the values of a module eigengene for AML cases.
mds
A 164 by 34 matrix for MDS cases with columns similar to aml.
modules
A numeric vector of length 9166 labeling members of each module. Named by Entrez ID.
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15061
Mills, Ken I., et al. (2009). Microarray-based classifiers and prognosis models identify subgroups with distinct clinical outcomes and high risk of AML transformation of myelodysplastic syndrome. Blood 114.5: 1063-1072.
Pigengene-package
, compute.pigengene
,
aml
, mds
,
learn.bn
library(pheatmap) data(eigengenes33) pheatmap(eigengenes33$aml,show_rownames=FALSE) ## See Pigengene::learn.bn() documentation for more examples.
library(pheatmap) data(eigengenes33) pheatmap(eigengenes33$aml,show_rownames=FALSE) ## See Pigengene::learn.bn() documentation for more examples.
Takse as input gene IDs in a convention, say REFSEQ, and converts them to another convention.
gene.mapping(ids, inputType = "REFSEQ", outputType = "SYMBOL", leaveNA = FALSE, inputDb = "Human", outputDb = inputDb, verbose = 0)
gene.mapping(ids, inputType = "REFSEQ", outputType = "SYMBOL", leaveNA = FALSE, inputDb = "Human", outputDb = inputDb, verbose = 0)
ids |
A character vector of input gene IDs |
inputType |
The type of input IDs. |
outputType |
The type of output IDs. If it is a character vector, mapping will be done for each element. |
leaveNA |
If |
inputDb |
The input data base. Use |
outputDb |
The output data base. If it is a list, mapping will be done for each element. |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
It can map homologous genes between species e.g. from mouse to human. If more than 1 ID found for an input gene, only one of them is returned.
A matrix of characters with 3 columns: input, output1, and
output2. The last one is guaranteed not to be NA
if leaveNA=FALSE
.
Amir Foroushani, Habil Zare, and Rupesh Agrahari
Pages H, Carlson M, Falcon S and Li N. AnnotationDbi: Annotation Database Interface. R package version 1.32.3.
AnnotationDb-class
,
org.Hs.eg.db
org.Mm.eg.db
library(org.Hs.eg.db) g1 <- gene.mapping(ids="NM_001159995") print(g1) ## Mapping to multiple convention library(org.Mm.eg.db) g2 <- gene.mapping(ids=c("NM_170730", "NM_001013580"), inputType="REFSEQ", inputDb=org.Mm.eg.db, outputType=c("SYMBOL","ENTREZID"), outputDb=list(org.Hs.eg.db,org.Mm.eg.db), verbose=1) print(g2)
library(org.Hs.eg.db) g1 <- gene.mapping(ids="NM_001159995") print(g1) ## Mapping to multiple convention library(org.Mm.eg.db) g2 <- gene.mapping(ids=c("NM_170730", "NM_001013580"), inputType="REFSEQ", inputDb=org.Mm.eg.db, outputType=c("SYMBOL","ENTREZID"), outputDb=list(org.Hs.eg.db,org.Mm.eg.db), verbose=1) print(g2)
Takes as input a vector or list of gene IDs in any convention, and performs over representation analysis.
get.enriched.pw(genes, idType, pathwayDb, ont=c("BP", "MF", "CC"), Org="Human", OrgDb=NULL, outPath, pvalueCutoff=0.05, pAdjustMethod="BH", fontSize=14, verbose=0)
get.enriched.pw(genes, idType, pathwayDb, ont=c("BP", "MF", "CC"), Org="Human", OrgDb=NULL, outPath, pvalueCutoff=0.05, pAdjustMethod="BH", fontSize=14, verbose=0)
genes |
A character vector or a named list of genes for which pathway over representation analysis to be done. |
idType |
A string describing the type of input gene ID e.g., "ENTREZID", "REFSEQ", "SYMBOL". |
pathwayDb |
A character vector determining which enrichment database to be used e.g., "GO", "KEGG", "REACTOME", or "NCG". |
ont |
GO ontology terms to be analysed e.g., "BP", "MF" or "CC". Default is all three. |
Org |
A character string equal to "Human" or "Mouse" determining the reference
organism to be used. For "Human" and "Mouse" |
OrgDb |
The reference data base to be used. Use e.g. |
outPath |
A file path where results will be saved. |
pvalueCutoff |
A numerical value that determines a cutoff of adjusted pValue. |
pAdjustMethod |
A string passed to the clusterProfiler:: |
fontSize |
A numerical value that determines the font size of the y-axis and the title in the plot. |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
A list:
enriched |
A list of output of enrichment analysis for different database analyzed. |
noEnrichment |
A vector of database names in which no enriched pathways were found. |
The output is saved for each selected module under the
moduleName_enrichment
folder.
There is a subfolder that includes an excel file and plot(s). Each sheet
in the excel file corresponds to a pathway database (KEGG in the below example).
Each row is an overrepresented pathway.
Isha Mehta, Habil Zare, and Sogand Sajedi
Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He, clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012, 16(5):284-287
Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479
enrichGO
,
enrichKEGG
,
enrichNCG
, enrichPathway
library(org.Hs.eg.db) genes <- c("NM_170730", "NM_001013580", "NM_002142", "NM_003417", "NM_000082", "NM_006158", "NM_006047", "NM_022356", "NM_003979", "NM_001030", "NM_022872") p1 <- get.enriched.pw(genes=genes, idType="REFSEQ", pathwayDb="GO", Org="Human", outPath=getwd(), verbose=1)
library(org.Hs.eg.db) genes <- c("NM_170730", "NM_001013580", "NM_002142", "NM_003417", "NM_000082", "NM_006158", "NM_006047", "NM_022356", "NM_003979", "NM_001030", "NM_022872") p1 <- get.enriched.pw(genes=genes, idType="REFSEQ", pathwayDb="GO", Org="Human", outPath=getwd(), verbose=1)
Taking as input a tree and data, this function determines the leaf each sample will fall in.
get.fitted.leaf(c5Tree, inpDTemp, epsi = 10^(-7))
get.fitted.leaf(c5Tree, inpDTemp, epsi = 10^(-7))
c5Tree |
A decision tree of class |
inpDTemp |
The possibly new data matrix with samples on rows |
epsi |
A small perturbation to resolve the boundary issue |
A numeric vector of node indices named by samples (rows of inpDTemp
)
This function is tricky because C50 uses a global variable.
Amir Foroushani
Pigengene-package
,
make.decision.tree
, compact.tree
,
compute.pigengene
, module.heatmap
,
get.used.features
, preds.at
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) f1 <- get.fitted.leaf(c5Tree=trees$c5Trees[["15"]], inpDTemp=pigengene$eigengenes)
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) f1 <- get.fitted.leaf(c5Tree=trees$c5Trees[["15"]], inpDTemp=pigengene$eigengenes)
This function returns all genes that are left after shrinking (compacting )
a given tree. If enhance
is set to TRUE
, it makes sure that
the output contains at least two genes from each used module.
get.genes(c5Tree = NULL, pigengene = NULL, queue = NULL, modules = NULL, pos=0, enhance = TRUE)
get.genes(c5Tree = NULL, pigengene = NULL, queue = NULL, modules = NULL, pos=0, enhance = TRUE)
queue |
A character vector. The membership queue for a decsision tree. |
pos |
Number of genes that are considered from removal. Same interpretation as in
|
enhance |
If |
modules |
Named character vector listing the module assignments. |
c5Tree |
A decision tree of class |
pigengene |
An object in |
This function needs modules
and queue
, or alternatively,
c5Tree
and pigengene
.
A character vector containing the names of the genes involved in the
modules whose eigengenes are used in the tree. If pos > 0
,
the first pos
such genes with lowest absolute membership in their
respective modules are filtered.
Pigengene-package
,
compact.tree
,preds.at
,
get.used.features
, make.decision.tree
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) g1 <- get.genes(c5Tree=trees$c5Trees[["15"]],pigengene=pigengene)
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) g1 <- get.genes(c5Tree=trees$c5Trees[["15"]],pigengene=pigengene)
Only some of the features will be automatically selected and used in a
decision tree. However, an object of class C5.0
does not have the
selected feature names explicitly. This function parses the tree
component and extracts the names of features contributing to the tree.
get.used.features(c5Tree)
get.used.features(c5Tree)
c5Tree |
A decision tree of class |
A character vector of the names of features (module eigengenes) contributing to the input decision tree.
Amir Foroushani
Pigengene-package
,
make.decision.tree
, compact.tree
,
compute.pigengene
, module.heatmap
,
get.fitted.leaf
, preds.at
,
Pigengene-package
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) get.used.features(c5Tree=trees$c5Trees[["15"]])
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) get.used.features(c5Tree=trees$c5Trees[["15"]])
This function takes as input the eigengenes of all modules and learns a Bayesian network using bnlearn package. It builds several individual networks from random staring networks by optimizing their score. Then, it infers a consensus network from the ones with relatively "higher" scores. The default hyper-parameters and arguments should be fine for most applications.
learn.bn(pigengene=NULL, Data=NULL, Labels=NULL, bnPath = "bn", bnNum = 100, consensusRatio = 1/3, consensusThresh = "Auto", doME0 = FALSE, selectedFeatures = NULL, trainingCases = "All", algo = "hc", scoring = "bde", restart = 0, pertFrac = 0.1, doShuffle = TRUE, use.Hartemink = TRUE, bnStartFile = "None", use.Disease = TRUE, use.Effect = FALSE, dummies = NULL, tasks = "All", onCluster = !(which.cluster()$cluster == "local"), inds = 1:ceiling(bnNum/perJob), perJob = 2, maxSeconds = 5 * 60, timeJob = "00:10:00", bnCalculationJob = NULL, seed = NULL, verbose = 0, naTolerance=0.05)
learn.bn(pigengene=NULL, Data=NULL, Labels=NULL, bnPath = "bn", bnNum = 100, consensusRatio = 1/3, consensusThresh = "Auto", doME0 = FALSE, selectedFeatures = NULL, trainingCases = "All", algo = "hc", scoring = "bde", restart = 0, pertFrac = 0.1, doShuffle = TRUE, use.Hartemink = TRUE, bnStartFile = "None", use.Disease = TRUE, use.Effect = FALSE, dummies = NULL, tasks = "All", onCluster = !(which.cluster()$cluster == "local"), inds = 1:ceiling(bnNum/perJob), perJob = 2, maxSeconds = 5 * 60, timeJob = "00:10:00", bnCalculationJob = NULL, seed = NULL, verbose = 0, naTolerance=0.05)
pigengene |
An object from |
Data |
A matrix or data frame containing the training data with eigengenes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
Labels |
A (preferably named) vector containing the Labels (condition types) for
the training data. Names must agree with rows of |
bnPath |
The path to save the results |
bnNum |
The total number of individual networks. In practice, the
number of learnt networks can be less than |
consensusRatio |
A numeric in the range |
consensusThresh |
A vector of thresholds in the range |
doME0 |
If |
selectedFeatures |
A character vector. If not |
trainingCases |
A character vector that determines which cases (samples) should be considered for learning the network. |
algo |
The algorithm that bnlean uses for optimizing the score. The
default is "hc" (hill climbing).
See |
scoring |
A character determining the scoring criteria. Use 'bde' and 'bic' for
the Bayesian Dirichlet equivalent and Bayesian Information Criterion scores,
respectively. See |
restart |
The number of random restarts. For technical use only.
See |
pertFrac |
A numeric in the range |
doShuffle |
The ordering of the features (eigengenes) is important in
making the initial network. If |
use.Hartemink |
If |
bnStartFile |
Optionally, learning can start from a Bayesian network instead of a random network.
|
use.Disease |
If |
use.Effect |
If |
dummies |
A vector of numeric values in the range |
tasks |
A character vector and a subset of |
onCluster |
A Boolean variable that is |
inds |
The indices of the jobs that are included in the analysis. |
perJob |
The number of individual networks that are learnt by 1 job. |
maxSeconds |
An integer limiting computation time for each training job that runs locally,
i.e., when |
timeJob |
The time in |
bnCalculationJob |
An R script used to submit jobs to the cluster. Set to |
seed |
The random seed that can be set to an integer to reproduce the same results. |
verbose |
Integer level of verbosity. 0 means silent and higher values produce more details of computation. |
naTolerance |
Upper threshold on the fraction of entries per gene that
can be missing. Genes with a larger fraction of missing
entries are ignored. For genes with smaller fraction of NA
entries, the missing values are imputed from their average
expression in the other samples.
See |
For learning a Bayesian network with tens of nodes (eigengenes), bnNum=1000
or higher is recommended. Increasing consensusThresh
generally results
in a network with fewer arcs. Nagarajan et al. proposed a fundamental approach that
determines this hyper-parameter based on the background noise. They use non-parametric
bootstrapping, which is not implemented in the current package yet.
The default values for the rest of the hyper-parameters should be fine for most applications.
A list of:
consensusThresh |
The vector of thresholds as described in the arguments. |
indvPath |
The path where the individual networks were saved. |
moduleFile |
The file containing data in appropriate format for bnlearn package and the blacklist arcs. |
scoreFile |
The file containing the record of the successively jobs and the scores of the corresponding individual networks. |
consensusFile |
The file containing the consensus network and its BDe and BIC scores. |
bnModuleRes |
The result of |
runs |
A list containing the record of successful jobs. |
scores |
The list saved in |
consensusThreshRes |
The full output of |
consensus1 |
The consensus Bayesian network corresponding to the first threshold.
It is the output of |
scorePlot |
The output of |
graphs |
The output of |
timeTaken |
An object of |
use.Disease , use.Effect , use.Hartemink
|
Some of the input arguments. |
Running the jobs on a cluster needs a proper bnCalculationJob
script. Also, the unexported function sbatch()
is adopted for a
particular cluster and may need generalization on other clusters.
Amir Foroushani, Habil Zare, and Rupesh Agrahari
Hartemink A (2001). Principled Computational Methods for the Validation and Discovery of Genetic Regulatory Networks. Ph.D. thesis, School of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
Nagarajan, Radhakrishnan, et al. (2010) Functional relationships between genes associated with differentiation potential of aged myogenic progenitors. Frontiers in Physiology 1.
bnlearn-package
, Pigengene-package
,
compute.pigengene
data(eigengenes33) ms <- 10:20 ## A subset of modules for quick demonstration amlE <- eigengenes33$aml[,ms] mdsE <- eigengenes33$mds[,ms] eigengenes <- rbind(amlE,mdsE) Labels <- c(rep("AML",nrow(amlE)),rep("MDS",nrow(mdsE))) names(Labels) <- rownames(eigengenes) learnt <- learn.bn(Data=eigengenes, Labels=Labels, bnPath="bnExample", bnNum=10, seed=1) bn <- learnt$consensus1$BN ## Visualize: d1 <- draw.bn(BN=bn,nodeFontSize=14) ## What are the children of the Disease node? childrenD <- bnlearn::children(x=bn, node="Disease") print(childrenD) ## Fit the parameters of the Bayesian network: fit <- bnlearn::bn.fit(x=bn, data=learnt$consensus1$Data, method="bayes",iss=10) ## The conditional probability table for a child of the Disease node: fit[[childrenD[1]]] ## The fitted Bayesian network can be used for predicting the labels ## (i.e., values of the Disease node). l2 <- predict(object=fit, node="Disease", data=learnt$consensus1$Data, method="bayes-lw") table(Labels, l2)
data(eigengenes33) ms <- 10:20 ## A subset of modules for quick demonstration amlE <- eigengenes33$aml[,ms] mdsE <- eigengenes33$mds[,ms] eigengenes <- rbind(amlE,mdsE) Labels <- c(rep("AML",nrow(amlE)),rep("MDS",nrow(mdsE))) names(Labels) <- rownames(eigengenes) learnt <- learn.bn(Data=eigengenes, Labels=Labels, bnPath="bnExample", bnNum=10, seed=1) bn <- learnt$consensus1$BN ## Visualize: d1 <- draw.bn(BN=bn,nodeFontSize=14) ## What are the children of the Disease node? childrenD <- bnlearn::children(x=bn, node="Disease") print(childrenD) ## Fit the parameters of the Bayesian network: fit <- bnlearn::bn.fit(x=bn, data=learnt$consensus1$Data, method="bayes",iss=10) ## The conditional probability table for a child of the Disease node: fit[[childrenD[1]]] ## The fitted Bayesian network can be used for predicting the labels ## (i.e., values of the Disease node). l2 <- predict(object=fit, node="Disease", data=learnt$consensus1$Data, method="bayes-lw") table(Labels, l2)
A decision tree in Pigengene-package
uses module eigengenes to
build a classifier that distinguishes the different classes.
Briefly, each eigengene is a weighted average of the expression of
all genes in the module, where the weight of each gene corresponds
to its membership in the module.
make.decision.tree(pigengene, Data, Labels = structure(pigengene$annotation[rownames(pigengene$eigengenes), 1], names = rownames(pigengene$eigengenes)), testD = NULL, testL = NULL, selectedFeatures = NULL, saveDir = "C5Trees", minPerLeaf = NULL, useMod0 = FALSE, costRatio = 1, toCompact = NULL, noise = 0, noiseRepNum = 10, doHeat=TRUE, verbose = 0, naTolerance=0.05)
make.decision.tree(pigengene, Data, Labels = structure(pigengene$annotation[rownames(pigengene$eigengenes), 1], names = rownames(pigengene$eigengenes)), testD = NULL, testL = NULL, selectedFeatures = NULL, saveDir = "C5Trees", minPerLeaf = NULL, useMod0 = FALSE, costRatio = 1, toCompact = NULL, noise = 0, noiseRepNum = 10, doHeat=TRUE, verbose = 0, naTolerance=0.05)
pigengene |
The pigengene object that is used to build the decision tree.
See |
Data |
The training expression data |
Labels |
Labels (condition types) for the (training) expression data. It is a
named vector of characters. |
testD |
The test expression data, for example, from an independent dataset. Optional. |
testL |
Labels (condition types) for the (test) expression data. Optional. |
selectedFeatures |
A numeric vector determining the subset of eigengenes
that should be used as potential predictors. By default
("All"), eigengenes for all modules are considered.
See also |
saveDir |
Where to save the plots of the tree(s). |
minPerLeaf |
Vector of integers. For each value, a tree will be
built requiring at least that many nodes on each leaf. By
default ( |
useMod0 |
Boolean. Wether to allow the tree(s) to use the eigengene of module 0, which corresponds to the set of outlier, as a proper predictor. |
costRatio |
A numeric value effective only for 2 groups classification. The default value (1) considers the misclassification of both conditions as equally disadvatageous. Change this value to a larger or smaller value if you are more interested in the specificity of predictions for condition 1 or condition 2, respectively. |
toCompact |
An integer. The tree with this |
noise , noiseRepNum
|
For development purposes only. These parameters allow investigating the effect of gaussian noise in the expression data on the accurracy of the tree for test data. |
doHeat |
Boolean. Set to |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
naTolerance |
Upper threshold on the fraction of entries per gene that
can be missing. Genes with a larger fraction of missing
entries are ignored. For genes with smaller fraction of NA
entries, the missing values are imputed from their average
expression in the other samples.
See |
This function passes the inut eigengenes and appropriate arguments
to C5.0
function from C50
package.
The effect of test data:
Only when both testD
and testL
are provided,
the test data will be used for a) compacting the trees,
b) plotting heatmaps of expression of genes in the
compacted and full trees, and
c) the noise analysis.
If either of testD
or testL
is NULL
, then
Data
and Labels
are instead used for these purposes.
A list with following elements:
call |
The call that created the results |
c5Trees |
A list, with one element of class |
minPerLeaf |
A numeric vector enumerating all of the attempted minPerLeaf values. |
compacted |
The full output of |
heat |
The output of |
heatCompact |
The output of |
noisy |
The full output of |
leafLocs |
A matrix reporting the leaf for each sample on 1 row. The columns are named
according to the correspoding |
toCompact |
Echos the |
costs |
The cost matrix |
saveDir |
The directory where plots are saved in |
For faster computation in an initial, explanatory run, turn off
compacting, which can take a few minutes, with toCompact=FALSE
.
Pigengene-package
, compute.pigengene
,
compact.tree
, C5.0
,
Pigengene-package
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=14:15, doHeat=FALSE,verbose=3, toCompact=15)
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=14:15, doHeat=FALSE,verbose=3, toCompact=15)
Takes as input the similarity matrix of a graph (i.e., network
) and an epsilon
value.
It computes a filter graph using the epsilon
threshold.
The dimention of the output filter matrix is the same as the input similarity network.
It also produces two plots showing the weighted degrees of the input graph and degrees of the filter,
respectively.
make.filter(network, epsilon, outPath=NULL)
make.filter(network, epsilon, outPath=NULL)
network |
A matrix of similarity for the network. |
epsilon |
A threshold for deciding which edges to keep. If the similarity is less than 1/ |
outPath |
A string determining the path where plots and results will be saved. |
A list with the following components:
filt |
A matrix representing adjacency matrix of the computed filter graph. If the distance between two nodes in the similarity matrix is higher than epsilon, those nodes are connected in the filter graph (i.e., the corresponding entry in the adjacency matrix is 1). Otherwise, the corresponding entry is 0. |
epsilon |
The |
Habil Zare and Neda Emami.
one.step.pigengene
,
apply.filter
data(aml) data(mds) d1 <- rbind(aml,mds)[, 1:200] Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p0 <- one.step.pigengene(Data=d1, saveDir=".", verbose=1, seed=1, Labels=Labels, naTolerance=0.5, RsquaredCut=0.8, doNetOnly=TRUE) ##making the filter made <- make.filter(network=p0$Network, epsilon=0.7, outPath=".")
data(aml) data(mds) d1 <- rbind(aml,mds)[, 1:200] Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p0 <- one.step.pigengene(Data=d1, saveDir=".", verbose=1, seed=1, Labels=Labels, naTolerance=0.5, RsquaredCut=0.8, doNetOnly=TRUE) ##making the filter made <- make.filter(network=p0$Network, epsilon=0.7, outPath=".")
Gene expression profile of 164 myelodysplastic syndromes (MDS) cases from Mills et al. study. The profile was compared with the profile of 202 acute myeloid leukemia (AML) cases and only the 1000 most differentially expressed genes are included.
data("mds")
data("mds")
A numeric matrix
The columns and rows are named according to the genes Entrez, and patient IDs, respectively. The original data was produced using Affymetrix Human Genome U133 Plus 2.0 Miccoaray.Mills et al. study is part of the MILE Study (Microarray Innovations In LEukemia) program, and aimed at prediction of AML transformation in MDS.
It is a 164*1000
numeric matrix.
This profile includes data of the 25 chronic myelomonocytic leukemia (CMLL) cases that can have different expression signatures according to Mills et al.
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15061
Mills, Ken I., et al. (2009). Microarray-based classifiers and prognosis models identify subgroups with distinct clinical outcomes and high risk of AML transformation of myelodysplastic syndrome. Blood 114.5: 1063-1072.
Pigengene-package
,
one.step.pigengene
,
aml
, compute.pigengene
library(pheatmap) data(mds) pheatmap(mds[,1:20],show_rownames=FALSE)
library(pheatmap) data(mds) pheatmap(mds[,1:20],show_rownames=FALSE)
Messages only if verbose is more than 0 and write in a text file if provided.
message.if(me=NULL, verbose=0, txtFile=NULL, append=TRUE, ...)
message.if(me=NULL, verbose=0, txtFile=NULL, append=TRUE, ...)
me |
The Message. Can be a character vector. |
verbose |
A integer. |
txtFile |
The text file in which the message will be written. Set to
|
append |
logical. Set to |
... |
Arguments to be passed to |
NULL
Amir Foroushani
message.if("Hello world!", verbose=1)
message.if("Hello world!", verbose=1)
This function takes as input a tree and an object from
pigengene-class
and per any module used in the tree,
it plots one gene expression heatmap. Alternatively, it can plot
a heatmap for every module in the given pigengene
object.
module.heatmap(c5Tree=NULL, pigengene, mes=NULL, saveDir, testD = NULL, testL = NULL, pos = 0, verbose=0, doAddEigengene=TRUE, scalePngs=1, ...)
module.heatmap(c5Tree=NULL, pigengene, mes=NULL, saveDir, testD = NULL, testL = NULL, pos = 0, verbose=0, doAddEigengene=TRUE, scalePngs=1, ...)
c5Tree |
A decision tree of class |
pigengene |
A object of |
mes |
A character vector that determines which modules to plot, e.g.,
c("ME3","ME5"). Set it to |
saveDir |
Directory to save the plots |
testD , testL
|
Optional. The matrix of (independent) test
expression data, and the corresponding vector of labels. |
pos |
Number of genes to discard. Interpreted the same way as in
|
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
doAddEigengene |
If |
scalePngs |
If not 1, the size of pngs will be adjusted using this parameter. A typical value would be 7. |
... |
Additional arguments. Passed to |
A list of:
call |
The call that created the results |
saveDir |
An echo of the input argument determining where the plots are saved |
Pigengene-package
,
make.decision.tree
, compact.tree
,
compute.pigengene
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Plotting the heatmaps of all modules: module.heatmap(pigengene=pigengene, saveDir="heatmaps", pos=0, verbose=1) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=14:15, doHeat=FALSE,verbose=3, toCompact=15) ## Plotting the heatmaps of only the modules in the tree: module.heatmap(c5Tree=trees$c5Trees[["15"]], pigengene=pigengene, saveDir="treeHeatmaps", pos=0, verbose=1)
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Plotting the heatmaps of all modules: module.heatmap(pigengene=pigengene, saveDir="heatmaps", pos=0, verbose=1) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=14:15, doHeat=FALSE,verbose=3, toCompact=15) ## Plotting the heatmaps of only the modules in the tree: module.heatmap(c5Tree=trees$c5Trees[["15"]], pigengene=pigengene, saveDir="treeHeatmaps", pos=0, verbose=1)
Runs the entire Pigengene pipeline, from gene expression to compact decision trees in a single function. It identifies the gene modules using coexpression network analysis, computes eigengenes, learns a Bayesian network, fits decision trees, and compact them.
one.step.pigengene(Data, saveDir="Pigengene", Labels, testD=NULL, testLabels=NULL, doBalance=TRUE, RsquaredCut=0.8, costRatio=1, toCompact=FALSE, bnNum=0, bnArgs=NULL, useMod0=FALSE, mit="All", verbose=0, doHeat=TRUE, seed=NULL, dOrderByW=TRUE, naTolerance=0.05, doNetOnly=FALSE, doReturNetworks=doNetOnly, idType="ENTREZID", pathwayDb=NULL, OrgDb=org.Hs.eg.db)
one.step.pigengene(Data, saveDir="Pigengene", Labels, testD=NULL, testLabels=NULL, doBalance=TRUE, RsquaredCut=0.8, costRatio=1, toCompact=FALSE, bnNum=0, bnArgs=NULL, useMod0=FALSE, mit="All", verbose=0, doHeat=TRUE, seed=NULL, dOrderByW=TRUE, naTolerance=0.05, doNetOnly=FALSE, doReturNetworks=doNetOnly, idType="ENTREZID", pathwayDb=NULL, OrgDb=org.Hs.eg.db)
Data |
A matrix or data frame (or list of matrices or data frames) containing the training expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. For example, from RNA-Seq data, log(RPKM+1) can be used. |
Labels |
A (preferably named) vector containing the Labels (condition types)
for the training Data. Or, if Data is a list, a list of label vectors
corresponding to the data sets in Data.
Names must agree with rows of |
saveDir |
Directory to save the results. |
testD |
Test expression data with syntax similar to |
testLabels |
A (preferably named) vector containing the Labels (condition types) for
the test Data. This argument is optional and can be set to |
doBalance |
Boolean. Whether the data should be oversampled before identifying the modules so that each condition contribute roughly the same number of samples to clustering. |
RsquaredCut |
A threshold in the range [0,1] used to estimate the power. A higher value
can increase power. For technical use only. See |
costRatio |
A numeric value, the relative cost of misclassifying a sample from the first condition vs. misclassifying a sample from the second condition. |
toCompact |
An integer value determining which decision tree to shrink.
It is the minimum number of genes per leaf imposed when fitting the tree.
Set to |
bnNum |
Desired number of bootstraped Baysian networks.
Set to |
bnArgs |
A list of arguments passed to |
useMod0 |
Boolean, whether to allow module zero (the set of outliers) to be used as a predictor in the decision tree(s). |
mit |
The "module identification type", a character vector determining the reference conditions for clustering. If 'All' (default), clustering is performed using the entire data regardless of condition. |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
doHeat |
If |
seed |
Random seed to ensure reproducibility. |
dOrderByW |
If |
naTolerance |
Upper threshold on the fraction of entries per gene that
can be missing. Genes with a larger fraction of missing
entries are ignored. For genes with smaller fraction of NA
entries, the missing values are imputed from their average
expression in the other samples. See |
doNetOnly |
If |
doReturNetworks |
A boolean value to determine whether to return |
idType |
A string describing the type of input gene ID e.g., "ENTREZID", "REFSEQ", "SYMBOL". |
pathwayDb |
A character vector determining which enrichment database to be used by the
|
OrgDb |
The reference data base to be used. Use e.g. |
This is the main function of the package Pigengene and performs several steps: First, modules are identified in the training expression data, according to mit argument i.e. based on coexpression behaviour in the corresponding conditions. Set it to "All" to use all training data for this step regardless of the condition. If a list of data frames is provided in Data, similarity networks on the data sets are computed and combined into one similarity network for the union of nodes across data sets.
Then, the eigengenes for each module and each sample are calculated, where the expression of an eigengene of a module in a sample is the weighted average of the expression of the genes in that module in the sample. Technically, an eigengene is the first principal component of the gene expression in a module. PCA ensures that the maximum variance accross all the training samples is explained by the eigengene.
Next, (optionally –if bnNum is set to a value greater than 0), several bootstrapped Bayesian networks are learned and combined into a consensus network, in order to detect and illustrate the probabilistic dependencies between the eigengenes and the disease subtype.
Next, decisision tree(s) are built that use the module eigengenes, or
a subset of them, to distinguish the classes (Labels
).
The accurracy of trees is assessed on the train and (if provided) test data.
Finally, the number of required genes for the calculation of the relevant
eigengenes is reduced (the tree is 'compacted'). The accuracy of the tree
is reassessed after removal of each gene.
Along the way, several
self explanatory directories, heatmaps and plots are created and stored under
saveDir
. See make.decision.tree
for the effect of
test data in the process.
A list with the following components:
call |
The call that created the results. |
modules |
A named vector. Names are genes IDs and values are the corresponding module number. |
wgRes |
A list. The results of WGCNA clustering of the Data by
|
betaRes |
A list. The automatically selected beta (power) parameter
which was used for the WGCNA clustering. It is the result of the call to
|
pigengene |
The pigengene object computed for the clusters, result
of |
leanrtBn |
A list. The results of |
selectedFeatures |
A vector of the names of module eigengenes that
were considered during the construction of decision trees.
If |
c5treeRes |
A list. The results of |
The individual functions are exported to facilitated running the pipeline step-by-step in a customized way.
Amir Foroushani, Habil Zare, Rupesh Agrahari, and Meghan Short
Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia, Foroushani A, Agrahari R, Docking R, Karsan A, and Zare H. In preparation.
check.pigengene.input
,
balance
,
calculate.beta
,
wgcna.one.step
,
compute.pigengene
,
project.eigen
,
learn.bn
, make.decision.tree
,
blockwiseModules
library(org.Hs.eg.db) data(aml) data(mds) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p1 <- one.step.pigengene(Data=d1,saveDir=".", bnNum=10, verbose=1, seed=1, Labels=Labels, toCompact=FALSE, doHeat=FALSE) plot(p1$c5treeRes$c5Trees[["34"]])
library(org.Hs.eg.db) data(aml) data(mds) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) p1 <- one.step.pigengene(Data=d1,saveDir=".", bnNum=10, verbose=1, seed=1, Labels=Labels, toCompact=FALSE, doHeat=FALSE) plot(p1$c5treeRes$c5Trees[["34"]])
This function first performs hierarchical clustering on samples (rows of data) within each condition. Then, plots a heatmap without further clustering of rows.
pheatmap.type(Data, annRow, type = colnames(annRow)[1], doTranspose=FALSE, conditions="Auto",...)
pheatmap.type(Data, annRow, type = colnames(annRow)[1], doTranspose=FALSE, conditions="Auto",...)
Data |
A matrix with samples on rows and features (genes) on columns. |
annRow |
A data frame with 1 column or more. Row names must be the same as row names of Data. |
type |
The column of |
doTranspose |
If |
conditions |
A character vector that determines the conditions, and their order, to be included in the heatmap. By default ("Auto"), an alphabetical order of all available conditions in annRow will be used. |
... |
Additional arguments passed to |
A list of:
pheatmapS |
The results of pheatmap function for each condition |
pheat |
The output of final pheatmap function applied on all data |
ordering |
The ordering of the rows in the final heatmap |
annRowAll |
The row annotation used in the final heatmap |
If type
is not determined, by default the first column of
annRow
is used.
Habil Zare
data(eigengenes33) d1 <- eigengenes33$aml d2 <- eigengenes33$mds Disease <- c(rep("AML",nrow(d1)), rep("MDS",nrow(d2))) Disease <- as.data.frame(Disease) rownames(Disease) <- c(rownames(d1), rownames(d2)) p1 <- pheatmap.type(Data=rbind(d1,d2),annRow=Disease,show_rownames=FALSE)
data(eigengenes33) d1 <- eigengenes33$aml d2 <- eigengenes33$mds Disease <- c(rep("AML",nrow(d1)), rep("MDS",nrow(d2))) Disease <- as.data.frame(Disease) rownames(Disease) <- c(rownames(d1), rownames(d2)) p1 <- pheatmap.type(Data=rbind(d1,d2),annRow=Disease,show_rownames=FALSE)
Pigengene
This is a toy example object of class pigengene-class
.
It is used in examples of Pigengene-package
.
Gene expression profile of 202 acute myeloid leukemia (AML) cases
from Mills et al. study. The profile was compared with the profile of 164
myelodysplastic syndromes (MDS) cases and only the 1000 most differentially expressed
genes are included.
data("aml")
data("aml")
An object of pigengene-class
.
The object is made using compute.pigengene
function
from aml
and mds
data as shown
in the examples. The R CMD build --resave-data
trick was used
to reduce the size of saved object from 3.1 MB to 1.4 MB.
It is an object of pigengene-class
.
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15061
Mills, Ken I., et al. (2009). Microarray-based classifiers and prognosis models identify subgroups with distinct clinical outcomes and high risk of AML transformation of myelodysplastic syndrome. Blood 114.5: 1063-1072.
Pigengene-package
, pigengene-class
,
one.step.pigengene
,
mds
, aml
, compute.pigengene
,
project.eigen
library(pheatmap) data(pigengene) plot(pigengene,fontsize=12) ## To reproduce: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) modules33 <- eigengenes33$modules[colnames(d1)] ## Computing: computed <- compute.pigengene(Data=d1, Labels=Labels, modules=modules33, saveFile="pigengene.RData", doPlot=FALSE, verbose=3) class(computed) plot(computed, fontsize=12, main="Reproduced")
library(pheatmap) data(pigengene) plot(pigengene,fontsize=12) ## To reproduce: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) modules33 <- eigengenes33$modules[colnames(d1)] ## Computing: computed <- compute.pigengene(Data=d1, Labels=Labels, modules=modules33, saveFile="pigengene.RData", doPlot=FALSE, verbose=3) class(computed) plot(computed, fontsize=12, main="Reproduced")
A pigengene object holds the eigengenes, weights (memberships) and other related information.
A object of class pigengene
is the output of
compute.pigengene
function. It is a list.
If doMinimize=TRUE
, only the minimal elements needed to project
eigengenes in a new dataset are included
(i.e., see project.eigen(pigengene=NULL)[["projectionaries"]]
).
Otherwise, it contains at least the following components:
call
The call that created the results.
Reptimes
A named vector reporting the number of repeats for each condition
in the oversampling process, which is done by the balance
function.
eigenResults
A list including at least eigengenes
and
varExplained
. If doWgcna=TRUE
, then
this list will be the full output of
moduleEigengenes
function with some fixes, e.g.,
we change eigengenes
to a matrix,
and use genes as its row names. Also, varExplained
is named
according to modules. Setting doWgcna=TRUE
leads to more
memory usage and a larger Pigengene
object likely, with no
advantage.
Data
The data matrix of gene expression.
Labels
A character vector giving the condition (type) for each sample
(row of Data
).
eigengenes
The matrix of eigengenes ordered based on selectedModules
if provided. Rows correspond to samples.
membership
The matrix of weights of genes (rows) in all modules (columns).
orderedModules
The module assignment numeric vector named with genes and ordered
based on module number.
annotation
A data frame containing labeling information useful in plotting.
It has a column named "Condition". Rows have sample names.
saveFile
The file where the pigengene
object is saved.
weightsCsvFile
The file containing the weights in csv format. See dOrderByW=TRUE
.
weights
The weight matrix, which is also saved in csv format. It has
more columns than membership
but rows may be in a different
order if dOrderByW=TRUE
.
heavyToLow
If dOrderByW=TRUE
, this will be the ordering of genes
according to the modules the belong to, where the genes in each
module are ordered based on the absolute value of the weights in
that module. Also, the genes in the csv file are in this order.
For 2 or more groups (conditions), additional (optional) components include:
pvalues
A numeric matrix with columns "pValue", "FDR", and "Bonferroni".
Rows correspond to modules. The null hypothesis is that the eigengene is
expressed with the same distribution in all groups (conditions).
log.pvalues
A data frame with 1 column containing the logarithm of
Bonferroni-adjusted pvalues in base 10.
Pigengene-package
, plot.pigengene
,
wgcna.one.step
, compute.pigengene
,
learn.bn
, make.decision.tree
pigengene
objectPlots a couple of heatmaps of expression of the eigengenes, weights (memberships), and so on. Saves the plots in png format.
## S3 method for class 'pigengene' plot(x, saveDir = NULL, DiseaseColors="Auto", fontsize = 35, doShowColnames = TRUE, fontsizeCol = 25, doClusterCols = ncol(pigengene$eigengenes) > 1, verbose = 2, doShowRownames = "Auto", pngfactor = max(2, ncol(pigengene$eigengenes)/16), do0Mem = FALSE, ...)
## S3 method for class 'pigengene' plot(x, saveDir = NULL, DiseaseColors="Auto", fontsize = 35, doShowColnames = TRUE, fontsizeCol = 25, doClusterCols = ncol(pigengene$eigengenes) > 1, verbose = 2, doShowRownames = "Auto", pngfactor = max(2, ncol(pigengene$eigengenes)/16), do0Mem = FALSE, ...)
x |
The object from |
saveDir |
The dirctory for saving the plots |
DiseaseColors |
A vector of characters determining color for each disease. Names
should match the values in the first column of
|
fontsize |
Passd to |
doShowColnames |
Boolean |
fontsizeCol |
Numeric |
doClusterCols |
Boolean |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
doShowRownames |
Boolean |
pngfactor |
A numeric adjusting the size of the png files |
do0Mem |
If |
... |
Passd to |
Many of the arguments are passed to pheatmap
.
A list of:
heat |
The full output of |
heatNotRows |
The full output of |
Habil Zare ad Amir Foroushani
Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia, Foroushani A, Agrahari R, Docking R, Karsan A, and Zare H. In preparation.
Pigengene-package
,
compute.pigengene
,
pheatmap.type
## Data: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) Labels <- c(rep("AML",nrow(eigengenes33$aml)),rep("MDS",nrow(eigengenes33$mds))) names(Labels) <- rownames(d1) toyModules <- eigengenes33$modules[colnames(d1)] ## Computing: p1 <- compute.pigengene(Data=d1, Labels=Labels, modules=toyModules, saveFile="pigengene.RData", doPlot=TRUE, verbose=3) plot(p1,saveDir="plots")
## Data: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) Labels <- c(rep("AML",nrow(eigengenes33$aml)),rep("MDS",nrow(eigengenes33$mds))) names(Labels) <- rownames(d1) toyModules <- eigengenes33$modules[colnames(d1)] ## Computing: p1 <- compute.pigengene(Data=d1, Labels=Labels, modules=toyModules, saveFile="pigengene.RData", doPlot=TRUE, verbose=3) plot(p1,saveDir="plots")
A decision tree in Pigengene uses module eigengenes to build a classifier that distincuishes two or more classes. Each eigengene is a weighted average of the expression of all genes in the module, where the weight of each gene corresponds to its membership in the module. Each modules might contain dozens to hundreds of genes, and hence the final classifier might depend on the expression of a large number of genes. In practice, it can be desireable to reduce the number of necessary genes used by a decision tree. This function is helpful in observing changes to the classification output after removing genes with lower weights membership. It determines how a given decision tree would classify the expression data after removing a certain number of genes from consideration.
preds.at(c5Tree, pigengene, pos=0, Data)
preds.at(c5Tree, pigengene, pos=0, Data)
c5Tree |
A decision tree that uses eigengenes from the pigengene object to classify the samples from the expression data. |
pigengene |
A object of |
pos |
Number of genes to be removed from the consideration. Genes are removed in ascending order of their absolute weight in the relevant modules. If 0 (default), the prediction will be done without compacting. |
Data |
The expression possibly new data used for classification |
A list with following components:
predictions |
The vector of predictions after neglecting |
eigengenes |
The values for the eigenges after neglecting |
Pigengene-package
, pigengene-class
,
make.decision.tree
, compact.tree
,
compute.pigengene
, module.heatmap
,
get.used.features
, get.fitted.leaf
,
Pigengene-package
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) preds1 <- preds.at(c5Tree=trees$c5Trees[["15"]], pigengene=pigengene, pos=0, Data=d1)
## Data: data(aml) data(mds) data(pigengene) d1 <- rbind(aml,mds) ## Fiting the trees: trees <- make.decision.tree(pigengene=pigengene, Data=d1, saveDir="trees", minPerLeaf=15, doHeat=FALSE,verbose=3, toCompact=FALSE) preds1 <- preds.at(c5Tree=trees$c5Trees[["15"]], pigengene=pigengene, pos=0, Data=d1)
This function projects (new) expression data onto the eigengenes of modules from another dataset. It is useful for comparing the expression behaviour of modules accross (biologically related yet independent) datasets, for evaluating the performance of a classifier on new datasets, and for examining the robustness of a pattern with regards to missing genes.
project.eigen(Data, saveFile = NULL, pigengene, naTolerance = 0.05, verbose = 0, ignoreModules = c())
project.eigen(Data, saveFile = NULL, pigengene, naTolerance = 0.05, verbose = 0, ignoreModules = c())
Data |
A matrix or data frame of expression data to be projected. Genes correspond to columns, and rows correspond to samples. Rows and columns must be named. It is OK to miss a few genes originally used to compute the eigengenes, thereby, projection is robust to choose of platform. |
saveFile |
If not |
pigengene |
An object of |
naTolerance |
Upper threshold on the fraction of entries per gene that
can be missing. Genes with a larger fraction of missing
entries are ignored. For genes with smaller fraction of NA
entries, the missing values are imputed from their average
expression in the other samples.
See |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
ignoreModules |
A vector of integers. In order to speed up the projection, it may be desirable to focus only on the eigengenes of a few interesting modules. In that case, the remaining modules can be listed here and will be ignored during projection (Optional). |
For each module, from the pigengene object, the weight
(membership
) of each gene is retrieved. The eigengene is
computed (inferred) on the new data as alinear combination using
the corresponding weights. The inferred eigengene vector will be
normalized so that it has the same Euclidean norm as the original
eigengene vector.
A list of:
projectionaries |
The character vector of names of minimal elements needed to be in
the |
projected |
The matrix of inferred (projected) eigengenes |
replacedNaNum |
The number of |
tooNaGenes |
A character vector of genes that were ignored because they
had too many |
notMatched |
A character vector of genes in the original eigengene that
could not be matched in the given input |
The new data should use the same type of biolocal identifiers (e.g. Gene Symbols or ENTREZIDs) as the original data for which the pigengene was constructed. It is, however, not required that the new data originate from the same type of technology, e.g. the eigengenes can be based on microarray experiments, whereas the new data comes from an RNA-Seq experiment. Nor is it necessary that the new datset contains measurements for all of the genes from the original modules.
Pigengene-package
, compute.pigengene
moduleEigengenes
## Data: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) toyModules <- eigengenes33$modules[colnames(d1)] ## Computing: p1 <- compute.pigengene(Data=d1, Labels=Labels, modules=toyModules, saveFile="pigengene.RData", doPlot=TRUE, verbose=3, doMinimize=TRUE) ## How robust projecting is? p2 <- project.eigen(Data=d1, pigengene = p1, verbose = 1) plot(p1$eigengenes[,"ME1"],p2$projected[,"ME1"]) stats::cor(p1$eigengenes[,"ME1"],p2$projected[,"ME1"])
## Data: data(aml) data(mds) data(eigengenes33) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) toyModules <- eigengenes33$modules[colnames(d1)] ## Computing: p1 <- compute.pigengene(Data=d1, Labels=Labels, modules=toyModules, saveFile="pigengene.RData", doPlot=TRUE, verbose=3, doMinimize=TRUE) ## How robust projecting is? p2 <- project.eigen(Data=d1, pigengene = p1, verbose = 1) plot(p1$eigengenes[,"ME1"],p2$projected[,"ME1"]) stats::cor(p1$eigengenes[,"ME1"],p2$projected[,"ME1"])
Passes the arguments to manova
, which performs multi-class
analysis of variance.
pvalues.manova(Data, Labels)
pvalues.manova(Data, Labels)
Data |
A matrix or data frame containing the (expression) data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
Labels |
A (preferably named) vector containing the Labels (condition
types). Names must agree with rows of |
A list with following elements:
call |
The call that created the results |
pvals |
The matrix of pvalues with columns "pValue", "FDR", "Bonferroni".
Rows are named according to genes, the columns of |
manovaFit |
The full output of |
oneway.test
function is a better generalizatoion to
Welch's t-tst from 2-calsses to multi-class because it dose not assume that
the variaces are necessarly equal. However, in practice, with "enough number
of samples", the two approaches will lead to similar p-values.
Amir Foroushani
Krzanowski, W. J. (1988) _Principles of Multivariate Analysis. A User's Perspective._ Oxford.
Hand, D. J. and Taylor, C. C. (1987) _Multivariate Analysis of Variance and Repeated Measures._ Chapman and Hall.
B. L. Welch (1951), On the comparison of several mean values: an alternative approach.
oneway.test
, manova
,
compute.pigengene
data(eigengenes33) d1 <- rbind(eigengenes33$aml,eigengenes33$mds) Labels <- c(rep("AML",nrow(eigengenes33$aml)),rep("MDS",nrow(eigengenes33$mds))) names(Labels) <- rownames(d1) ps <- pvalues.manova(Data=d1, Labels=Labels) plot(log10(ps$pvals[,"Bonferroni"]))
data(eigengenes33) d1 <- rbind(eigengenes33$aml,eigengenes33$mds) Labels <- c(rep("AML",nrow(eigengenes33$aml)),rep("MDS",nrow(eigengenes33$mds))) names(Labels) <- rownames(d1) ps <- pvalues.manova(Data=d1, Labels=Labels) plot(log10(ps$pvals[,"Bonferroni"]))
Saves an R object, and reports the size of the saved object in memory and on file.
save.if(x1, file, compress=TRUE, verbose=1, ...)
save.if(x1, file, compress=TRUE, verbose=1, ...)
x1 |
The object to be saved. |
file |
Where to save. If |
compress |
A Boolean or character sent to the |
verbose |
A numeric determining how much detail will be printed. |
... |
Optional arguments to be passed to the |
A list including file, and a vector of sizes of the object in memory and on file.
Amir Foroushani, and Habil Zare
m1 <- matrix(0, nrow=1000, ncol=1000) save.if(m1, file="./m1.RData", verbose=3)
m1 <- matrix(0, nrow=1000, ncol=1000) save.if(m1, file="./m1.RData", verbose=3)
This function is a wrapper function for
WGCNA::blockwiseModules
and passes its arguments to it.
Some other arguments are fixed.
wgcna.one.step(Data, power, saveDir=".", blockSize = "All", saveTOMs = FALSE, doThreads=FALSE, verbose = 0, seed = NULL)
wgcna.one.step(Data, power, saveDir=".", blockSize = "All", saveTOMs = FALSE, doThreads=FALSE, verbose = 0, seed = NULL)
Data |
A matrix or data frame containing the expression data, with genes corresponding to columns and rows corresponding to samples. Rows and columns must be named. |
power |
Soft-thresholding power for network construction |
saveDir |
The directory to save the results and plots. |
blockSize |
The size of block when the data is too big. If not "All" (default) may introduce artifacts. |
saveTOMs |
Boolean determining if the TOM data should be saved, which can be hundreds of MBs and useful for identifying hubs. |
doThreads |
Boolean. Allows WGCNA to run a little faster using multi-threading but might not work on all systems. |
verbose |
The integer level of verbosity. 0 means silent and higher values produce more details of computation. |
seed |
Random seed to ensure reproducibility. |
Data, power, blockSize, saveTOMs, verbose,
and seed
are passed to WGCNA::blockwiseModules
.
A list with following components
call |
The command that created the results |
genes |
The names of |
modules |
A numeric vector, named by |
moduleColors |
A character vector, named by |
net |
The full output of |
netFile |
The file in which the net object is saved |
power |
An echo of the |
Langfelder P and Horvath S, WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008, 9:559
blockwiseModules
,
pickSoftThreshold
,
calculate.beta
data(aml) wgRes <- wgcna.one.step(Data=aml[,1:200], seed=1, power=7, saveDir="wgcna", verbose=1)
data(aml) wgRes <- wgcna.one.step(Data=aml[,1:200], seed=1, power=7, saveDir="wgcna", verbose=1)