This is the documentation for the R implementation of GENIE3.
The GENIE3 method is described in:
Huynh-Thu V. A., Irrthum A., Wehenkel L., and Geurts P. (2010) Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9):e12776.
The GENIE3() function takes as input argument a gene expression matrix exprMatr. Each row of that matrix must correspond to a gene and each column must correspond to a sample. The gene names must be specified in rownames(exprMatr). The sample names can be specified in colnames(exprMatr), but this is not mandatory. For example, the following commands generate a fake expression matrix (for the purpose of this tutorial only):
exprMatr <- matrix(sample(1:10, 100, replace=TRUE), nrow=20)
rownames(exprMatr) <- paste("Gene", 1:20, sep="")
colnames(exprMatr) <- paste("Sample", 1:5, sep="")
head(exprMatr)
## Sample1 Sample2 Sample3 Sample4 Sample5
## Gene1 1 1 6 10 8
## Gene2 9 2 8 10 1
## Gene3 9 4 7 3 10
## Gene4 10 1 1 9 10
## Gene5 6 7 7 8 10
## Gene6 3 6 9 4 7
This matrix contains the expression data of 20 genes from 5 samples. The expression data does not need to be normalised in any particular way (but whether it is normalised, filtered, or log-transformed WILL affect the results!).
The following command runs GENIE3 on the expression data exprMatr with the default parameters:
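A minimal sketch of such a call, assuming the GENIE3 package is installed. The toy matrix is rebuilt so that the block runs on its own, and the seed value is arbitrary; the exact weights will differ from any printed output.

```r
library(GENIE3)

# Rebuild the toy matrix so this block is self-contained
set.seed(123)  # arbitrary seed, for reproducibility
exprMatr <- matrix(sample(1:10, 100, replace=TRUE), nrow=20)
rownames(exprMatr) <- paste("Gene", 1:20, sep="")
colnames(exprMatr) <- paste("Sample", 1:5, sep="")

# Run GENIE3 with all default parameters
weightMat <- GENIE3(exprMatr)

dim(weightMat)        # candidate regulators in rows, target genes in columns
weightMat[1:5, 1:5]   # top-left corner of the weight matrix
```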
## [1] 20 20
## Gene1 Gene2 Gene3 Gene4 Gene5
## Gene1 0.000000000 0.02255463 0.01900663 0.02493799 0.08540018
## Gene2 0.020832784 0.00000000 0.04984650 0.02183917 0.09280144
## Gene3 0.012879129 0.04303619 0.00000000 0.02802560 0.08923393
## Gene4 0.005964072 0.01367879 0.11452255 0.00000000 0.02039693
## Gene5 0.057769795 0.06057299 0.03057792 0.02656094 0.00000000
The algorithm outputs a matrix containing the weights of the putative regulatory links, with higher weights corresponding to more likely regulatory links. weightMat[i,j] is the weight of the link directed from the i-th gene to the j-th gene.
By default, all the genes in exprMatr are used as candidate regulators. The list of candidate regulators can however be restricted to a subset of genes. This can be useful when you know which genes are transcription factors.
# Genes that are used as candidate regulators
regulators <- c(2, 4, 7)
# Or alternatively:
regulators <- c("Gene2", "Gene4", "Gene7")
weightMat <- GENIE3(exprMatr, regulators=regulators)
Here, only Gene2, Gene4 and Gene7 (respectively corresponding to rows 2, 4 and 7 in exprMatr) were used as candidate regulators. In the resulting weightMat, the links that are directed from genes that are not candidate regulators have a weight equal to 0.
To request different regulators for each target gene and return the result as a list:
GENIE3 is based on regression trees. These trees can be learned using either the Random Forest method [1] or the Extra-Trees method [2]. The tree-based method can be specified using the treeMethod parameter (treeMethod="RF" for Random Forests, which is the default choice, or treeMethod="ET" for Extra-Trees).
Each tree-based method has two parameters: K and nTrees. K is the number of candidate regulators that are randomly selected at each tree node for the best split determination. Let p be the number of candidate regulators. K must be one of:
"sqrt", which sets $K=\sqrt{p}$. This is the default value.
"all", which sets $K=p$.
Any integer between 1 and p.
The parameter nTrees specifies the number of trees that are grown per ensemble. It can be set to any strictly positive integer (the default value is 1000).
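To make the meaning of K concrete, here is a small base-R sketch that maps a K setting to the number of candidate regulators tried at each node. The helper resolveK is hypothetical (it is not part of GENIE3), rounding the square root to the nearest integer is an assumption, and an explicit integer is handled on the assumption that K accepts one.

```r
# Hypothetical helper (not part of GENIE3): number of candidate regulators
# examined at each tree node, for a given K setting and p candidates.
resolveK <- function(K, p) {
  if (identical(K, "sqrt")) {
    round(sqrt(p))   # rounding is an assumption; sqrt(p) is rarely an integer
  } else if (identical(K, "all")) {
    p
  } else if (is.numeric(K) && length(K) == 1 && K >= 1 && K <= p) {
    K
  } else {
    stop("K must be \"sqrt\", \"all\", or an integer between 1 and p")
  }
}

p <- 20               # e.g. all 20 genes of the toy matrix are candidates
resolveK("sqrt", p)   # 4, since sqrt(20) is about 4.47
resolveK("all", p)    # 20
resolveK(7, p)        # 7
```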
An example is shown below:
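A possible call, sketched under the assumption that the installed package spells these arguments treeMethod, K and nTrees. The values 7 and 50 are illustrative only, and the toy matrix is rebuilt so the block runs on its own.

```r
library(GENIE3)

# Rebuild the toy matrix so this block is self-contained
set.seed(123)  # arbitrary seed, for reproducibility
exprMatr <- matrix(sample(1:10, 100, replace=TRUE), nrow=20)
rownames(exprMatr) <- paste("Gene", 1:20, sep="")
colnames(exprMatr) <- paste("Sample", 1:5, sep="")

# Extra-Trees, 7 candidate regulators tried per node, 50 trees per ensemble
weightMat <- GENIE3(exprMatr, treeMethod="ET", K=7, nTrees=50)
```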
To decrease computing time, GENIE3 can be run on multiple cores. The parameter nCores specifies the number of cores you want to use. For example:
set.seed(123) # For reproducibility of results
weightMat <- GENIE3(exprMatr, nCores=4, verbose=TRUE)
Note that set.seed() gives the same results across different runs, but only among runs with nCores==1 or among runs with nCores>1. For example, a run with set.seed(123) and nCores=1 and another with the same seed but nCores>1 may give different results.
You can obtain the list of all the regulatory links (from most likely to least likely) with this command:
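A minimal sketch of that command, assuming the GENIE3 package is installed and that the weight matrix comes from a run restricted to the three candidate regulators used earlier. The toy matrix is rebuilt so the block runs on its own; the seed is arbitrary.

```r
library(GENIE3)

# Rebuild the toy matrix so this block is self-contained
set.seed(123)  # arbitrary seed, for reproducibility
exprMatr <- matrix(sample(1:10, 100, replace=TRUE), nrow=20)
rownames(exprMatr) <- paste("Gene", 1:20, sep="")
colnames(exprMatr) <- paste("Sample", 1:5, sep="")

# Weight matrix with Gene2, Gene4 and Gene7 as candidate regulators
weightMat <- GENIE3(exprMatr, regulators=c("Gene2", "Gene4", "Gene7"))

# All links, ranked from most likely to least likely
linkList <- getLinkList(weightMat)
dim(linkList)
head(linkList)
```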
## [1] 57 3
## regulatoryGene targetGene weight
## 1 Gene7 Gene2 0.7991929
## 2 Gene4 Gene16 0.7072811
## 3 Gene4 Gene8 0.6334684
## 4 Gene2 Gene14 0.6256370
## 5 Gene7 Gene12 0.6228297
## 6 Gene4 Gene17 0.5996338
The resulting linkList data frame contains the ranking of links. Each row corresponds to a regulatory link. The first column shows the regulator, the second column shows the target gene, and the last column indicates the weight of the link.
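The relation between a weight matrix and such a ranked list can be sketched in base R on a small made-up matrix. This is only an illustration of the idea, not the package's implementation; the gene names and weights are invented.

```r
# Toy weight matrix (made-up values): rows are regulators, columns are targets
w <- matrix(c(0.0, 0.3, 0.1,
              0.5, 0.0, 0.2,
              0.4, 0.6, 0.0),
            nrow=3, byrow=TRUE,
            dimnames=list(c("GeneA", "GeneB", "GeneC"),
                          c("GeneA", "GeneB", "GeneC")))

# Flatten to one row per (regulator, target) pair
links <- as.data.frame(as.table(w), stringsAsFactors=FALSE)
colnames(links) <- c("regulatoryGene", "targetGene", "weight")

# Drop self-links, then rank by decreasing weight
links <- links[links$regulatoryGene != links$targetGene, ]
links <- links[order(links$weight, decreasing=TRUE), ]
rownames(links) <- NULL
links
```

Here the top-ranked row is the link from GeneC to GeneB, because w["GeneC", "GeneB"] holds the largest off-diagonal weight.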
(Note that the ranking that is obtained will be slightly different from one run to another. This is due to the intrinsic randomness of the Random Forest and Extra-Trees methods. The variance of the ranking can be decreased by increasing the number of trees per ensemble.)
Usually, one is only interested in extracting the most likely regulatory links. The optional parameter reportMax sets the number of top-ranked links to report:
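For instance, a sketch that keeps only the five top-ranked links, assuming the argument is spelled reportMax in the installed package version. The cutoff of 5 is arbitrary and the toy matrix is rebuilt so the block runs on its own.

```r
library(GENIE3)

# Rebuild the toy matrix so this block is self-contained
set.seed(123)  # arbitrary seed, for reproducibility
exprMatr <- matrix(sample(1:10, 100, replace=TRUE), nrow=20)
rownames(exprMatr) <- paste("Gene", 1:20, sep="")
colnames(exprMatr) <- paste("Sample", 1:5, sep="")

weightMat <- GENIE3(exprMatr, regulators=c("Gene2", "Gene4", "Gene7"))

# Report only the 5 highest-weighted links
linkList <- getLinkList(weightMat, reportMax=5)
linkList
```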
Alternatively, a threshold can be set on the weights of the links:
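A sketch of thresholding, assuming the GENIE3 package is installed. The value 0.1 is purely illustrative and carries no statistical meaning; the toy matrix is rebuilt so the block runs on its own.

```r
library(GENIE3)

# Rebuild the toy matrix so this block is self-contained
set.seed(123)  # arbitrary seed, for reproducibility
exprMatr <- matrix(sample(1:10, 100, replace=TRUE), nrow=20)
rownames(exprMatr) <- paste("Gene", 1:20, sep="")
colnames(exprMatr) <- paste("Sample", 1:5, sep="")

weightMat <- GENIE3(exprMatr, regulators=c("Gene2", "Gene4", "Gene7"))

# Keep only the links whose weight reaches the (arbitrary) threshold
linkList <- getLinkList(weightMat, threshold=0.1)
head(linkList)
```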
The weights of the links returned by GENIE3() do not have any statistical meaning and only provide a way to rank the regulatory links. There is therefore no standard threshold value, and caution must be taken when choosing one.
?getLinkList