Title: | Synthesis of microarray-based classification |
---|---|
Description: | This package provides a comprehensive collection of microarray-based classification algorithms from both machine learning and statistics. Variable selection, hyperparameter tuning, evaluation and comparison can be performed either in combination or step by step in a user-friendly environment. |
Authors: | Martin Slawski <[email protected]>, Anne-Laure Boulesteix <[email protected]>, Christoph Bernau <[email protected]>. |
Maintainer: | Roman Hornung <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.65.0 |
Built: | 2024-11-07 05:58:35 UTC |
Source: | https://github.com/bioc/CMA |
The aim of the package is to provide a user-friendly
environment for the evaluation of classification methods using
gene expression data. A strong focus is on combined variable selection,
hyperparameter tuning, evaluation, visualization and comparison of (up to now) 21
classification methods from three main fields: Discriminant Analysis,
Neural Networks and Machine Learning. Although the package has been
created with the intention to be used for microarray data, it can also
be used in other p > n scenarios.
Package: | CMA |
Type: | Package |
Version: | 1.3.3 |
Date: | 2009-9-14 |
License: | GPL (version 2 or later) |
The most important steps of the workflow are:
1. Generate evaluation datasets using GenerateLearningsets
2. (Optionally) Perform variable selection using GeneSelection
3. (Optionally) Perform hyperparameter tuning using tune
4. Perform classification using 1.-3.
5. Repeat 2.-4. based on 1. for several methods: compBoostCMA, dldaCMA, ElasticNetCMA, fdaCMA, flexdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
6. Evaluate the results from 5. using evaluation and make a comparison by calling compare (a minimal end-to-end sketch follows below)
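A minimal end-to-end sketch of this workflow, using the golub data shipped with the package and knnCMA as classifier; the concrete settings (nbgene = 20, stratified 5-fold CV) are illustrative choices, not recommendations:
data(golub)
### class labels and gene expression matrix
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
### 1. generate learning sets
set.seed(321)
ls <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
### 2. variable selection
gsel <- GeneSelection(golubX, golubY, learningsets = ls, method = "t.test")
### 3. hyperparameter tuning
tunek <- tune(golubX, golubY, learningsets = ls, genesel = gsel, nbgene = 20, classifier = knnCMA)
### 4. classification, plugging in the results of 1.-3.
res <- classification(golubX, golubY, learningsets = ls, genesel = gsel, tuneres = tunek, nbgene = 20, classifier = knnCMA)
### 6. evaluation (compare() additionally expects a list of such result lists)
ev <- evaluation(res, measure = "misclassification")
show(ev)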
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Maintainer: Christoph Bernau [email protected].
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
This method can be seen as a visual counterpart to toplist. The plot visualizes
variable importance by a barplot; the heights of the bars correspond to variable importance.
What variable importance exactly means depends on the method chosen when calling GeneSelection, s. genesel.
x |
An object of class |
top |
Number of top genes whose variable importance should be displayed. Defaults to 10. |
iter |
Iteration number ( |
... |
Further graphical options passed to |
No return.
Note the following:
If scheme = "multiclass", only one plot will be made.
Otherwise, one plot will be made for each binary scenario
(depending on whether "scheme" is "one-vs-all" or "pairwise").
Variable importance does not make sense for variable selection (ranking) methods that are essentially discrete, such as the Wilcoxon rank sum statistic or the Kruskal-Wallis statistic.
For the methods "lasso", "elasticnet", "boosting", the number of nonzero coefficients can be very small, resulting in bars of height zero if top has been chosen too large.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
genesel, GeneSelection, toplist
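A brief sketch of this plot method (t-test ranking on the golub data; the values of top and iter are illustrative choices):
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
sel <- GeneSelection(golubX, golubY, learningsets = lset, method = "t.test")
### barplot of the importance of the top 10 genes for the first learning set
plot(sel, top = 10, iter = 1)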
In this package hyperparameter tuning is performed by
an inner cross-validation step for each learningset.
A grid of values is tried and evaluated in terms of the
misclassification rate; the results are saved in an object
of class tuningresult. This method displays
(separately for each learningset) the hyperparameter/
hyperparameter combination that showed the best results. Note
that this need not be unique; in that case, only one combination
is displayed.
best(object, ...)
object |
An object of class |
... |
Currently unused argument. |
A list with elements equal to the number of different learningsets. Each element contains the best hyperparameter combination and the corresponding misclassification rate.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
tune
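A brief sketch (knnCMA on the golub data; the explicit grids setting mirrors the grids element of tuninglist in the classification example and is an illustrative choice):
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
### inner cross-validation over a small grid of k for knnCMA
tunek <- tune(golubX, golubY, learningsets = lset, classifier = knnCMA, grids = list(k = 1:5))
### best hyperparameter combination and misclassification rate per learning set
best(tunek)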
This method displays, as a boxplot, the performance scores
stored in the slot score of an object of class evaloutput.
x |
An object of class |
... |
Further graphical parameters passed to the
classical |
The only return is a boxplot.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
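A brief sketch, combining calls documented elsewhere in this package (DLDA on the golub data; settings are illustrative):
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
cl <- classification(X = golubX, y = golubY, learningsets = lset, classifier = dldaCMA)
ev <- evaluation(cl, measure = "misclassification", scheme = "iterationwise")
### boxplot of the iterationwise misclassification rates
boxplot(ev)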
Most general function in the package, providing an interface
to perform variable selection, hyperparameter tuning and
classification in one step. Alternatively, the first two
steps can be performed separately and can then be plugged into
this function.
For S4 method information, s. classification-methods
.
classification(X, y, f, learningsets, genesel, genesellist = list(), nbgene, classifier, tuneres, tuninglist = list(), trace = TRUE, models=FALSE,...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learningsets |
An object of class |
genesel |
Optional (but usually recommended) object of class
|
genesellist |
In the case that the argument |
nbgene |
Number of best genes to be kept for classification, based
on either
|
classifier |
Name of function ending with |
tuneres |
Analogous to the argument |
tuninglist |
Analogous to the argument |
trace |
Should progress be traced ? Default is |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments passed to the function |
For details about hyperparameter tuning, consult tune
.
A list of objects of class cloutput
and clvarseloutput
,
respectively; its length equals the number of different learningsets
.
The single elements of the list can conveniently be combined using
the join
function. The results can be analyzed and
evaluated by various measures using the method evaluation
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
GeneSelection, tune, evaluation, compBoostCMA, dldaCMA, ElasticNetCMA, fdaCMA, flexdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### a simple k-nearest neighbour example
### datasets
## Not run: 
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
### learningsets
set.seed(111)
lset <- GenerateLearningsets(y=golubY, method = "CV", fold=5, strat =TRUE)
### 1. GeneSelection
selttest <- GeneSelection(golubX, golubY, learningsets = lset, method = "t.test")
### 2. tuning
tunek <- tune(golubX, golubY, learningsets = lset, genesel = selttest, nbgene = 20, classifier = knnCMA)
### 3. classification
knn1 <- classification(golubX, golubY, learningsets = lset, genesel = selttest, tuneres = tunek, nbgene = 20, classifier = knnCMA)
### steps 1.-3. combined into one step:
knn2 <- classification(golubX, golubY, learningsets = lset, genesellist = list(method = "t.test"), classifier = knnCMA, tuninglist = list(grids = list(k = c(1:8))), nbgene = 20)
### show and analyze results:
knnjoin <- join(knn2)
show(knn2)
eval <- evaluation(knn2, measure = "misclassification")
show(eval)
summary(eval)
boxplot(eval)
## End(Not run)
Perform classification for the following signatures:
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
classification
.
Object returned by one of the classifiers (functions ending with CMA
)
learnind
:Vector of indices indicating which observations were used in the learning set.
y
:Actual (true) class labels of predicted observations.
yhat
:Predicted class labels by the classifier.
prob
:A numeric matrix whose number of rows equals the number of predicted observations (length of y/yhat) and whose number of columns equals the number of different classes in the learning set. Rows add up to one. Entry j,k of this matrix contains the probability for the j-th predicted observation to belong to class k. Can be a matrix of NAs if the classifier used does not provide any probabilities.
method
:Name of the classifier used.
mode
:character, one of "binary" (if the number of classes in the learning set is two) or "multiclass" (if there are more than two).
model
:List containing the constructed classifiers.
Use show(cloutput-object)
for brief information
Use ftable(cloutput-object)
to obtain a confusion matrix/cross-tabulation
of y
vs. yhat
, s. ftable,cloutput-method
.
Use plot(cloutput-object)
to generate a probability plot of the matrix
prob
described above, s. plot,cloutput-method
Use roc(cloutput-object) to compute the empirical ROC curve and the
Area Under the Curve (AUC) based on the predicted probabilities, s. roc,cloutput-method.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
clvarseloutput, compBoostCMA, dldaCMA, ElasticNetCMA, fdaCMA, flexdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
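A brief sketch of these convenience methods, applied to the cloutput object returned by dldaCMA on the golub data (settings illustrative):
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
learnind <- sample(length(golubY), size = floor(2/3 * length(golubY)))
res <- dldaCMA(X = golubX, y = golubY, learnind = learnind)
show(res)    ### brief information
ftable(res)  ### confusion matrix of y vs. yhat
plot(res)    ### probability plot of the slot prob
roc(res)     ### empirical ROC curve and AUC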
Object returned by all classifiers that can perform variable selection or compute variable importance. These are:
Random Forest, s. rfCMA
,
Componentwise Boosting, s. compBoostCMA
,
LASSO-logistic regression, s. LassoCMA
,
ElasticNet-logistic regression,
s. ElasticNetCMA
.
Objects of class clvarseloutput extend both the class cloutput and the class varseloutput, s. below.
learnind
:Vector of indices indicating which observations were used in the learning set.
y
:Actual (true) class labels of predicted observations.
yhat
:Predicted class labels by the classifier.
prob
:A numeric matrix whose number of rows equals the number of predicted observations (length of y/yhat) and whose number of columns equals the number of different classes in the learning set. Rows add up to one. Entry j,k of this matrix contains the probability for the j-th predicted observation to belong to class k. Can be a matrix of NAs if the classifier used does not provide any probabilities.
method
:Name of the classifier used.
mode
:character, one of "binary" (if the number of classes in the learning set is two) or "multiclass" (if there are more than two).
varsel
:numeric vector of variable importance measures (for Random Forest) or absolute values of regression coefficients (for the other three methods mentioned above), of which the majority will be zero.
Class "cloutput"
, directly.
Class "varseloutput"
, directly.
Use show(cloutput-object)
for brief information
Use ftable(cloutput-object)
to obtain a confusion matrix/cross-tabulation
of y
vs. yhat
, s. ftable,cloutput-method
.
Use plot(cloutput-object)
to generate a probability plot of the matrix
prob
described above, s. plot,cloutput-method
Use roc(cloutput-object) to compute the empirical ROC curve and the
Area Under the Curve (AUC) based on the predicted probabilities, s. roc,cloutput-method.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
rfCMA, compBoostCMA, LassoCMA, ElasticNetCMA
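A brief sketch showing the additional slot varsel of a clvarseloutput object (Lasso logistic regression on the golub data; the norm.fraction value is an illustrative choice):
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
learnind <- sample(length(golubY), size = floor(2/3 * length(golubY)))
res <- LassoCMA(X = golubX, y = golubY, learnind = learnind, norm.fraction = 0.2)
### number of variables with nonzero coefficient (slot varsel)
sum(res@varsel != 0)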
Classifiers can be evaluated separately using the method
evaluation. Normally, several classifiers
are used for the same dataset and their performance is
compared; this method facilitates that comparison.
For S4 method information, s. compare-methods
compare(clresultlist, measure = c("misclassification", "sensitivity", "specificity", "average probability", "brier score", "auc"), aggfun = meanrm, plot = FALSE, ...)
clresultlist |
A list of lists (!) of objects of class |
measure |
A character vector containing one or more of the elements listed below.
By default, all measures are computed, using
|
aggfun |
Function that determines how the performances from different iterations are aggregated.
Default is |
plot |
Should the performance of different classifiers be visualized by a joint boxplot ?
Default is |
... |
Further arguments passed to |
A data.frame
with rows corresponding to the compared classifiers
and columns to the performance measures, aggregated by aggfun
, s. above.
If more than one measure is computed and plot = TRUE
, one separate
plot is created for each of them.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Dudoit, S., Fridlyand, J., Speed, T. P. (2002)
Comparison of discrimination methods for the classification of tumors
using gene expression data.
Journal of the American Statistical Association 97, 77-87
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
## Not run: 
### compare the performance of several discriminant analysis methods
### for the Khan dataset:
data(khan)
khanX <- as.matrix(khan[,-1])
khanY <- khan[,1]
set.seed(27611)
fiveCV10iter <- GenerateLearningsets(y=khanY, method = "CV", fold = 5, niter = 2, strat = TRUE)
### candidate methods: DLDA, LDA, QDA, pls_LDA, sclda
class_dlda <- classification(X = khanX, y=khanY, learningsets = fiveCV10iter, classifier = dldaCMA)
### perform GeneSelection for LDA, FDA, QDA (using F-tests):
genesel_da <- GeneSelection(X=khanX, y=khanY, learningsets = fiveCV10iter, method = "f.test")
###
class_lda <- classification(X = khanX, y=khanY, learningsets = fiveCV10iter, classifier = ldaCMA, genesel= genesel_da, nbgene = 10)
class_qda <- classification(X = khanX, y=khanY, learningsets = fiveCV10iter, classifier = qdaCMA, genesel = genesel_da, nbgene = 2)
### We now make a comparison concerning the performance (sev. measures):
### first, collect in a list:
dalike <- list(class_dlda, class_lda, class_qda)
### use pre-defined compare function:
comparison <- compare(dalike, plot = TRUE, measure = c("misclassification", "brier score", "average probability"))
print(comparison)
## End(Not run)
Compare different classifiers for the following signatures:
signature 1
For further argument and output information, consult
compare
Roughly speaking, Boosting combines 'weak learners' in a weighted manner into a stronger ensemble.
'Weak learners' here consist of linear functions in one component (variable), as proposed by Buehlmann and Yu (2003).
The method also generates sparsity and can as well be used for variable selection alone (s. GeneSelection).
For S4
method information, see compBoostCMA-methods.
compBoostCMA(X, y, f, learnind, loss = c("binomial", "exp", "quadratic"), mstop = 100, nu = 0.1, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
loss |
Character specifying the loss function - one of |
mstop |
Number of boosting iterations, i.e. number of updates
to perform. The default (100) does not necessarily produce
good results, therefore usage of |
nu |
Shrinkage factor applied to the update steps, defaults to 0.1.
In most cases, it suffices to set |
models |
a logical value indicating whether the model object shall be returned |
... |
Currently unused arguments. |
The method is partly based on code from the package mboost
from T. Hothorn and P. Buehlmann.
The algorithm for the multiclass case is described in Lutz and Buehlmann (2006) as 'rowwise updating'.
An object of class clvarseloutput
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Buehlmann, P., Yu, B. (2003).
Boosting with the L2 loss: Regression and Classification.
Journal of the American Statistical Association, 98, 324-339
Buehlmann, P., Hothorn, T.
Boosting: A statistical perspective.
Statistical Science (to appear)
Lutz, R., Buehlmann, P. (2006).
Boosting for high-multivariate responses in high-dimensional linear regression.
Statistica Sinica 16, 471-494.
dldaCMA, ElasticNetCMA, fdaCMA, flexdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run componentwise (logit)-boosting (not tuned)
result <- compBoostCMA(X=golubX, y=golubY, learnind=learnind, mstop = 500)
### show results
show(result)
ftable(result)
plot(result)
### multiclass example:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression
khanX <- as.matrix(khan[,-1])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(ratio*length(khanY)))
### run componentwise multivariate (logit)-boosting (not tuned)
result <- compBoostCMA(X=khanX, y=khanY, learnind=learnind, mstop = 1000)
### show results
show(result)
ftable(result)
plot(result)
Roughly speaking, Boosting combines 'weak learners' in a weighted manner into a stronger ensemble.
'Weak learners' here consist of linear functions in one component (variable), as proposed by Buehlmann and Yu (2003).
The method also generates sparsity and can as well be used for variable selection alone (s. GeneSelection).
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
compBoostCMA.
Performs a diagonal discriminant analysis under the assumption of a multivariate normal distribution in each class (with equal, diagonally structured covariance matrices). The method is also known as the 'naive Bayes' classifier.
For S4
method information, see dldaCMA-methods.
dldaCMA(X, y, f, learnind, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
models |
a logical value indicating whether the model object shall be returned |
... |
Currently unused argument. |
An object of class cloutput
.
As opposed to linear or quadratic discriminant analysis, variable selection is not strictly necessary.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
McLachlan, G.J. (1992).
Discriminant Analysis and Statistical Pattern Recognition.
Wiley, New York
compBoostCMA, ElasticNetCMA, fdaCMA, flexdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run DLDA
dldaresult <- dldaCMA(X=golubX, y=golubY, learnind=learnind)
### show results
show(dldaresult)
ftable(dldaresult)
plot(dldaresult)
### multiclass example:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression
khanX <- as.matrix(khan[,-1])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(ratio*length(khanY)))
### run DLDA
dldaresult <- dldaCMA(X=khanX, y=khanY, learnind=learnind)
### show results
show(dldaresult)
ftable(dldaresult)
plot(dldaresult)
Performs a diagonal discriminant analysis under the assumption of a multivariate normal distribution in each class (with equal, diagonally structured covariance matrices). The method is also known as the 'naive Bayes' classifier.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
dldaCMA
.
Zou and Hastie (2004) proposed a combined L1/L2 penalty
for regularization and variable selection.
The Elastic Net penalty encourages a grouping
effect, where strongly correlated predictors tend to be in or out of the model together.
The computation is done with the function glmpath
from the package
of the same name.
The method can be used for variable selection alone, s. GeneSelection
.
For S4
method information, see ElasticNetCMA-methods
.
ElasticNetCMA(X, y, f, learnind, norm.fraction = 0.1, alpha=0.5, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
norm.fraction |
L1 Shrinkage intensity, expressed as the fraction
of the coefficient L1 norm compared to the
maximum possible L1 norm (corresponds to |
alpha |
The elasticnet mixing parameter, with 0<alpha<= 1. The penalty is defined as (1-alpha)/2||beta||_2^2+alpha||beta||_1.
|
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments passed to the function |
An object of class clvarseloutput
.
For a strongly related method, s. LassoCMA
.
Up to now, this method can only be applied to binary classification.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Zou, H., Hastie, T. (2004).
Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society B, 67(2),301-320
Park, M. Y., Hastie, T. (2007)
L1-regularization path algorithm for generalized linear models.
Journal of the Royal Statistical Society B, 69(4), 659-677
compBoostCMA, dldaCMA, fdaCMA, flexdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run ElasticNet - penalized logistic regression (no tuning)
result <- ElasticNetCMA(X=golubX, y=golubY, learnind=learnind, norm.fraction = 0.2, alpha=0.5)
show(result)
ftable(result)
plot(result)
Zou and Hastie (2004) proposed a combined L1/L2 penalty
for regularization and variable selection.
The Elastic Net penalty encourages a grouping
effect, where strongly correlated predictors tend to be in or out of the model together.
The computation is done with the function glmpath
from the package
of the same name.
signature 1
signature 2
signature 3
signature 4
For references, further argument and output information, consult
ElasticNetCMA
Object returned by the method evaluation
.
score
:A numeric vector of performance scores whose length depends
on "scheme"
, s.below. It equals the number of
iterations (number of different datasets) if
"scheme = iterationwise"
and the number
of all observations in the complete dataset otherwise.
As not every observation is necessarily predicted
at least once, score can also contain NAs
for those observations that were never classified.
measure
:performance measure used, s. evaluation
.
scheme
:scheme used, s. evaluation
method
:name of the classifier that has been evaluated.
Use show(evaloutput-object)
for brief information.
Use summary(evaloutput-object)
to apply the
classic summary()
function to the slot score
,
s. summary,evaloutput-method
Use boxplot(evaloutput-object)
to display
a boxplot of the slot score
, s. boxplot,evaloutput-method
.
Use obsinfo(evaloutput-object, threshold) to display all
observations consistently correctly or incorrectly classified
(depending on the value of the argument threshold), s. obsinfo.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
The performance of classifiers can be evaluated by six different
measures and two different schemes that are described in more detail
below.
For S4 method information, s. evaluation-methods
.
evaluation(clresult, cltrain = NULL, cost = NULL, y = NULL, measure = c("misclassification", "sensitivity", "specificity", "average probability", "brier score", "auc", "0.632", "0.632+"), scheme = c("iterationwise", "observationwise", "classwise"))
clresult |
A list of objects of class |
cltrain |
An object of class |
cost |
An optional cost matrix used if |
y |
A vector containing the true class labels. Only needed if |
measure |
Performance measure to be used:
|
scheme |
|
An object of class evaloutput
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Efron, B. and Tibshirani, R. (1997).
Improvements on cross-validation: The .632+ bootstrap method.
Journal of the American Statistical Association, 92, 548-560.
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
evaloutput, classification, compare
### simple linear discriminant analysis example using bootstrap datasets:
### datasets:
data(golub)
golubY <- golub[,1]
### extract gene expression from first 10 genes
golubX <- as.matrix(golub[,2:11])
### generate 10 bootstrap datasets
set.seed(333)
bootds <- GenerateLearningsets(y = golubY, method = "bootstrap", ntrain = 30, niter = 10, strat = TRUE)
### run classification()
ldalist <- classification(X=golubX, y=golubY, learningsets = bootds, classifier=ldaCMA)
### Evaluation:
eval_iter <- evaluation(ldalist, scheme = "iter")
eval_obs <- evaluation(ldalist, scheme = "obs")
show(eval_iter)
show(eval_obs)
summary(eval_iter)
summary(eval_obs)
### auc with boxplot
eval_auc <- evaluation(ldalist, scheme = "iter", measure = "auc")
boxplot(eval_auc)
### which observations have often been misclassified ?
obsinfo(eval_obs, threshold = 0.75)
Evaluate classifiers for the following signatures:
signature 1
For further argument and output information, consult
evaluation
.
Fisher's Linear Discriminant Analysis constructs a subspace of
'optimal projections' in which classification is performed.
The directions of optimal projections are computed by the
function cancor
from the package stats
. For
an exhaustive treatment, see e.g. Ripley (1996).
For S4
method information, see fdaCMA-methods.
fdaCMA(X, y, f, learnind, comp = 1, plot = FALSE,models=FALSE)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
|
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
comp |
Number of discriminant coordinates (projections) to compute.
Default is one, must be smaller than or equal to |
plot |
Should the projections onto the space spanned by the optimal
projection directions be plotted ? Default is |
models |
a logical value indicating whether the model object shall be returned |
An object of class cloutput
.
Extensive variable selection usually has to be performed before
fdaCMA can be applied in the p > n setting.
Not reducing the number of variables can result in an error
message.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Ripley, B.D. (1996)
Pattern Recognition and Neural Networks.
Cambridge University Press
compBoostCMA, dldaCMA, ElasticNetCMA, fdaCMA, flexdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression from first 10 genes
golubX <- as.matrix(golub[,2:11])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run FDA
fdaresult <- fdaCMA(X=golubX, y=golubY, learnind=learnind, comp = 1, plot = TRUE)
### show results
show(fdaresult)
ftable(fdaresult)
plot(fdaresult)
### multiclass example:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression from first 10 genes
khanX <- as.matrix(khan[,2:11])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(ratio*length(khanY)))
### run FDA
fdaresult <- fdaCMA(X=khanX, y=khanY, learnind=learnind, comp = 2, plot = TRUE)
### show results
show(fdaresult)
ftable(fdaresult)
plot(fdaresult)
Fisher's Linear Discriminant Analysis constructs a subspace of
'optimal projections' in which classification is performed.
The directions of optimal projections are computed by the
function cancor
from the package stats
. For
an exhaustive treatment, see e.g. Ripley (1996).
signature 1
signature 2
signature 3
signature 4
For references, further argument and output information, consult
fdaCMA
.
The functions listed above are usually not called by the
user but via GeneSelection
.
ttest(X, y, learnind, ...) welchtest(X, y, learnind, ...) ftest(X, y, learnind,...) kruskaltest(X, y, learnind,...) limmatest(X, y, learnind,...) golubcrit(X, y, learnind,...) rfe(X, y, learnind,...) shrinkcat(X,y,learnind,...)
X |
A |
y |
A |
learnind |
An index vector specifying the observations that belong to the learning set. |
... |
Currently unused argument. |
An object of class varseloutput
.
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
This method is experimental.
It is easy to show that, after appropriate scaling of the predictor matrix X,
Fisher's Linear Discriminant Analysis is equivalent to Discriminant Analysis
in the space of the fitted values from the linear regression of the
nlearn x K indicator matrix of the class labels on X.
This gives rise to 'nonlinear discriminant analysis' methods that expand
X in a suitable, more flexible basis. In order to avoid overfitting,
penalization is used. In the implemented version, the linear model is replaced
by a generalized additive one, using the package mgcv.
For S4
method information, s. flexdaCMA-methods
.
flexdaCMA(X, y, f, learnind, comp = 1, plot = FALSE, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
comp |
Number of discriminant coordinates (projections) to compute.
Default is one, must be smaller than or equal to |
plot |
Should the projections onto the space spanned by the optimal
projection directions be plotted ? Default is |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments passed to the function |
An object of class cloutput
.
Extensive variable selection usually has to be performed before
flexdaCMA can be applied in the p > n setting.
Recall that the original predictor dimension is even enlarged;
therefore, the method should be applied only with very few variables.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Ripley, B.D. (1996)
Pattern Recognition and Neural Networks.
Cambridge University Press
compBoostCMA, dldaCMA, ElasticNetCMA, fdaCMA, gbmCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression from first 5 genes
golubX <- as.matrix(golub[,2:6])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run flexible Discriminant Analysis
result <- flexdaCMA(X=golubX, y=golubY, learnind=learnind, comp = 1)
### show results
show(result)
ftable(result)
plot(result)
This method is experimental.
It is easy to show that, after appropriate scaling of the predictor matrix X,
Fisher's Linear Discriminant Analysis is equivalent to Discriminant Analysis
in the space of the fitted values from the linear regression of the
nlearn x K indicator matrix of the class labels on X.
This gives rise to 'nonlinear discriminant analysis' methods that expand
X in a suitable, more flexible basis. In order to avoid overfitting,
penalization is used. In the implemented version, the linear model is replaced
by a generalized additive one, using the package mgcv.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
flexdaCMA
.
An object of class cloutput contains (among others)
the slots y and yhat. The former contains the true,
the latter the predicted class labels. Both are cross-tabulated in
order to obtain a so-called confusion matrix. Counts off the
diagonal are misclassifications.
x |
An object of class |
... |
Currently unused argument. |
No return.
Martin Slawski [email protected]
Anne-Laure Boulesteix http://www.slcmsr.net/boulesteix
For more advanced evaluation: evaluation
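A brief sketch (k-nearest neighbours on the golub data; k = 3 is an illustrative choice):
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
learnind <- sample(length(golubY), size = floor(2/3 * length(golubY)))
res <- knnCMA(X = golubX, y = golubY, learnind = learnind, k = 3)
### cross-tabulation of true (y) vs. predicted (yhat) class labels
ftable(res)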
Roughly speaking, Boosting combines 'weak learners'
in a weighted manner in a stronger ensemble. This
method calls the function gbm.fit
from the
package gbm
. The 'weak learners' are
simple trees that need only very few splits (default: 1).
For S4
method information, see gbmCMA-methods
.
gbmCMA(X, y, f, learnind, models=FALSE,...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments passed to the function
|
An object of class cloutput
.
Up to now, this method can only be applied to binary classification.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Ridgeway, G. (1999).
The state of boosting.
Computing Science and Statistics, 31:172-181
Friedman, J. (2001).
Greedy Function Approximation: A Gradient Boosting Machine.
Annals of Statistics 29(5):1189-1232.
compBoostCMA, dldaCMA, ElasticNetCMA, fdaCMA, flexdaCMA, knnCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run tree-based gradient boosting (no tuning)
gbmresult <- gbmCMA(X=golubX, y=golubY, learnind=learnind, n.trees = 500)
show(gbmresult)
ftable(gbmresult)
plot(gbmresult)
Roughly speaking, Boosting combines 'weak learners'
in a weighted manner in a stronger ensemble. This
method calls the function gbm.fit
from the
package gbm
. The 'weak learners' are
simple trees that need only very few splits (default: 1).
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
gbmCMA.
Due to very small sample sizes, a single division into learning set and test set does not give accurate information about the classification performance. Therefore, several different divisions should be used and aggregated. The implemented methods are discussed in Braga-Neto and Dougherty (2003) and Molinaro et al. (2005), whose terminology is adopted.
This function is usually the basis for all deeper analyses.
GenerateLearningsets(n, y, method = c("LOOCV", "CV", "MCCV", "bootstrap"), fold = NULL, niter = NULL, ntrain = NULL, strat = FALSE)
n |
The total number of observations in the available data set. May be |
y |
A vector of class labels, either |
method |
Which kind of scheme should be used to generate divisions into learning sets and test sets ? Can be one of the following:
|
fold |
Gives the number of CV-groups. Used only when |
niter |
Number of iterations (s. |
ntrain |
Number of observations in the learning sets. Used
only when |
strat |
Logical. Should stratified sampling be performed, i.e. the proportion of observations from each class in the learning sets be the same as in the whole data set ? Does not apply for |
When method="CV", niter gives the number of times
the whole CV procedure is repeated. The output matrix then has fold x niter rows.
When method="MCCV" or method="bootstrap", niter is simply the number of considered
learning sets.
Note that method="CV", fold=n is equivalent to method="LOOCV".
An object of class learningsets
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Braga-Neto, U.M., Dougherty, E.R. (2003).
Is cross-validation valid for small-sample microarray classification ?
Bioinformatics, 20(3), 374-380
Molinaro, A.M., Simon, R., Pfeiffer, R.M. (2005).
Prediction error estimation: a comparison of resampling methods.
Bioinformatics, 21(15), 3301-3307
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
learningsets, GeneSelection, tune, classification
# LOOCV
loo <- GenerateLearningsets(n=40, method="LOOCV")
show(loo)
# five-fold-CV
CV5 <- GenerateLearningsets(n=40, method="CV", fold=5)
show(CV5)
# MCCV
mccv <- GenerateLearningsets(n=40, method = "MCCV", niter=3, ntrain=30)
show(mccv)
# Bootstrap
boot <- GenerateLearningsets(n=40, method="bootstrap", niter=3)
# stratified five-fold-CV
set.seed(113)
classlabels <- sample(1:3, size = 50, replace = TRUE, prob = c(0.3, 0.5, 0.2))
CV5strat <- GenerateLearningsets(y = classlabels, method="CV", fold=5, strat = TRUE)
show(CV5strat)
Object returned from a call to GeneSelection
rankings
:A list of matrices.
For the two-class case and the multi-class case
where a genuine multi-class method has been used
for variable selection, the length of the list is one.
Otherwise, it is named according to the different
binary scenarios (e.g. 1 vs 3
). Each list
element is a matrix
with rows corresponding to
iterations (different learningsets
) and columns
to variables.
Each row thus contains an index vector representing the order of the
variables with respect to their variable importance
(s. slot importance
)
importance
:A list of matrices, with the same structure as
described for the slot rankings
.
Each row of these matrices is ordered according to
rankings and contains the variable importance
measure (absolute value of test statistic or regression
coefficient).
method
:Name of the method used for variable selection, s. GeneSelection
.
scheme
:The scheme used in the case of a non-binary
response, one of "pairwise"
, "one-vs-all"
or "multiclass"
.
Use show(genesel-object)
for brief information
Use toplist(genesel-object, k=10, iter = 1) to display
the top 10 variables and their variable importance
for the first iteration (first learningset), s. toplist.
Use plot(genesel-object, k=10, iter=1) to display
a barplot of the variable importance of the top 10
variables, s. plot,genesel-method.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
For different learning data sets as defined by the argument learningsets
,
this method ranks the genes from the most relevant to the least relevant using
one of various 'filter' criteria or provides a sparse collection of variables
(Lasso, ElasticNet, Boosting). The results are typically used for variable selection for
the classification procedure that follows.
For S4 class information, s. GeneSelection-methods
.
GeneSelection(X, y, f, learningsets, method = c("t.test", "welch.test", "wilcox.test", "f.test", "kruskal.test", "limma", "rfe", "rf", "lasso", "elasticnet", "boosting", "golub", "shrinkcat"), scheme, trace = TRUE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
|
f |
A two-sided formula, if |
learningsets |
An object of class |
method |
A character specifying the method to be used:
|
scheme |
The scheme to be used in the case of a non-binary response. Must be one
of |
trace |
Should the progress be traced ? Default is |
... |
Further arguments passed to the function performing variable selection, s. |
An object of class genesel
.
Most of the methods described above are only suited for the binary classification case. The only ones that can be used without restriction in the multiclass case are f.test, kruskal.test, rf and boosting.
For the rest, pairwise or one-vs-all schemes are used.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Smyth, G. K., Yang, Y.-H., Speed, T. P. (2003).
Statistical issues in microarray data analysis.
Methods in Molecular Biology 224, 111-136.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002).
Gene Selection for Cancer Classification using support
vector machines.
Machine Learning, 46, 389-422
Zou, H., Hastie, T. (2004).
Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society B, 67(2),301-320
Buehlmann, P., Yu, B. (2003).
Boosting with the L2 loss: Regression and Classification.
Journal of the American Statistical Association, 98, 324-339
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004).
Least Angle Regression.
Annals of Statistics, 32:407-499
Buehlmann, P., Yu, B. (2006).
Sparse Boosting.
Journal of Machine Learning Research, 7, 1001-1024
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
filter, GenerateLearningsets, tune, classification
# load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### Generate five different learningsets
set.seed(111)
five <- GenerateLearningsets(y=golubY, method = "CV", fold = 5, strat = TRUE)
### simple t-test:
selttest <- GeneSelection(golubX, golubY, learningsets = five, method = "t.test")
### show result:
show(selttest)
toplist(selttest, k = 10, iter = 1)
plot(selttest, iter = 1)
Performs gene selection for the following signatures:
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
GeneSelection
.
s. below
data(golub)
A data frame with 38 observations and 3052 variables. The first
column (named golub.cl) contains the tumor classes
(ALL = acute lymphoblastic leukaemia, AML = acute myeloid leukaemia).
golub.cl: a factor with levels ALL, AML.
X2-X3051: Gene expression values.
Adopted from the dataset in the package multtest
.
Golub, T., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J., Caligiuri, M. A., Bloomfeld, C. D., Lander, E. S. (1999).
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 286, 531-537.
data(golub)
The method classification
returns a list of
class cloutput
or clvarseloutput
.
It is often more convenient to work with an object of class
cloutput instead of with a whole list, e.g.
because the convenience methods defined for that class can
be used.
For S4 method information, s. join-methods
join(cloutputlist)
cloutputlist |
A list of objects of classes |
An object of class cloutput
.
Warning: If the elements of cloutputlist have originally been of class clvarseloutput, the slot varsel will be dropped!
The result of the join method is incompatible with the methods
evaluation and compare; these require the lists returned by
classification.
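A brief sketch (DLDA over five CV learning sets on the golub data; settings illustrative):
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
ls <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
res <- classification(golubX, golubY, learningsets = ls, classifier = dldaCMA)
### combine the per-learningset results into a single cloutput object
joined <- join(res)
show(joined)
ftable(joined)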
The list of objects of class cloutput
can be unified
into one object for the following signatures:
signature 1
For further argument and output information, consult
join
.
s. below
data(khan)
A data frame with 63 observations on the following 2309 variables.
The first column (named khanY) contains the tumor classes
(BL = Burkitt Lymphoma, EWS = Ewing Sarcoma, NB = Neuroblastoma, RMS = Rhabdomyosarcoma).
khanY: a factor with levels BL, EWS, NB, RMS.
X2-X2309: Gene expression values.
Adopted from the dataset in the package pamr
.
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., Meltzer, P. S., (2001).
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks.
Nature Medicine 7, 673-679.
data(khan)
Ordinary k
nearest neighbours algorithm from the
very fast implementation in the package class
.
For S4
method information, see knnCMA-methods.
knnCMA(X, y, f, learnind, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that belong to the learning set. Must not be missing for this method. |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments to be passed to |
An object of class cloutput
.
Class probabilities are not returned. For a probabilistic
variant of knn
, s. pknnCMA
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Ripley, B.D. (1996)
Pattern Recognition and Neural Networks.
Cambridge University Press
compBoostCMA, dldaCMA, ElasticNetCMA, fdaCMA, flexdaCMA, gbmCMA, ldaCMA, LassoCMA, nnetCMA, pknnCMA, plrCMA, pls_ldaCMA, pls_lrCMA, pls_rfCMA, pnnCMA, qdaCMA, rfCMA, scdaCMA, shrinkldaCMA, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run k-nearest neighbours
result <- knnCMA(X=golubX, y=golubY, learnind=learnind, k = 3)
### show results
show(result)
ftable(result)
### multiclass example:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression
khanX <- as.matrix(khan[,-1])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(ratio*length(khanY)))
### run knn
result <- knnCMA(X=khanX, y=khanY, learnind=learnind, k = 5)
### show results
show(result)
ftable(result)
Ordinary k
nearest neighbours algorithm from the
very fast implementation in the package class
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
knnCMA
.
The Lasso (Tibshirani, 1996) is one of the most popular
tools for simultaneous shrinkage and variable selection. Recently,
Friedman, Hastie and Tibshirani (2008) have developed an algorithm to
compute the entire solution path of the Lasso for an arbitrary
generalized linear model, implemented in the package glmnet
.
The method can be used for variable selection alone, s. GeneSelection
.
For S4
method information, see LassoCMA-methods
.
LassoCMA(X, y, f, learnind, norm.fraction = 0.1,models=FALSE,...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
norm.fraction |
L1 Shrinkage intensity, expressed as the fraction
of the coefficient L1 norm compared to the
maximum possible L1 norm (corresponds to |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments passed to the function |
An object of class clvarseloutput
.
For a strongly related method, s. ElasticNetCMA
.
Up to now, this method can only be applied to binary classification.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Tibshirani, R. (1996)
Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society B, 58(1), 267-288
Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization
Paths for Generalized Linear Models via Coordinate Descent
http://www-stat.stanford.edu/~hastie/Papers/glmnet.pdf
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run L1 penalized logistic regression (no tuning)
lassoresult <- LassoCMA(X=golubX, y=golubY, learnind=learnind, norm.fraction = 0.2)
show(lassoresult)
ftable(lassoresult)
plot(lassoresult)
The Lasso (Tibshirani, 1996) is one of the most popular
tools for simultaneous shrinkage and variable selection. Recently,
Friedman, Hastie and Tibshirani (2008) have developed an algorithm to
compute the entire solution path of the Lasso for an arbitrary
generalized linear model, implemented in the package glmnet
.
The method can be used for variable selection alone, s. GeneSelection
signature 1
signature 2
signature 3
signature 4
For references, further argument and output information, consult
LassoCMA
.
Performs a linear discriminant analysis under the assumption
of a multivariate normal distribution in each class, with equal but
generally structured covariance matrices. The function lda
from
the package MASS
is called for computation.
For S4
method information, see ldaCMA-methods.
ldaCMA(X, y, f, learnind, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments to be passed to |
An object of class cloutput
.
Extensive variable selection usually has to be performed before
ldaCMA
can be applied in the p > n
setting.
Not reducing the number of variables can result in an error
message.
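A minimal sketch of one way to follow this advice, combining t-test based gene selection with ldaCMA within the package's resampling workflow; the fold number and the choice of 20 genes are illustrative only, and the argument names follow those documented for tune and prediction:

### load data and generate learning sets
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
### rank genes on each learning set, then keep only the 20 top-ranked genes for LDA
gsel <- GeneSelection(X = golubX, y = golubY, learningsets = lset, method = "t.test")
ldares <- classification(X = golubX, y = golubY, learningsets = lset,
                         genesel = gsel, nbgene = 20, classifier = ldaCMA)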
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
McLachlan, G.J. (1992).
Discriminant Analysis and Statistical Pattern Recognition.
Wiley, New York
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, LassoCMA
, nnetCMA
,
pknnCMA
, plrCMA
, pls_ldaCMA
,
pls_lrCMA
, pls_rfCMA
, pnnCMA
,
qdaCMA
, rfCMA
, scdaCMA
,
shrinkldaCMA
, svmCMA
## Not run:
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression from first 10 genes
golubX <- as.matrix(golub[,2:11])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run LDA
ldaresult <- ldaCMA(X=golubX, y=golubY, learnind=learnind)
### show results
show(ldaresult)
ftable(ldaresult)
plot(ldaresult)
### multiclass example:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression from first 10 genes
khanX <- as.matrix(khan[,2:11])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(ratio*length(khanY)))
### run LDA
ldaresult <- ldaCMA(X=khanX, y=khanY, learnind=learnind)
### show results
show(ldaresult)
ftable(ldaresult)
plot(ldaresult)
## End(Not run)
Performs a linear discriminant analysis for the following signatures:
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
ldaCMA
.
An object returned from GenerateLearningsets
which is usually passed as an argument to GeneSelection
,
tune
and classification
.
learnmatrix
: A matrix of dimension niter x
ntrain
. Each row contains the indices of those observations
representing the learningset for one iteration. If method =
CV
, zeros appear due to rounding issues.
method
:The method used to generate the learnmatrix
,
s. GenerateLearningsets
ntrain
:Number of observations in one learning set. If
method = CV
, this number is not attained for all
iterations, due to rounding issues.
iter
:Number of iterations (different learningsets)
that are stored in learnmatrix
.
Use show(learningsets-object)
for brief information.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
GenerateLearningsets
, GeneSelection
,
tune
, classification
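A short sketch of how such an object is typically created and inspected; the slot access via @ merely illustrates the slots described above:

data(golub)
golubY <- golub[,1]
set.seed(321)
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
show(lset)              # brief summary
dim(lset@learnmatrix)   # niter x ntrain
lset@method             # "CV"
lset@iter               # number of iterations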
This method provides access to the function
nnet
in the package of the same name that trains
Feed-forward Neural Networks with one hidden layer.
For S4
method information, see nnetCMA-methods
nnetCMA(X, y, f, learnind, eigengenes = FALSE, models=FALSE,...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
eigengenes |
Should the training be performed in the space of
eigengenes obtained from a singular value decomposition
of the gene expression data matrix? Default is |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments passed to the function
|
An object of class cloutput
.
Extensive variable selection is usually necessary if eigengenes = FALSE
Different runs of this method on the same dataset do not necessarily produce the same results, because the optimization of feed-forward neural networks is rather difficult and depends on the (normally randomly chosen) starting values of the network weights.
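As a consequence, a minimal sketch for making a single run reproducible is to fix the random number generator seed immediately before the call (golubX, golubY and learnind as in the example further below; the comparison assumes, as for predoutput, that the predicted labels are stored in the yhat slot):

set.seed(222)
nnetres1 <- nnetCMA(X=golubX, y=golubY, learnind=learnind, size = 3, decay = 0.01)
set.seed(222)
nnetres2 <- nnetCMA(X=golubX, y=golubY, learnind=learnind, size = 3, decay = 0.01)
### identical predictions thanks to the fixed seed
identical(nnetres1@yhat, nnetres2@yhat)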
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Ripley, B.D. (1996)
Pattern Recognition and Neural Networks.
Cambridge University Press
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression from first 10 genes
golubX <- as.matrix(golub[,2:11])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run nnet (not tuned)
nnetresult <- nnetCMA(X=golubX, y=golubY, learnind=learnind, size = 3, decay = 0.01)
### show results
show(nnetresult)
ftable(nnetresult)
plot(nnetresult)
### in the space of eigengenes (not tuned)
golubXfull <- as.matrix(golub[,-1])
nnetresult <- nnetCMA(X=golubXfull, y=golubY, learnind = learnind, eigengenes = TRUE, size = 3, decay = 0.01)
### show results
show(nnetresult)
ftable(nnetresult)
plot(nnetresult)
This method provides access to the function
nnet
in the package of the same name that trains
Feed-forward Neural Networks with one hidden layer.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
nnetCMA
.
Some observations are harder to classify than others. It is frequently of interest to know which observations are consistently misclassified; these are candidates for outliers or wrong class labels.
object |
An object of class |
threshold |
threshold value of (observation-wise) performance measure,
s. |
show |
Should the information be printed ? Default is |
Since not every observation is necessarily classified at least once, observations that were never classified are also shown.
A list with two components
misclassification |
A |
notclassified |
The indices of those observations not classified at all, s. details. |
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
Nearest neighbour variant that replaces the simple voting scheme by a weighted one (based on Euclidean distances). These weights are also used to compute class probabilities.
For S4
class information, see pknnCMA-methods.
pknnCMA(X, y, f, learnind, beta = 1, k = 1, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that belong to the learning set. Must not be missing for this method. |
beta |
Slope parameter for the logistic function which is used for the computation of class probabilities. The default value (1) may not produce reasonable results and can generate warnings. |
k |
Number of nearest neighbours to use. |
models |
a logical value indicating whether the model object shall be returned |
... |
Currently unused argument. |
The algorithm is as follows:
Determine the k
nearest neighbours
For each class represented among these, compute the average Euclidean distance.
The negative distances are plugged into the logistic function
with parameter beta
.
Classify into the class with highest probability.
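The following is an illustrative re-implementation of this voting scheme in plain R, not the code used by the package; the function name pknn_probs, the argument names and the final normalization of the probabilities are assumptions made for this sketch only:

pknn_probs <- function(Xlearn, ylearn, xnew, k = 3, beta = 1){
  ### 1. Euclidean distances of the new observation to all learning observations
  d <- sqrt(colSums((t(Xlearn) - xnew)^2))
  ### 2. indices of the k nearest neighbours
  nn <- order(d)[seq_len(k)]
  ### 3. average distance per class represented among the neighbours
  avgd <- tapply(d[nn], factor(ylearn[nn]), mean)
  ### 4. negative distances plugged into the logistic function, then normalized
  p <- plogis(-beta * avgd)
  p/sum(p)
}
### the class with the highest probability is the predicted class, e.g.
### names(which.max(pknn_probs(Xlearn, ylearn, xnew, k = 3)))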
An object of class cloutput
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run probabilistic k-nearest neighbours
result <- pknnCMA(X=golubX, y=golubY, learnind=learnind, k = 3)
### show results
show(result)
ftable(result)
plot(result)
Nearest neighbour variant that replaces the simple voting scheme by a weighted one (based on euclidean distances). This is also used to compute class probabilities.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
pknnCMA.
Given two variables, the method trains a classifier
(argument classifier
) based on these two variables
and plots the resulting class regions, learning- and test
observations in the plane.
Appropriate variables are usually found by GeneSelection
.
For S4 method information, s. Planarplot-methods
.
Planarplot(X, y, f, learnind, predind, classifier, gridsize = 100, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
|
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
predind |
A vector containing exactly two indices that denote the two variables used for classification. |
classifier |
Name of function ending with |
gridsize |
The gridsize used for two-dimensional plotting. For both variables specified in |
... |
Further argument passed to |
No return.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected].
Idea is from the MLInterfaces
package, contributed
by Jess Mar, Robert Gentleman and Vince Carey.
GeneSelection
,
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### simple linear discrimination for the golub data:
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
golubn <- nrow(golubX)
set.seed(111)
learnind <- sample(golubn, size=floor(2/3*golubn))
Planarplot(X=golubX, y=golubY, learnind=learnind, predind=c(2,4), classifier=ldaCMA)
Given two variables, the method trains a classifier
(argument classifier
) based on these two variables
and plots the resulting class regions, learning- and test
observations in the plane.
Appropriate variables are usually found by GeneSelection
.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
Planarplot
.
A popular way of visualizing the output of a classifier
is to plot, separately for each class, the predicted
probability of each predicted observation for the respective class.
For this purpose, the plot area is divided into K
parts, where K
is the number of classes.
Predicted observations are assigned, according to their
true class, to one of those parts. Then, for each part
and each predicted observation, the predicted probabilities
are plotted, displayed by coloured dots, where each
colour corresponds to one class.
x |
An object of class |
main |
A title for the plot (character). |
No return.
The plot usually only makes sense if a sufficiently large number
of observations has been classified. This is usually achieved
by running the classifier on several learningsets
with the method classification
. The output can
then be processed via join
to obtain an object
of class cloutput
to which this method can be applied.
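A minimal sketch of this workflow; the classifier choice, the number of iterations and the training-set size are illustrative only:

data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
### several learning sets via Monte-Carlo cross-validation
lset <- GenerateLearningsets(y = golubY, method = "MCCV", niter = 10, ntrain = 25, strat = TRUE)
### run a classifier on all learning sets, join the per-iteration results and plot
res <- classification(X = golubX, y = golubY, learningsets = lset, classifier = scdaCMA)
plot(join(res), main = "scdaCMA on the Golub data")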
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
After hyperparameter tuning using tune
it is useful
to see which choice of hyperparameters is suitable and how good the
performance is.
x |
An object of class |
iter |
Iteration number ( |
which |
Character vector (maximum length is two) naming
the arguments for which tuning results should
be displayed. Default is |
... |
Further graphical options passed either to |
No return.
Frequently, several hyperparameters (or hyperparameter combinations) perform "best",
s. also the remark in best
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
High dimensional logistic regression combined with an
L2-type (Ridge-)penalty.
Multiclass case is also possible.
For S4
method information, see plrCMA-methods
plrCMA(X, y, f, learnind, lambda = 0.01, scale = TRUE, models=FALSE,...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
lambda |
Parameter governing the amount of penalization.
This hyperparameter should be |
scale |
Scale the predictors as specified by |
models |
a logical value indicating whether the model object shall be returned |
... |
Currently unused argument. |
An object of class cloutput
.
Special thanks go to
Ji Zhu (University of Michigan, Ann Arbor)
Trevor Hastie (Stanford University)
who provided the basic code that was then adapted by
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected].
Zhu, J., Hastie, T. (2004). Classification of gene microarrays by penalized logistic regression.
Biostatistics 5:427-443.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run penalized logistic regression (no tuning)
plrresult <- plrCMA(X=golubX, y=golubY, learnind=learnind)
### show results
show(plrresult)
ftable(plrresult)
plot(plrresult)
### multiclass example:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression
khanX <- as.matrix(khan[,-1])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(ratio*length(khanY)))
### run penalized logistic regression (no tuning)
plrresult <- plrCMA(X=khanX, y=khanY, learnind=learnind)
### show results
show(plrresult)
ftable(plrresult)
plot(plrresult)
High dimensional logistic regression combined with an L2-type (Ridge-)penalty. Multiclass case is also possible.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
plrCMA
.
This method constructs a classifier that extracts
Partial Least Squares components that are plugged into
Linear Discriminant Analysis.
The Partial Least Squares components are computed by the package
plsgenomics
.
For S4
method information, see pls_ldaCMA-methods
.
pls_ldaCMA(X, y, f, learnind, comp = 2, plot = FALSE,models=FALSE)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
comp |
Number of Partial Least Squares components to extract.
Default is 2 which can be suboptimal, depending on the
particular dataset. Can be optimized using |
plot |
If |
models |
a logical value indicating whether the model object shall be returned |
An object of class cloutput
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Nguyen, D., Rocke, D. M., (2002).
Tumor classification by partial least squares using microarray gene expression data.
Bioinformatics 18, 39-50
Boulesteix, A.L., Strimmer, K. (2007).
Partial least squares: a versatile tool for the analysis of high-dimensional genomic data.
Briefings in Bioinformatics 7:32-44.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
## Not run:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression
khanX <- as.matrix(khan[,-1])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(2/3*length(khanY)))
### run PLS followed by LDA, without tuning
plsresult <- pls_ldaCMA(X=khanX, y=khanY, learnind=learnind, comp = 4)
### show results
show(plsresult)
ftable(plsresult)
plot(plsresult)
## End(Not run)
This method constructs a classifier that extracts
Partial Least Squares components that are plugged into
Linear Discriminant Analysis.
The Partial Least Squares components are computed by the package
plsgenomics
.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
pls_ldaCMA.
This method constructs a classifier that extracts
Partial Least Squares components that form the covariates
in a binary logistic regression model.
The Partial Least Squares components are computed by the package
plsgenomics
.
For S4
method information, see pls_lrCMA-methods
.
pls_lrCMA(X, y, f, learnind, comp = 2, lambda = 1e-4, plot = FALSE,models=FALSE)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
comp |
Number of Partial Least Squares components to extract.
Default is 2 which can be suboptimal, depending on the
particular dataset. Can be optimized using |
lambda |
Parameter controlling the amount of L2 penalization for logistic regression, usually taken to be a small value in order to stabilize estimation in the case of separable data. |
plot |
If |
models |
a logical value indicating whether the model object shall be returned |
An object of class cloutput
.
Up to now, only the two-class case is supported.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Boulesteix, A.L., Strimmer, K. (2007).
Partial least squares: a versatile tool for the analysis of high-dimensional genomic data.
Briefings in Bioinformatics 7:32-44.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run PLS, combined with logistic regression
result <- pls_lrCMA(X=golubX, y=golubY, learnind=learnind)
### show results
show(result)
ftable(result)
plot(result)
This method constructs a classifier that extracts
Partial Least Squares components that form the covariates
in a binary logistic regression model.
The Partial Least Squares components are computed by the package
plsgenomics
.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
pls_lrCMA
This method constructs a classifier that extracts
Partial Least Squares components used to generate Random Forests, s. rfCMA
.
For S4
method information, see pls_rfCMA-methods
.
pls_rfCMA(X, y, f, learnind, comp = 2 * nlevels(as.factor(y)), seed = 111,models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
comp |
Number of Partial Least Squares components to extract. Default is two times the number of different classes. |
seed |
Fix Random number generator seed to |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments to be passed to |
An object of class cloutput
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Boulesteix, A.L., Strimmer, K. (2007).
Partial least squares: a versatile tool for the analysis of high-dimensional genomic data.
Briefings in Bioinformatics 7:32-44.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run PLS, combined with Random Forest
#result <- pls_rfCMA(X=golubX, y=golubY, learnind=learnind)
### show results
#show(result)
#ftable(result)
#plot(result)
This method constructs a classifier that extracts
Partial Least Squares components used to generate Random Forests, s. rfCMA
.
The Partial Least Squares components are computed by the package
plsgenomics
.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
pls_rfCMA
.
Probabilistic Neural Networks is the term Specht (1990) used for a Gaussian kernel estimator for the conditional class densities.
For S4
method information, see pnnCMA-methods.
pnnCMA(X, y, f, learnind, sigma = 1,models=FALSE)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. For this method, this
must not be |
sigma |
Standard deviation of the Gaussian Kernel used. This hyperparameter should be tuned, s. |
models |
a logical value indicating whether the model object shall be returned |
An object of class cloutput
.
There is actually no strong relation of this method to Feed-Forward
Neural Networks, s. nnetCMA
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Specht, D.F. (1990).
Probabilistic Neural Networks. Neural Networks, 3, 109-118.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression from first 10 genes
golubX <- as.matrix(golub[,2:11])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run PNN
pnnresult <- pnnCMA(X=golubX, y=golubY, learnind=learnind, sigma = 3)
### show results
show(pnnresult)
ftable(pnnresult)
plot(pnnresult)
Probabilistic Neural Networks is the term Specht (1990) used for a Gaussian kernel estimator for the conditional class densities.
signature 1
signature 2
signature 3
signature 4
For references, further argument and output information, consult
pnnCMA
.
This method constructs the given classifier using the specified training data, gene selection and tuning results. Subsequently, class labels are predicted for new observations.
For S4 method information, s. classification-methods
.
prediction(X.tr,y.tr,X.new,f,classifier,genesel,models=F,nbgene,tuneres,...)
X.tr |
Training gene expression data. Can be one of the following:
|
X.new |
gene expression data. Can be one of the following:
|
y.tr |
Class labels of training observation. Can be one of the following:
WARNING: The class labels will be re-coded for classifier construction to
range from |
f |
A two-sided formula, if |
genesel |
Optional (but usually recommended) object of class
|
nbgene |
Number of best genes to be kept for classification, based
on either
|
classifier |
Name of function ending with |
tuneres |
Analogous to the argument |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments passed to the function |
This function builds the specified classifier and predicts the class labels of new observations. Hence, its usage differs from that of most other prediction functions in R.
An object of class predoutput-class
; Predicted classes can be seen by show(predoutput)
Christoph Bernau [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
GeneSelection
, tune
, evaluation
,
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
classification
### a simple k-nearest neighbour example
### datasets
## Not run: plot(x)
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
### splitting data into training and test set
X.tr <- golubX[1:30,]
X.new <- golubX[31:39,]
y.tr <- golubY[1:30]
### 1. GeneSelection
selttest <- GeneSelection(X=X.tr, y=y.tr, method = "t.test")
### 2. tuning
tunek <- tune(X.tr, y.tr, genesel = selttest, nbgene = 20, classifier = knnCMA)
### 3. classification
pred <- prediction(X.tr=X.tr, y.tr=y.tr, X.new=X.new, genesel = selttest, tuneres = tunek, nbgene = 20, classifier = knnCMA)
### show and analyze results:
show(pred)
## End(Not run)
Performs prediction for the following signatures:
signature 1
signature 2
signature 3
For further argument and output information, consult
classification
.
Object returned by the function prediction
Xnew
:Gene Expression matrix of new observations
yhat
:Predicted class labels for the new data.
model
:List containing the constructed classifier.
Returns predicted class labels for the new data.
Christoph Bernau [email protected]
Anne-Laure Boulesteix [email protected]
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
Performs a quadratic discriminant analysis under the assumption
of a multivariate normal distribution in each class without restriction
concerning the covariance matrices. The function qda
from
the package MASS
is called for computation.
For S4
method information, see qdaCMA-methods.
qdaCMA(X, y, f, learnind,models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments to be passed to |
An object of class cloutput
.
Extensive variable selection usually has to be performed before
qdaCMA
can be applied in the p > n
setting.
Not reducing the number of variables can result in an error
message.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
McLachlan, G.J. (1992).
Discriminant Analysis and Statistical Pattern Recognition.
Wiley, New York
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression from first 3 genes
golubX <- as.matrix(golub[,2:4])
### select learningset
ratio <- 2/3
set.seed(112)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run QDA
qdaresult <- qdaCMA(X=golubX, y=golubY, learnind=learnind)
### show results
show(qdaresult)
ftable(qdaresult)
plot(qdaresult)
### multiclass example:
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression from first 4 genes
khanX <- as.matrix(khan[,2:5])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(ratio*length(khanY)))
### run QDA
qdaresult <- qdaCMA(X=khanX, y=khanY, learnind=learnind)
### show results
show(qdaresult)
ftable(qdaresult)
plot(qdaresult)
Performs a quadratic discriminant analysis under the assumption
of a multivariate normal distribution in each class without restriction
concerning the covariance matrices. The function qda
from
the package MASS
is called for computation.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
qdaCMA
.
Random Forests were proposed by Breiman (2001)
and are implemented in the package randomForest
.
In this package, they can also be used to rank variables
according to their importance, s. GeneSelection
.
For S4
method information, see rfCMA-methods
rfCMA(X, y, f, learnind, varimp = TRUE, seed = 111, models=FALSE,type=1,scale=FALSE,importance=TRUE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
varimp |
Should additional information for variable selection be provided ? Defaults to |
seed |
Fix Random number generator seed to |
models |
a logical value indicating whether the model object shall be returned |
type |
Parameter passed to function |
scale |
Parameter passed to function |
importance |
Parameter passed to function |
... |
Further arguments to be passed to |
If varimp
, then an object of class clvarseloutput
is returned,
otherwise an object of class cloutput
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Breiman, L. (2001)
Random Forests.
Machine Learning, 45:5-32.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
,
scdaCMA
, shrinkldaCMA
, svmCMA
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression
khanX <- as.matrix(khan[,-1])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(2/3*length(khanY)))
### run random Forest
#rfresult <- rfCMA(X=khanX, y=khanY, learnind=learnind, varimp = FALSE)
### show results
#show(rfresult)
#ftable(rfresult)
#plot(rfresult)
Random Forests were proposed by Breiman (2001)
and are implemented in the package randomForest
.
In this package, they can also be used to rank variables
according to their importance, s. GeneSelection
.
signature 1
signature 2
signature 3
signature 4
For references, further argument and output information, consult
rfCMA
The empirical Receiver Operating Characteristic (ROC) is widely used for the evaluation of diagnostic tests, but also for the evaluation of classifiers. In this implementation, it can only be used for the binary classification case. The inputs are a numeric vector of class probabilities (which play the role of a test result) and the true class labels. Note that misclassification performance can differ (sometimes substantially) from the area under the ROC curve (AUC). This is due to the fact that misclassification rates are always computed for the threshold 'probability = 0.5'.
object |
An object of |
plot |
Should the ROC curve be plotted ? Default is |
... |
Argument to specifiy further graphical options. |
The empirical area under the curve (AUC).
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
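A minimal sketch, assuming the generic documented here is called roc and is applied to a cloutput object from a probabilistic binary classifier:

data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
set.seed(111)
learnind <- sample(length(golubY), size=floor(2/3*length(golubY)))
### PLS followed by ridge-penalized logistic regression yields class probabilities
result <- pls_lrCMA(X=golubX, y=golubY, learnind=learnind)
### plot the ROC curve and return the empirical AUC
roc(result)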
The nearest shrunken centroid classification algorithm is described in detail in Tibshirani et al. (2002).
It is widely known under the name PAM (prediction analysis for microarrays),
which can also be found in the package pamr
.
For S4
method information, see scdaCMA-methods.
scdaCMA(X, y, f, learnind, delta = 0.5, models=FALSE,...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
delta |
The shrinkage intensity for the class centroids -
a hyperparameter that must be tuned. The default
|
models |
a logical value indicating whether the model object shall be returned |
... |
Currently unused argument. |
An object of class cloutput
.
The results can differ from those obtained by
using the package pamr
.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G., (2003).
Class prediction by nearest shrunken centroids with applications to DNA microarrays.
Statistical Science, 18, 104-117
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
shrinkldaCMA
, svmCMA
### load Khan data
data(khan)
### extract class labels
khanY <- khan[,1]
### extract gene expression
khanX <- as.matrix(khan[,-1])
### select learningset
set.seed(111)
learnind <- sample(length(khanY), size=floor(2/3*length(khanY)))
### run Shrunken Centroids classifier, without tuning
scdaresult <- scdaCMA(X=khanX, y=khanY, learnind=learnind)
### show results
show(scdaresult)
ftable(scdaresult)
plot(scdaresult)
The nearest shrunken centroid classification algorithm is described in detail in Tibshirani et al. (2002).
It is widely known under the name PAM (prediction analysis for microarrays),
which can also be found in the package pamr
.
signature 1
signature 2
signature 3
signature 4
For references, further argument and output information, consult
scdaCMA
.
Linear Discriminant Analysis combined with the James-Stein-Shrinkage approach of Schaefer and Strimmer (2005) for the covariance matrix.
Currently still an experimental version.
For S4
method information, see shrinkldaCMA-methods
shrinkldaCMA(X, y, f, learnind, models=FALSE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments to be passed to |
An object of class cloutput
.
This is still an experimental version.
Covariance shrinkage is performed by calling functions
from the package corpcor
.
Variable selection is not necessary.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Schaefer, J., Strimmer, K. (2005).
A shrinkage approach to large-scale covariance estimation and implications for functional genomics.
Statistical Applications in Genetics and Molecular Biology, 4:32.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, svmCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run shrinkage-LDA
result <- shrinkldaCMA(X=golubX, y=golubY, learnind=learnind)
### show results
show(result)
ftable(result)
plot(result)
Linear Discriminant Analysis combined with the James-Stein-Shrinkage approach of Schaefer and Strimmer (2005) for the covariance matrix.
Currently still an experimental version.
For S4
method information, see shrinkldaCMA-methods
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
shrinkldaCMA
.
This method principally does nothing more than
applying the pre-implemented summary()
function to the slot score
of an object of class evaloutput
. One then obtains the usual
five-point-summary, consisting of minimum and maximum, lower and upper quartile
and the median. Additionally, the mean is also shown.
object |
An object of class |
... |
Further arguments passed to the pre-implemented
|
No return.
Note that the results normally differ for different evaluation schemes ("iterationwise" or "observationwise").
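A minimal sketch of a typical call, under the assumption that the evaloutput object has been produced by evaluation() in the usual workflow; the measure and scheme names follow the documentation of evaluation:

### res: a list of cloutput objects returned by classification()
ev <- evaluation(res, measure = "misclassification", scheme = "iterationwise")
### five-point summary plus mean of the iterationwise misclassification rates
summary(ev)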
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Calls the function svm
from the package e1071
that provides an interface to the award-winning LIBSVM routines.
For S4
method information, see svmCMA-methods
svmCMA(X, y, f, learnind, probability, models=FALSE,seed=341,...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
f |
A two-sided formula, if |
learnind |
An index vector specifying the observations that
belong to the learning set. May be |
probability |
logical indicating whether the model should allow for probability predictions. |
seed |
Fix random number generator for reproducibility. |
models |
a logical value indicating whether the model object shall be returned |
... |
Further arguments to be passed to |
An object of class cloutput
.
Contrary to the default settings in e1071:::svm
, the used
kernel is a linear kernel, which has turned out to be a better
default setting in the 'small sample size, large number of predictors' situation,
because additional nonlinearity is mostly not necessary there. It
additionally avoids the tuning of a further kernel parameter gamma
,
s. help of the package e1071
for details.
Nevertheless, hyperparameter tuning concerning the parameter cost
must
usually be performed to obtain reasonable results, s. tune
.
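A minimal sketch of such a tuning step (golubX and golubY as in the example below); the grid is illustrative, while the package default grid for the linear kernel is cost = c(0.1, 1, 5, 10, 50, 100, 500), and the gene selection settings mirror the documented tune example:

set.seed(111)
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
tunesvm <- tune(X = golubX, y = golubY, learningsets = lset,
                genesellist = list(method = "t.test"), nbgene = 100,
                classifier = svmCMA, probability = TRUE,
                grids = list(cost = c(0.1, 1, 10, 100)))
show(tunesvm)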
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Boser, B., Guyon, I., Vapnik, V. (1992)
A training algorithm for optimal margin classifiers.
Proceedings of the fifth annual workshop on Computational learning theory, pages 144-152, ACM Press.
Chang, Chih-Chung and Lin, Chih-Jen : LIBSVM: a library for Support Vector Machines http://www.csie.ntu.edu.tw/~cjlin/libsvm
Schoelkopf, B., Smola, A.J. (2002)
Learning with kernels.
MIT Press, Cambridge, MA.
compBoostCMA
, dldaCMA
, ElasticNetCMA
,
fdaCMA
, flexdaCMA
, gbmCMA
,
knnCMA
, ldaCMA
, LassoCMA
,
nnetCMA
, pknnCMA
, plrCMA
,
pls_ldaCMA
, pls_lrCMA
, pls_rfCMA
,
pnnCMA
, qdaCMA
, rfCMA
,
scdaCMA
, shrinkldaCMA
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression
golubX <- as.matrix(golub[,-1])
### select learningset
ratio <- 2/3
set.seed(111)
learnind <- sample(length(golubY), size=floor(ratio*length(golubY)))
### run untuned linear SVM
svmresult <- svmCMA(X=golubX, y=golubY, learnind=learnind, probability=TRUE)
### show results
show(svmresult)
ftable(svmresult)
plot(svmresult)
Calls the function svm
from the package e1071
that provides an interface to the award-winning LIBSVM routines.
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
svmCMA
.
This is a convenient method to get quick
access to the most important variables, based
on the result of a call to GeneSelection
.
toplist(object, k = 10, iter = 1, show = TRUE, ...)
object |
An object of |
k |
Number of top genes for which information should be displayed. Defaults to 10. |
iter |
Iteration number (
show |
Should the results be printed ? Default is |
... |
Currently unused argument. |
The type of output depends on the gene selection scheme. For
the multiclass case, if gene selection has been run with
the "pairwise"
or "one-vs-all"
scheme, then the
output will be a list of data.frames
, each containing
the gene indices plus variable importance for the top k
genes. The list elements are named according to the binary
scenarios (e.g., 1 vs. 3
).
Otherwise, a single data.frame
is returned.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
genesel
, GeneSelection
, plot,genesel-method
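A minimal sketch of a typical call; the gene selection method and the value of k are illustrative:

data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
### rank genes with the ordinary t-test
gsel <- GeneSelection(X = golubX, y = golubY, method = "t.test")
### display the ten top-ranked genes of the first iteration
toplist(gsel, k = 10, iter = 1)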
Most classifiers implemented in this package depend on one
or even several hyperparameters (s. details) that should be optimized
to obtain good (and comparable!) results. As a tuning scheme, we propose
three-fold cross-validation on each learningset
(for fixed selected
variables). Note that learningsets
usually do not contain the
complete dataset, so tuning involves a second level of splitting the dataset.
Increasing the number of folds leads to larger datasets (and possibly to higher accuracy),
but also to higher computing times.
For S4 method information, s. tune-methods
tune(X, y, f, learningsets, genesel, genesellist = list(), nbgene, classifier, fold = 3, strat = FALSE, grids = list(), trace = TRUE, ...)
X |
Gene expression data. Can be one of the following:
|
y |
Class labels. Can be one of the following:
|
f |
A two-sided formula, if |
learningsets |
An object of class |
genesel |
Optional (but usually recommended) object of class
|
genesellist |
In the case that the argument |
nbgene |
Number of best genes to be kept for classification, based
on either
|
classifier |
Name of function ending with |
fold |
The number of cross-validation folds used within each |
strat |
Should stratified cross-validation according to the class proportions
in the complete dataset be used ? Default is |
grids |
A named list. The names correspond to the arguments to be tuned,
e.g. |
trace |
Should progress be traced ? Default is |
... |
Further arguments to be passed to |
The following default settings are used if the argument grids
is an empty list:
gbmCMA
n.trees = c(50, 100, 200, 500, 1000)
compBoostCMA
mstop = c(50, 100, 200, 500, 1000)
LassoCMA
norm.fraction = seq(from=0.1, to=0.9, length=9)
ElasticNetCMA
norm.fraction = seq(from=0.1, to=0.9, length=5), alpha = 2^{-(5:1)}
plrCMA
lambda = 2^{-4:4}
pls_ldaCMA
comp = 1:10
pls_lrCMA
comp = 1:10
pls_rfCMA
comp = 1:10
rfCMA
mtry = ceiling(c(0.1, 0.25, 0.5, 1, 2)*sqrt(ncol(X))), nodesize = c(1,2,3)
knnCMA
k=1:10
pknnCMA
k = 1:10
scdaCMA
delta = c(0.1, 0.25, 0.5, 1, 2, 5)
pnnCMA
sigma = c(2^{-2:2})
nnetCMA
size = 1:5, decay = c(0, 2^{-(4:1)})
svmCMA
, kernel = "linear"
cost = c(0.1, 1, 5, 10, 50, 100, 500)
svmCMA
, kernel = "radial"
cost = c(0.1, 1, 5, 10, 50, 100, 500), gamma = 2^{-2:2}
svmCMA
, kernel = "polynomial"
cost = c(0.1, 1, 5, 10, 50, 100, 500), degree = 2:4
An object of class tuningresult
The computation time can be enormously high. Note that for each different
learningset
, the classifier must be trained (fold times the number of possible
hyperparameter combinations) times.
E.g. if the number of the learningsets is fifty, fold = 3
and
two hyperparameters (each with 5 candidate values) are tuned, 50x3x25=3750
training iterations are necessary !
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Christoph Bernau [email protected]
Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439
tuningresult, GeneSelection, classification
## Not run: 
### simple example for a one-dimensional grid, using compBoostCMA.
### dataset
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
### learningsets
set.seed(111)
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
### tuning after gene selection with the t.test
tuneres <- tune(X = golubX, y = golubY, learningsets = lset,
                genesellist = list(method = "t.test"),
                classifier = compBoostCMA, nbgene = 100,
                grids = list(mstop = c(50, 100, 250, 500, 1000)))
### inspect results
show(tuneres)
best(tuneres)
plot(tuneres, iter = 3)
## End(Not run)
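For illustration only (not part of the original examples): a user-defined grid can be supplied via the grids argument. The sketch below reuses golubX, golubY and lset from above and tunes svmCMA with a radial kernel (the kernel argument is passed on to the classifier via '...'; the grid values are arbitrary).

## Not run: 
### hypothetical sketch: user-defined two-dimensional grid for svmCMA
tuneres.svm <- tune(X = golubX, y = golubY, learningsets = lset,
                    genesellist = list(method = "t.test"), nbgene = 100,
                    classifier = svmCMA, kernel = "radial",
                    grids = list(cost = c(1, 10, 100), gamma = 2^(-1:1)))
show(tuneres.svm)
## End(Not run)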
Performs hyperparameter tuning for the following signatures:
signature 1
signature 2
signature 3
signature 4
For further argument and output information, consult
tune
.
Object returned by the function tune
hypergrid
:A data.frame
representing the
grid of values that were tried and
evaluated. The number of columns equals
the number of tuned hyperparameters
and the number of rows equals the number
of all possible combinations of the
discrete grids.
tuneres
:A list whose length equals the number of different learningsets for which tuning has been performed and whose elements are numeric vectors with length equal to the number of rows of hypergrid (see above), containing the misclassification rate belonging to the respective hyperparameter/hyperparameter combination. In order to get an overview of the best hyperparameter/hyperparameter combination, use the convenience method
best
method
:Name of the classifier that has been tuned.
fold
:Number of cross-validation folds used for
tuning, s. argument of the same name in
tune
Use show(tuningresult-object)
for brief information.
Use best(tuningresult-object)
to see which
hyperparameter/hyperparameter combination has performed
best in terms of the misclassification rate,
see best,tuningresult-method.
Use plot(tuningresult-object, iter, which)
to display the performance of hyperparameter/hyperparameter
combinations graphically, either as a one-dimensional or a two-dimensional (contour) plot, see plot,tuningresult-method.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
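For orientation, a minimal sketch of how a tuningresult object is typically inspected (the objects tuneres and tuneres2 are hypothetical, e.g. as returned by tune; it is assumed here that which names the tuned hyperparameters):

## Not run: 
### 'tuneres': tuning of a single hyperparameter (hypothetical object)
show(tuneres)
best(tuneres)
plot(tuneres, iter = 3)
### 'tuneres2': tuning of two hyperparameters, e.g. size and decay of nnetCMA
### (hypothetical object); 'which' selects the hyperparameters to display
plot(tuneres2, iter = 1, which = c("size", "decay"))
## End(Not run)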
An object returned by the functions described in filter
,
usually not created directly by the user.
varsel
:numeric
vector of variable importance measures, e.g.
absolute values of genewise statistics.
No methods are currently defined.
Martin Slawski [email protected]
Anne-Laure Boulesteix [email protected]
Performs subsampling for several classifiers or a single classifier with different tuning parameter values or numbers of selected genes. Subsequently, a specific procedure for correcting for the tuning or selection bias, which is caused by the optimal selection of classifiers or tuning parameters, is applied.
weighted.mcr(classifiers, parameters, nbgenes, sel.method, X, y, portion, niter = 100, shrinkage = F)
classifiers |
A character vector of the several CMA classifiers that shall be used. If the same classifier shall be used with different tuning parameters it must appear several times in this vector. |
parameters |
A character vector containing the tuning parameter values
corresponding to the classification methods in |
nbgenes |
A numeric vector indicating how many variables
shall be selected by |
sel.method |
The CMA-method (represented as a string) that shall be applied for variable
selection. If this parameter is set to |
X |
The matrix of gene expression data. Can be one of the following. Rows correspond to observations, columns to variables. |
y |
Class labels. Can be one of the following:
WARNING: The class labels will be re-coded to
range from |
portion |
A numeric value which indicates the portion of observations that will be used for training the classifiers. |
niter |
The number of subsampling iterations. |
shrinkage |
A logical value indicating whether shrinkage (WMCS) shall be applied. |
The algorithm tries to avoid the additional computational cost of a nested cross-validation by estimating the corrected misclassification rate of the best classifier as a weighted mean over all classifiers included in the subsampling approach.
An object of class wmcr.result
which provides the
corrected and uncorrected misclassification rate of the best
classifier as well as weights and misclassification rates for all
classifiers used in the subsampling approach.
Christoph Bernau [email protected]
Anne-Laure Boulesteix [email protected]
Bernau Ch., Augustin, Th. and Boulesteix, A.-L. (2011): Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation. Department of Statistics: Technical Reports, Nr. 105.
wmc, classification, GeneSelection, tune, evaluation
#inputs
classifiers <- rep('knnCMA', 7)
nbgenes <- rep(50, 7)
parameters <- c('k=1','k=3','k=5','k=7','k=9','k=11','k=13')
portion <- 0.8
niter <- 100
data(golub)
X <- as.matrix(golub[,-1])
y <- golub[,1]
sel.method <- 't.test'
#function call
wmcr <- weighted.mcr(classifiers=classifiers, parameters=parameters, nbgenes=nbgenes,
                     sel.method=sel.method, X=X, y=y, portion=portion, niter=niter)
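For illustration (not part of the original example): the returned wmcr.result object can be inspected with show; setting shrinkage = TRUE yields the shrinkage-corrected variant (WMCS).

#inspect the result and, if desired, repeat with shrinkage (WMCS)
show(wmcr)
wmcs <- weighted.mcr(classifiers=classifiers, parameters=parameters, nbgenes=nbgenes,
                     sel.method=sel.method, X=X, y=y, portion=portion, niter=niter,
                     shrinkage=TRUE)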
Performs tuning / selection bias correction in subsampling for the following signatures:
signature 1
signature 2
signature 3
For further argument and output information, consult
weighted.mcr
.
Performs tuning / selection bias correction for a matrix of subsampling fold errors.
wmc(mcr.m, n.tr, n.ts, shrinkage = F)
mcr.m |
A matrix of resampling fold errors. Columns correspond to the fold errors of a single classifier. |
n.tr |
Number of observations in the resampling training sets. |
n.ts |
Number of observations in the resampling test sets. |
shrinkage |
A logical value indicating whether shrinkage (WMCS) shall be applied. |
The algorithm tries to avoid the additional computational cost of a nested cross-validation by estimating the corrected misclassification rate of the best classifier as a weighted mean over all classifiers included in the subsampling approach.
A list containing the corrected misclassification rate, the index of the best method and a logical value indicating whether shrinkage has been applied.
Christoph Bernau [email protected]
Anne-Laure Boulesteix [email protected]
Bernau Ch., Augustin, Th. and Boulesteix, A.-L. (2011): Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation. Department of Statistics: Technical Reports, Nr. 105.
weighted.mcr, classification, GeneSelection, tune, evaluation
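No packaged example is given here; the following hypothetical sketch applies wmc to an artificial matrix of subsampling fold errors (all numbers are invented for illustration; 31 training and 7 test observations per iteration are assumed).

## Not run: 
### hypothetical sketch: each column holds the fold errors of one candidate classifier
set.seed(222)
niter <- 100
mcr.m <- cbind(rbinom(niter, 7, 0.15)/7,
               rbinom(niter, 7, 0.20)/7,
               rbinom(niter, 7, 0.25)/7)
wmcres <- wmc(mcr.m = mcr.m, n.tr = 31, n.ts = 7, shrinkage = FALSE)
wmcres
## End(Not run)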
Performs tuning / selection bias correction for a matrix of subsampling fold errors for the following signature:
signature 1
For further argument and output information, consult
wmc
.
Object returned by function weighted.mcr
.
corrected.mcr
:The corrected misclassification rate for the best method.
best.method
:The method which performed best in the subsampling approach.
mcrs
:Misclassification rates of all classifiers used in the subsampling approach.
weights
:The weights used for the different classifiers in the correction method.
cov
:Estimated covariance matrix for the misclassification rates of the different classifiers.
uncorrected.mcr
:The uncorrected misclassification rate of the best method.
ranges
:Minimum and maximum mean misclassification rates as well as the theoretical bound for nested cross-validation (averaging over foldwise minima or maxima, respectively).
mcr.m
:Matrix of resampling fold errors; columns correspond to the fold errors of a single classifier.
shrinkage
:A logical value indicating whether shrinkage (WMCS) has been applied.
Use show(wmcr.result-object)
for brief information.
Christoph Bernau [email protected]
Anne-Laure Boulesteix [email protected]
weighted.mcr