Title: Uniform interfaces to R machine learning procedures for data in Bioconductor containers
Description: This package provides uniform interfaces to machine learning code for data in R and Bioconductor containers.
Authors: Vincent Carey [cre, aut], Jess Mar [aut], Jason Vertrees [ctb], Laurent Gatto [ctb], Phylis Atieno [ctb] (translated vignettes from Sweave to Rmarkdown/HTML)
Maintainer: Vincent Carey <[email protected]>
License: LGPL
Version: 1.87.0
Built: 2024-10-31 06:26:49 UTC
Source: https://github.com/bioc/MLInterfaces
generate a partition function for cross-validation, where the partitions are approximately balanced with respect to the distribution of a response variable
balKfold.xvspec(K)
K: number of partitions to be computed
This function returns a closure; the symbol K is bound in the environment of the returned function. The value is a function that can be used as a partitionFunc for passage to xvalSpec.
VJ Carey <[email protected]>
## The function is currently defined as
function (K) function(data, clab, iternum) {
    clabs <- data[[clab]]
    narr <- nrow(data)
    cnames <- unique(clabs)
    ilist <- list()
    for (i in 1:length(cnames)) ilist[[cnames[i]]] <- which(clabs == cnames[i])
    clens <- lapply(ilist, length)
    nrep <- lapply(clens, function(x) ceiling(x/K))
    grpinds <- list()
    for (i in 1:length(nrep)) grpinds[[i]] <- rep(1:K, nrep[[i]])[1:clens[[i]]]
    (1:narr)[-which(unlist(grpinds) == iternum)]
}
# try it out
library("MASS")
data(crabs)
p1c = balKfold.xvspec(5)
inds = p1c(crabs, "sp", 3)
table(crabs$sp[inds])
inds2 = p1c(crabs, "sp", 4)
table(crabs$sp[inds2])
allc = 1:200
# are test sets disjoint?
intersect(setdiff(allc, inds), setdiff(allc, inds2))
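A minimal sketch of the intended use (the choice of 5 folds is illustrative; as noted under xvalSpec below, K here and niter there must agree):

# pass the closure as the partition function for 5-fold cross-validation
xv5 <- xvalSpec("LOG", 5, balKfold.xvspec(5))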
The clinical characteristics table of https://doi.org/10.1016/j.cell.2013.09.034 (supplemental table S7) was aligned with the GBM samples in curatedTCGAData (selecting GBM, version 2.0.1, with curatedTCGAData 1.17.0).
data("brennan_2013_tabS7exc")
A data frame with 158 observations on the following 16 variables.
Case_ID: a character vector
Secondary_or_Recurrent: a character vector
Age_at_Procedure: a numeric vector
Gender: a character vector
Path_Dx: a character vector
MGMT_Status: a character vector
Methylation_Class_2012: a character vector
G_CIMP_methylation: a character vector
IDH1_status: a character vector
Expression_Subclass: a character vector
Therapy_Class: a character vector
Vital_Status: a character vector
OS_days: a numeric vector
Progression_Status: a character vector
PFS_days: a numeric vector
V16: a logical vector
Simple intersection on Case_ID in supplemental table S7 with patientID in the GBM data from curatedTCGAData.
https://doi.org/10.1016/j.cell.2013.09.034
The Somatic Genomic Landscape of Glioblastoma by Cameron W. Brennan, Roel G.W. Verhaak, Aaron McKenna, and others, Cell Oct 10 2013.
data(brennan_2013_tabS7exc)
head(brennan_2013_tabS7exc)
This class summarizes the output values from different classifiers.
Objects are typically created during the application of a supervised machine learning algorithm to data and are the value returned. It is very unlikely that any user would create such an object by hand.
testOutcomes: Object of class "factor" that lists the actual outcomes for the records in the test set.
testPredictions: Object of class "factor" that lists the predicted outcomes for the test set.
testScores: Object of class "ANY" – this element will include matrices, vectors, or arrays with information that is typically related to the posterior probability of occupancy of the predicted class or of all classes. The actual contents of this slot can be determined by inspecting the converter element of the learnerSchema used to select the model.
trainOutcomes: Object of class "factor" that lists the actual outcomes for records in the training set.
trainPredictions: Object of class "factor" that lists the predicted outcomes for the training set.
trainScores: Object of class "ANY"; see the description of testScores above – the same information is returned, but applicable to the training set records.
trainInd: Object of class "numeric" giving the indices of the data used for training.
RObject: Object of class "ANY" – when the trainInd parameter of the MLearn call is numeric, this slot holds the return value of the underlying R function that carried out the predictive modeling. For example, if rpartI was used as the MLearn method, RObject holds an instance of the rpart S3 class, and plot and text methods can be applied to it. When the trainInd parameter of the MLearn call is an instance of xvalSpec, this slot holds a list of results of the cross-validatory iterations. Each element of this list has two elements: test.idx, giving the numeric indices of the test cases for the associated cross-validation iteration, and mlans, the classifierOutput for that iteration. See the example for an illustration of 'digging out' the predicted probabilities associated with each cross-validation iteration executed through an xvalSpec specification.
embeddedCV: logical value that is TRUE if the procedure in use performs its own cross-validation.
fsHistory: list of features selected through the cross-validation process.
learnerSchema: propagation of the learner schema object used in the call.
call: Object of class "call" – records the call used to generate the classifierOutput.
signature(obj = "classifierOutput"): compute the confusion matrix for the test records.
signature(obj = "classifierOutput"): compute the confusion matrix for the training set; this typically yields optimistically biased information on the misclassification rate.
signature(obj = "classifierOutput"): the R object returned by the underlying classifier. This can then be passed on to specific methods for those objects, when they exist.
signature(obj = "classifierOutput"): returns the indices of the data used for training.
signature(object = "classifierOutput"): a print method that provides a summary of the output of the classifier.
signature(object = "classifierOutput"): print the predicted classes for each sample/individual. The predictions for the training set are the training outcomes.
signature(object = "classifierOutput", t = "numeric"): print the predicted classes for each sample/individual with a testScore greater than or equal to t. The predictions for the training set are the training outcomes. Non-predicted cases and cases that match multiple classes are returned as NAs.
signature(object = "classifierOutput"): returns the score of the predicted class for each sample/individual. The scores for the training set are set to 1.
signature(object = "classifierOutput"): returns the prediction scores for all classes for each sample/individual. The scores for the training set are set to 1 for the appropriate class, 0 otherwise.
signature(object = "classifierOutput"): ...
signature(object = "classifierOutput"): print the predicted classes for each sample/individual in the test set.
signature(object = "classifierOutput", t = "numeric"): print the predicted classes for each sample/individual in the test set with a testScore greater than or equal to t. Non-predicted cases and cases that match multiple classes are returned as NAs.
signature(object = "classifierOutput"): ...
signature(object = "classifierOutput"): print the predicted classes for each sample/individual in the training set.
signature(object = "classifierOutput", t = "numeric"): print the predicted classes for each sample/individual in the training set with a testScore greater than or equal to t. Non-predicted cases and cases that match multiple classes are returned as NAs.
signature(object = "classifierOutput"): ...
V. Carey
showClass("classifierOutput")
library(golubEsets)
data(Golub_Train)
# now cross-validate a neural net
set.seed(1234)
xv5 = xvalSpec("LOG", 5, balKfold.xvspec(5))
m2 = MLearn(ALL.AML~., Golub_Train[1000:1050,], nnetI, xv5,
    size=5, decay=.01, maxit=1900)
testScores(RObject(m2)[[1]]$mlans)
alls = lapply(RObject(m2), function(x) testScores(x$mlans))
container for clustering outputs in uniform structure
Objects can be created by calls of the form new("clusteringOutput", ...).
partition: Object of class "integer", labels for observations as clustered.
silhouette: Object of class "silhouette", a structure from Rousseeuw's cluster package measuring cluster membership strength per observation.
prcomp: Object of class "prcompObj", a wrapped instance of the stats package prcomp output.
call: Object of class "call", for auditing.
learnerSchema: Object of class "learnerSchema", a formal object indicating the package, function, and other attributes of the clustering algorithm employed to generate this object.
RObject: Object of class "ANY", the unaltered output of the function called according to learnerSchema.
converter: converter propagated from the call.
distFun: distFun propagated from the call.
signature(x = "clusteringOutput"): extract the unaltered output of the R function or method called according to learnerSchema.
signature(x = "clusteringOutput", y = "ANY"): a 4-panel plot showing features of the clustering, including the scree plot for a principal components transformation and a display of the partition in the PC1 x PC2 plane. For a clustering method that does not have a native plot procedure, such as kmeans, the parameter y should be bound to a data frame or matrix with feature data for all records; an image plot of robust feature z-scores (z = (x - median(x))/mad(x)) and the cluster indices is produced in the northwest panel.
signature(object = "clusteringOutput"): concise report.
VJ Carey <[email protected]>
showClass("clusteringOutput")
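A brief sketch of the plot method for a learner without a native plot procedure (kmeans via kmeansI, with y bound to feature data; this mirrors usage in the MLearn examples later in this manual):

library(MASS)
data(crabs)
cl2 = MLearn(~CW+RW+CL+FL+BD, data=crabs, kmeansI, centers=5, algorithm="Hartigan-Wong")
plot(cl2, crabs[,-c(1:3)])  # y supplies the feature data for the image panel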
This function computes the confusion matrix for a classifier's output.

Typically, an instance of class "classifierOutput" is built on a training subset of the input data. The model is then used to predict the class of samples in the test set. When the true class labels for the test set are available, the confusion matrix is the cross-tabulation of the true labels of the test set against the predictions from the classifier. An optional t score threshold can also be specified.

For instances of classifierOutput, it is possible to specify the type of confusion matrix desired. The default is test, which tabulates classes from the test set against the associated predictions. If type is train, the training class vector is tabulated against the predictions on the training set. An optional t score threshold can also be specified.

For instances of classifierOutput, it is possible to specify the minimum score classification threshold. Cases with a score less than the threshold are classified as NA in the train or test confusion matrix.
library(golubEsets)
data(Golub_Merge)
smallG <- Golub_Merge[101:150,]
k1 <- MLearn(ALL.AML~., smallG, knnI(k=1), 1:30)
confuMat(k1)
confuMat(k1, "train")
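The t threshold described above is not exercised in the example; a short sketch follows (assuming smallG from above, and k=3 so that vote-based scores can fall below 1):

k3 <- MLearn(ALL.AML~., smallG, knnI(k=3), 1:30)
confuMat(k3)          # all test cases
confuMat(k3, t=0.9)   # cases scoring below 0.9 are tabulated as NA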
Given an n by n confusion matrix, the function returns a list of n 2-by-2 tables with the true positives, true negatives, false positives and false negatives for each of the initial classes.
confuTab(obj, naAs0. = FALSE)
obj: an instance of class table (a confusion matrix).
naAs0.: a logical (default FALSE) defining whether NA values should be replaced by 0.
A list of length nrow(obj), with names rownames(obj).
Laurent Gatto <[email protected]>
See also the tp, tn, fp and fn methods to extract the respective classification outcomes from a contingency matrix.
## the confusion matrix
cm <- table(iris$Species, sample(iris$Species))
## the 3 confusion tables
(ct <- confuTab(cm))
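A short follow-up sketch of the naAs0. argument described above (any NA cells in the derived 2-by-2 tables are reported as 0):

## as above, but with NA cells replaced by 0
ct0 <- confuTab(cm, naAs0. = TRUE)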
support for feature selection in cross-validation
fs.absT(N)
fs.probT(p)
fs.topVariance(p)
N: number of features to retain; features are ordered by descending value of abs(two-sample t statistic), and the top N are used.
p: cumulative probability (in (0,1)) in the distribution of absolute t statistics above which we retain features.
This function returns a function that will be used as a parameter to xvalSpec in applications of MLearn. The returned function will itself return a formula consisting of the selected features for application of MLearn.
The functions fs.absT and fs.probT are two examples of approaches to embedded feature selection that make sense for two-sample prediction problems. For selection based on linear models or other discrimination measures, you will need to create your own selection helper, following the code in these functions as examples.

fs.topVariance performs non-specific feature selection based on the variance. Argument p is the variance percentile beneath which features are discarded; a companion sketch follows the example below.
VJ Carey <[email protected]>
library("MASS")
data(crabs)
# we will demonstrate this procedure with the crabs data.
# first, create the closure to pick 3 features
demFS = fs.absT(3)
# run it on the entire dataset with features excluding sex
demFS(sp~.-sex, crabs)
# emulate cross-validation by excluding last 50 records
demFS(sp~.-sex, crabs[1:150,])
# emulate cross-validation by excluding first 50 records -- different features retained
demFS(sp~.-sex, crabs[51:200,])
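The companion sketch for fs.topVariance (p = 0.5 is illustrative, discarding the lower half of the features by variance):

# create the closure, then emulate its use on the full crabs data
demVar = fs.topVariance(0.5)
demVar(sp~.-sex, crabs)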
extract history of feature selection for a cross-validated machine learner
fsHistory(x)
x: an instance of classifierOutput.
Returns a list of the names of the selected features; the names of the variables are made 'syntactic'.
Vince Carey <[email protected]>
data(iris)
iris2 = iris[ iris$Species %in% levels(iris$Species)[1:2], ]
iris2$Species = factor(iris2$Species) # drop unused levels
x1 = MLearn(Species~., iris2, ldaI,
    xvalSpec("LOG", 3, balKfold.xvspec(3), fs.absT(3)))
fsHistory(x1)
shiny-oriented GUI for cluster or classifier exploration
hclustWidget(mat, featureName = "feature",
    title = paste0("hclustWidget for ", deparse(substitute(mat))),
    minfeats = 2, auxdf = NULL)
mlearnWidget(eset, infmla)
mat: matrix with feature vectors in rows.
featureName: name to be used for the control that asks for the number of features to use.
title: widget title.
minfeats: lower bound on the number of features to use.
auxdf: data.frame with number of rows equal to nrow(mat), with metadata to be displayed in a hovering tooltip.
eset: instance of ExpressionSet.
infmla: instance of formula (see the example, where mol.biol~. is used).
Experimental tool to illustrate the impacts of the choice of distance, agglomeration method, etc.

a shinyApp result that will display in the active browser

mlearnWidget will attempt to nicely produce a variable importance plot using randomForestI. This means that the annotation package for the probe identifiers should be loaded or an error will be thrown.
VJ Carey <[email protected]>
# should run with example(hclustWidget, ask=FALSE)
if (interactive()) {
  library(shiny)
  library(MASS)
  data(crabs)
  cr = data.matrix(crabs[,-c(1:3)])
  au = crabs[,1:3]
  show(hclustWidget(cr, auxdf=au))  ## must use stop widget button to proceed
  library(ALL)
  library(hgu95av2.db)
  data(ALL)
  show(mlearnWidget(ALL[1:500,], mol.biol~.))
}
conveys information about machine learning functions in CRAN packages, for example, to the MLearn wrapper
Objects can be created by calls of the form new("learnerSchema", ...).
packageName: Object of class "character", a string naming the package in which the function to be used is defined.
mlFunName: Object of class "character", a string naming the function to be used.
converter: Object of class "function", a function with parameters obj, data, trainInd that will produce a classifierOutput instance.
signature(formula = "formula", data = "ExpressionSet", method = "learnerSchema", trainInd = "numeric"): execute the desired learner, passing a formula and ExpressionSet.
signature(formula = "formula", data = "data.frame", method = "learnerSchema", trainInd = "numeric"): execute the desired learner, passing a formula.
signature(object = "learnerSchema"): concise display.
Vince Carey <[email protected]>
showClass("learnerSchema")
revised MLearn interface for machine learning, emphasizing a schematic description of external learning functions like knn, lda, nnet, etc.
MLearn(formula, data, .method, trainInd, ...)
makeLearnerSchema(packname, mlfunname, converter, predicter)
formula: standard model formula.
data: data.frame or ExpressionSet instance.
.method: instance of learnerSchema.
trainInd: obligatory numeric vector of indices of data to be used for training (all other data are used for testing), or an instance of the xvalSpec class.
...: additional named arguments passed to the external learning function.
packname: character – name of the package harboring a learner function.
mlfunname: character – name of the function to use.
converter: function – with parameters (obj, data, trainInd), tells how to convert the material in obj (produced by packname::mlfunname) into a classifierOutput instance.
predicter: function – with parameters (obj, newdata, ...), tells how to use the material in obj to predict the classes of newdata.
The purpose of the MLearn methods is to provide a uniform calling sequence to diverse machine learning algorithms. In R packages, machine learning functions can have parameters (x, y, ...) or (formula, data, ...) or some other sequence, and these functions can return lists or vectors or other sorts of things. With MLearn, we always have the calling sequence MLearn(formula, data, .method, trainInd, ...), and data can be a data.frame or ExpressionSet. MLearn will always return an S4 instance of classifierOutput or clusteringOutput.

At this time (1.13.x), NA values in predictors trigger an error.

To obtain documentation on the older (pre bioc 2.1) version of the MLearn method, please use help(MLearn-OLD).
randomForest. Note that to obtain the default performance of randomForestB, you need to set the mtry and sampsize parameters to sqrt(number of features) and table([training set response factor]) respectively, as these were not taken to be the function's defaults. Note you can use xvalSpec("NOTEST") as trainInd to use all the samples; the RObject() result will print the misclassification matrix estimate along with the OOB error rate estimate. A sketch of this usage follows this list.
knn; special support bridge required, defined in MLInterfaces.
knn.cv; special support bridge required, defined in MLInterfaces. This option uses the embedded leave-one-out cross-validation of knn.cv, and thereby achieves high performance. You can have more general cross-validation using knnI with an xvalSpec, but it will be slower. When using this learner schema, you should use the numerical trainInd setting with 1:N, where N is the number of samples; a sketch of this usage follows this list.
diagDA; special support bridge required, defined in MLInterfaces.
glm – with binomial family, expecting a dichotomous factor as the response variable; not bulletproofed against other responses yet. If the response probability estimate exceeds the threshold, predict 1, else 0.
gbm, forcing the Bernoulli loss function.
blackboost – you MUST supply a family parameter relevant for mboost package procedures
lvqtest after building codebook with lvqinit and updating with olvq1. You will need to write your own detailed schema if you want to tweak tuning parameters.
hclust – you must explicitly specify distance and agglomeration procedure.
kmeans – you must explicitly specify centers and algorithm name.
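Two brief sketches of points noted above, using the crabs data from the examples below. First, xvalSpec("NOTEST") as trainInd with randomForestI (parameter values are illustrative):

library(MASS)
data(crabs)
rfAll = MLearn(sp~CW+RW, data=crabs, randomForestI, xvalSpec("NOTEST"), ntree=100)
RObject(rfAll)  # prints the OOB misclassification matrix and error rate estimate

Second, the knn.cv bridge with the numerical trainInd covering all N samples (assuming knn.cvI accepts the same k and l tuning parameters as knnI):

kcv = MLearn(sp~CW+RW, data=crabs, knn.cvI(k=3, l=0), 1:nrow(crabs))
kcv  # predictions reflect the embedded leave-one-out procedure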
If the parallel package is attached, cross-validation will be distributed to cores using mclapply.
Instances of classifierOutput or clusteringOutput
Vince Carey <[email protected]>
Try example(hclustWidget, ask=FALSE) for an interactive approach to cluster analysis tuning.
library("MASS")
data(crabs)
set.seed(1234)
kp = sample(1:200, size=120)
rf1 = MLearn(sp~CW+RW, data=crabs, randomForestI, kp, ntree=600)
rf1
nn1 = MLearn(sp~CW+RW, data=crabs, nnetI, kp, size=3, decay=.01, trace=FALSE)
nn1
RObject(nn1)
knn1 = MLearn(sp~CW+RW, data=crabs, knnI(k=3,l=2), kp)
knn1
names(RObject(knn1))
dlda1 = MLearn(sp~CW+RW, data=crabs, dldaI, kp)
dlda1
names(RObject(dlda1))
lda1 = MLearn(sp~CW+RW, data=crabs, ldaI, kp)
lda1
names(RObject(lda1))
slda1 = MLearn(sp~CW+RW, data=crabs, sldaI, kp)
slda1
names(RObject(slda1))
svm1 = MLearn(sp~CW+RW, data=crabs, svmI, kp)
svm1
names(RObject(svm1))
ldapp1 = MLearn(sp~CW+RW, data=crabs, ldaI.predParms(method="debiased"), kp)
ldapp1
names(RObject(ldapp1))
qda1 = MLearn(sp~CW+RW, data=crabs, qdaI, kp)
qda1
names(RObject(qda1))
logi = MLearn(sp~CW+RW, data=crabs, glmI.logistic(threshold=0.5), kp, family=binomial) # need family
logi
names(RObject(logi))
rp2 = MLearn(sp~CW+RW, data=crabs, rpartI, kp)
rp2
## recode data for RAB
#nsp = ifelse(crabs$sp=="O", -1, 1)
#nsp = factor(nsp)
#ncrabs = cbind(nsp,crabs)
#rab1 = MLearn(nsp~CW+RW, data=ncrabs, RABI, kp, maxiter=10)
#rab1
#
# new approach to adaboost
#
ada1 = MLearn(sp ~ CW+RW, data = crabs, .method = adaI,
    trainInd = kp, type = "discrete", iter = 200)
ada1
confuMat(ada1)
#
lvq.1 = MLearn(sp~CW+RW, data=crabs, lvqI, kp)
lvq.1
nb.1 = MLearn(sp~CW+RW, data=crabs, naiveBayesI, kp)
confuMat(nb.1)
bb.1 = MLearn(sp~CW+RW, data=crabs, baggingI, kp)
confuMat(bb.1)
#
# new mboost interface -- you MUST supply family for nonGaussian response
#
require(party)  # trafo ... killing cmd check
blb.1 = MLearn(sp~CW+RW+FL, data=crabs, blackboostI, kp, family=mboost::Binomial())
confuMat(blb.1)
#
# ExpressionSet illustration
#
data(sample.ExpressionSet)
# needed to increase training set size to avoid a new randomForest condition
# on empty class
set.seed(1234)
X = MLearn(type~., sample.ExpressionSet[100:250,], randomForestI, 1:19, importance=TRUE)
library(randomForest)
library(hgu95av2.db)
opar = par(no.readonly=TRUE)
par(las=2)
plot(getVarImp(X), n=10, plat="hgu95av2", toktype="SYMBOL")
par(opar)
#
# demonstrate cross validation
#
nn1cv = MLearn(sp~CW+RW, data=crabs[c(1:20,101:120),], nnetI,
    xvalSpec("LOO"), size=3, decay=.01, trace=FALSE)
confuMat(nn1cv)
nn2cv = MLearn(sp~CW+RW, data=crabs[c(1:20,101:120),], nnetI,
    xvalSpec("LOG",5, balKfold.xvspec(5)), size=3, decay=.01, trace=FALSE)
confuMat(nn2cv)
nn3cv = MLearn(sp~CW+RW+CL+BD+FL, data=crabs[c(1:20,101:120),], nnetI,
    xvalSpec("LOG",5, balKfold.xvspec(5), fsFun=fs.absT(2)), size=3, decay=.01, trace=FALSE)
confuMat(nn3cv)
nn4cv = MLearn(sp~.-index-sex, data=crabs[c(1:20,101:120),], nnetI,
    xvalSpec("LOG",5, balKfold.xvspec(5), fsFun=fs.absT(2)), size=3, decay=.01, trace=FALSE)
confuMat(nn4cv)
#
# try with expression data
#
library(golubEsets)
data(Golub_Train)
litg = Golub_Train[100:150,]
g1 = MLearn(ALL.AML~., litg, nnetI,
    xvalSpec("LOG",5, balKfold.xvspec(5), fsFun=fs.probT(.75)),
    size=3, decay=.01, trace=FALSE)
confuMat(g1)
#
# computations related to ALL that were used for rda and may be used elsewhere
#
library(ALL)
data(ALL)
#
# restrict to BCR/ABL or NEG
#
bio <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
#
# restrict to B-cell
#
isb <- grep("^B", as.character(ALL$BT))
kp <- intersect(bio, isb)
all2 <- ALL[,kp]
mads = apply(exprs(all2), 1, mad)
kp = which(mads > 1)  # get around 250 genes
vall2 = all2[kp,]
vall2$mol.biol = factor(vall2$mol.biol)  # drop unused levels
# illustrate clustering support
cl1 = MLearn(~CW+RW+CL+FL+BD, data=crabs, hclustI(distFun=dist, cutParm=list(k=4)))
plot(cl1)
cl1a = MLearn(~CW+RW+CL+FL+BD, data=crabs, hclustI(distFun=dist, cutParm=list(k=4)),
    method="complete")
plot(cl1a)
cl2 = MLearn(~CW+RW+CL+FL+BD, data=crabs, kmeansI, centers=5, algorithm="Hartigan-Wong")
plot(cl2, crabs[,-c(1:3)])
c3 = MLearn(~CL+CW+RW, crabs, pamI(dist), k=5)
c3
plot(c3, data=crabs[,c("CL", "CW", "RW")])
# new interfaces to PLS thanks to Laurent Gatto
set.seed(1234)
kp = sample(1:200, size=120)
#plsda.1 = MLearn(sp~CW+RW, data=crabs, plsdaI, kp, probMethod="Bayes")
#plsda.1
#confuMat(plsda.1)
#confuMat(plsda.1,t=.65) ## requires at least 0.65 post error prob to assign species
#
#plsda.2 = MLearn(type~., data=sample.ExpressionSet[100:250,], plsdaI, 1:16)
#plsda.2
#confuMat(plsda.2)
#confuMat(plsda.2,t=.65) ## requires at least 0.65 post error prob to assign outcome
## examples for predict
#clout <- MLearn(type~., sample.ExpressionSet[100:250,], svmI, 1:16)
#predict(clout, sample.ExpressionSet[100:250,17:26])
These functions are internal tools for MLInterfaces. Users will generally not call these functions directly.
getGrid(x)
x: a vector, matrix, or ExpressionSet.
Forthcoming.
Functions with ‘new’ as prefix are constructor helpers.
VJ Carey <[email protected]>
Methods to calculate the number of true positives (tp), true negatives (tn), false negatives (fn), false positives (fp), and the accuracy (acc), precision, recall (same as sensitivity), specificity, F1 and macroF1 scores.

Each method also accepts an naAs0 argument defining whether NAs should be replaced by 0 (the default is FALSE).
Methods tp, tn, fp, fn, F1, acc and specificity:
signature(obj = "table")

Methods recall (sensitivity), precision and macroF1:
signature(obj = "classifierOutput", type = "character")
signature(obj = "classifierOutput", type = "missing")
signature(obj = "classifierOutput", type = "numeric")
signature(obj = "table")
## the confusion matrix
cm <- table(iris$Species, sample(iris$Species))
tp(cm)
tn(cm)
fp(cm)
fn(cm)
acc(cm)
precision(cm)
recall(cm)
F1(cm)
macroF1(cm)
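A minimal sketch of the naAs0 argument described above: precision can be NA when a class is never predicted, and naAs0 = TRUE reports such entries as 0 (with the cm above, every class is predicted, so the result is unchanged):

precision(cm, naAs0 = TRUE)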
show the classification boundaries on the plane dictated by two genes in an ExpressionSet
uses two genes in the ExpressionSet to exhibit the decision boundaries in the plane
uses two columns in the data.frame to exhibit the decision boundaries in the plane
library(ALL)
library(hgu95av2.db)
data(ALL)
#
# restrict to BCR/ABL or NEG
#
bio <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
#
# restrict to B-cell
#
isb <- grep("^B", as.character(ALL$BT))
kp <- intersect(bio, isb)
all2 <- ALL[,kp]
#
# sample 2 genes at random
#
set.seed(1234)
ng <- nrow(exprs(all2))
# pick 5 in case any NAs come back
pick <- sample(1:ng, size=5, replace=FALSE)
gg <- all2[pick,]
sym <- unlist(mget(featureNames(gg), hgu95av2SYMBOL))
bad = which(is.na(sym))
if (length(bad)>0) {
  gg = gg[-bad,]
  sym = sym[-bad]
}
gg = gg[1:2,]
sym = sym[1:2]
featureNames(gg) <- sym
gg$class = factor(ifelse(all2$mol.biol=="NEG", "NEG", "POS"))
cl1 <- which(gg$class == "NEG")
cl2 <- which(gg$class != "NEG")
#
# create balanced training sample
#
trainInds <- c(sample(cl1, size=floor(length(cl1)/2)),
    sample(cl2, size=floor(length(cl2)/2)))
#
# run rpart
#
tgg <- MLearn(class~., gg, rpartI, trainInds, minsplit=4)
opar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
planarPlot(tgg, gg, "class")
title("rpart")
points(exprs(gg)[1,trainInds], exprs(gg)[2,trainInds],
    col=ifelse(gg$class[trainInds]=="NEG", "yellow", "black"), pch=16)
#
# run nnet
#
ngg <- MLearn(class~., gg, nnetI, trainInds, size=8)
planarPlot(ngg, gg, "class")
points(exprs(gg)[1,trainInds], exprs(gg)[2,trainInds],
    col=ifelse(gg$class[trainInds]=="NEG", "yellow", "black"), pch=16)
title("nnet")
#
# run knn
#
kgg <- MLearn(class~., gg, knnI(k=3,l=1), trainInds)
planarPlot(kgg, gg, "class")
points(exprs(gg)[1,trainInds], exprs(gg)[2,trainInds],
    col=ifelse(gg$class[trainInds]=="NEG", "yellow", "black"), pch=16)
title("3-nn")
#
# run svm
#
sgg <- MLearn(class~., gg, svmI, trainInds)
planarPlot(sgg, gg, "class")
points(exprs(gg)[1,trainInds], exprs(gg)[2,trainInds],
    col=ifelse(gg$class[trainInds]=="NEG", "yellow", "black"), pch=16)
title("svm")
par(opar)
shiny app for interactive 3D visualization of mlbench hypercube
plspinHcube(insbwidth=4)
insbwidth: numeric, sidebar width.
Runs shinyApp on a ui and server that render Gaussian data at hypercube vertices.
VJ Carey <[email protected]>
if (interactive()) plspinHcube()
predict method for classifierOutput objects
This function predicts values based on models trained with MLInterfaces' MLearn interface to many machine learning algorithms.
## S3 method for class 'classifierOutput'
predict(object, newdata, ...)
object: an instance of class classifierOutput.
newdata: an object containing the new input data (for example, a data.frame or an ExpressionSet).
...: other arguments to be passed to the algorithm-specific predict methods.
This S3 method will extract the ML model from the classifierOutput instance and call either a generic predict method or, if available, a specifically written wrapper to obtain class predictions and class probabilities.

Currently, a list with:

testPredictions: a factor with class predictions.
testScores: a numeric vector or matrix with class scores.

The function output will most likely be updated in the near future to a classifierOutput (or similar) object.
Laurent Gatto <[email protected]>
See also MLearn and classifierOutput.
## Not run:
set.seed(1234)
data(sample.ExpressionSet)
trainInd <- 1:16
clout.svm <- MLearn(type~., sample.ExpressionSet[100:250,], svmI, trainInd)
predict(clout.svm, sample.ExpressionSet[100:250,-trainInd])
clout.ksvm <- MLearn(type~., sample.ExpressionSet[100:250,], ksvmI, trainInd)
predict(clout.ksvm, sample.ExpressionSet[100:250,-trainInd])
clout.nnet <- MLearn(type~., sample.ExpressionSet[100:250,], nnetI, trainInd, size=3, decay=.01)
predict(clout.nnet, sample.ExpressionSet[100:250,-trainInd])
clout.knn <- MLearn(type~., sample.ExpressionSet[100:250,], knnI(k=3), trainInd)
predict(clout.knn, sample.ExpressionSet[100:250,-trainInd], k=1)
predict(clout.knn, sample.ExpressionSet[100:250,-trainInd], k=3)
#clout.plsda <- MLearn(type~., sample.ExpressionSet[100:250,], plsdaI, trainInd)
#predict(clout.plsda, sample.ExpressionSet[100:250,-trainInd])
clout.nb <- MLearn(type~., sample.ExpressionSet[100:250,], naiveBayesI, trainInd)
predict(clout.nb, sample.ExpressionSet[100:250,-trainInd])
# this can fail if training set does not yield sufficient diversity in response vector;
# setting seed seems to help with this example, but other applications may have problems
#
clout.rf <- MLearn(type~., sample.ExpressionSet[100:250,], randomForestI, trainInd)
predict(clout.rf, sample.ExpressionSet[100:250,-trainInd])
## End(Not run)
"projectedLearner"
helps depict prediction hyperregions from high-dimensional models
Objects can be created by calls of the form new("projectedLearner", ...).
fittedLearner: Object of class "classifierOutput".
trainingSetPCA: Object of class "prcomp".
trainingLabels: Object of class "ANY", given labels for features used in training.
testLabels: Object of class "ANY", given labels for features used in testing.
gridFeatsProjectedToTrainingPCs: Object of class "matrix", rotated coordinates of the grid features.
gridPredictions: Object of class "ANY", predicted labels for all grid points.
trainFeatsProjectedToTrainingPCs: Object of class "matrix", rotated coordinates of training features.
testFeatsProjectedToTrainingPCs: Object of class "matrix", rotated coordinates of test features.
trainPredictions: Object of class "ANY", predicted labels for training features.
testPredictions: Object of class "ANY", predicted labels for test features.
theCall: Object of class "call", the call used to generate this wonderful thing.
signature(x = "projectedLearner"): uses rgl to give a dynamic 3d-like projection of labels in colored regions. See projectLearnerToGrid for an example.
signature(x = "projectedLearner", y = "ANY"): pairs plot of the tesselated PCA of the training features.
signature(x = "projectedLearner"): a 2d plot of the tesselation projection for selected axes of the PCA.
signature(object = "projectedLearner"): object housing numerical resources for the renderings.
plot may need to be modified when there are many features/PCs in use.

plotOne has additional arguments ind1, ind2, and type. ind1 and ind2 specify the PCs to display. type is one of "showTestPredictions" (default), "showTrainPredictions", "showTestLabels", "showTrainLabels". These indicate what will be used to locate glyphs with labels in the projected scatterplots.
VJ Carey <[email protected]>
None.
showClass("projectedLearner")
create learned tesselation of feature space after PC transformation
projectLearnerToGrid(formula, data, learnerSchema, trainInds, ...,
    dropIntercept = TRUE, ngpts = 20, predExtras = list(), predWrapper = force)
formula: standard formula, typically of the form "y~." where y denotes the class label variable to be predicted by all remaining features in the input data frame.
data: a data.frame instance.
learnerSchema: an instance of learnerSchema, e.g., ldaI or nnetI.
trainInds: integer vector of rows of data to be used for training.
...: additional parameters for use with the chosen learner, e.g., size and decay for nnetI.
dropIntercept: logical indicating whether to include a column of 1s among the feature column-vectors.
ngpts: number of equispaced points along the range of each input feature to use in forming a grid in feature space.
predExtras: a list with named elements giving bindings for extra parameters needed to predict labels for the learner in use; for example, with nnetI, use predExtras=list(type="class"), as in the example below.
predWrapper: sometimes a function call is needed to extract the predicted labels from the RObject; for example, with ldaI, use predWrapper=function(x) x$class, as in the example below.
instance of projectedLearner-class
VJ Carey <[email protected]>
none.
library(mlbench)
# demonstrate with 3 dimensional hypercube problem
kk = mlbench.hypercube()
colnames(kk$x) = c("f1", "f2", "f3")
hcu = data.frame(cl=kk$classes, kk$x)
set.seed(1234)
sam = sample(1:nrow(kk$x), size=nrow(kk$x)/2)
ldap = projectLearnerToGrid(cl~., data=hcu, ldaI, sam,
    predWrapper=function(x) x$class)
plot(ldap)
confuMat(ldap@fittedLearner)
nnetp = projectLearnerToGrid(cl~., data=hcu, nnetI, sam, size=2, decay=.01,
    predExtras=list(type="class"))
plot(nnetp)
confuMat(nnetp@fittedLearner)
#if (requireNamespace("rgl") && interactive()) {
#  learnerIn3D(nnetp)
#  ## customising the rgl plot
#  learnerIn3D(nnetp, size = 10, alpha = 0.1)
#}
real adaboost ... a demonstration version
RAB(formula, data, maxiter=200, maxdepth=1)
formula: formula – the response variable must be coded -1, 1.
data: the data.frame to use.
maxiter: maximum number of iterations.
maxdepth: maxdepth – passed to rpart.
an instance of raboostCont
Vince Carey <[email protected]>
Friedman et al., Ann. Stat. 28(2): 337.
library(MASS)
library(rpart)
data(Pima.tr)
data(Pima.te)
Pima.all = rbind(Pima.tr, Pima.te)
tonp = ifelse(Pima.all$type == "Yes", 1, -1)
tonp = factor(tonp)
Pima.all = data.frame(Pima.all[,1:7], mtype=tonp)
fit1 = RAB(mtype~ped+glu+npreg+bmi+age, data=Pima.all[1:200,], maxiter=10, maxdepth=5)
pfit1 = Predict(fit1, newdata=Pima.tr)
table(Pima.tr$type, pfit1)
container class for fits produced by RAB, the demonstration real adaboost procedure
Objects can be created by calls of the form new("raboostCont", ...).
.Data: Object of class "list".
formula: Object of class "formula".
call: Object of class "call".
Class "list"
, from data part.
Class "vector"
, by class "list", distance 2.
Predict
is an S4 method that can apply to instances of this class.
VJ Carey <[email protected]>
showClass("raboostCont")
collects data on variable importance
Objects can be created by calls of the form new("varImpStruct", ...). These are matrices of importance measures with separate slots identifying the algorithm generating the measures and the variable names.
.Data: Object of class "matrix", the actual importance measures.
method: Object of class "character", a tag.
varnames: Object of class "character", a conformant vector of names of variables.
Class "matrix"
, from data part.
Class "structure"
, by class "matrix"
.
Class "array"
, by class "matrix"
.
Class "vector"
, by class "matrix", with explicit coerce.
Class "vector"
, by class "matrix", with explicit coerce.
signature(x = "varImpStruct"): make a bar plot; you can supply arguments plat and toktype, which will use lookUp(..., plat, toktype) from the annotate package to translate probe names to, e.g., gene symbols.
signature(object = "varImpStruct"): simple abbreviated display.
signature(object = "classifOutput", fixNames = "logical"): extractor of the variable importance structure; the fixNames parameter is to remove the leading X used to make variable names syntactic by randomForest (ca. 1/2008). You can set fixNames to FALSE if using the hu6800 platform, because all featureNames are syntactic as given.
signature(object = "classifOutput", fixNames = "logical"): extractor of the variable importance data, with annotation; the fixNames parameter is as above.
library(golubEsets)
data(Golub_Merge)
library(hu6800.db)
smallG <- Golub_Merge[1001:1060,]
set.seed(1234)
opar = par(no.readonly=TRUE)
par(las=2, mar=c(10,11,5,5))
rf2 <- MLearn(ALL.AML~., smallG, randomForestI, 1:40, importance=TRUE,
    sampsize=table(smallG$ALL.AML[1:40]), mtry=sqrt(ncol(exprs(smallG))))
plot(getVarImp(rf2, FALSE), n=10, plat="hu6800", toktype="SYMBOL")
par(opar)
report(getVarImp(rf2, FALSE), n=10, plat="hu6800", toktype="SYMBOL")
Use cross-validation in a clustered computing environment
xvalLoop(cluster, ...)
cluster: any S4-class object, used to indicate how to perform clustered computations.
...: additional arguments used to inform the clustered computation.
Cross-validation usually involves repeated calls to the same function, but with different arguments. This provides an obvious place for using clustered computers to enhance execution. The method xval is structured to exploit this; xvalLoop provides an easy mechanism to change how xval performs cross-validation.

The idea is to write an xvalLoop method that returns a function. The function is then used to execute the cross-validation. For instance, the default method returns the function lapply, so the cross-validation is performed by using lapply. A different method might return a function that executed lapply-like functions, but sent different parts of the function to different computer nodes.

An accompanying vignette illustrates the technique in greater detail. An effective division of labor is for experienced cluster programmers to write lapply-like methods for their favored clustering environment. The user then only has to add the cluster object to the list of arguments to xval to get clustered calculations.

A function taking arguments like those for lapply.
## Not run:
library(golubEsets)
data(Golub_Merge)
smallG <- Golub_Merge[200:250,]
# Evaluation on one node
lk1 <- xval(smallG, "ALL.AML", knnB, xvalMethod="LOO", group=as.integer(0))
table(lk1, smallG$ALL.AML)
# Evaluation on several nodes -- a cluster programmer might write the following...
library(snow)
setOldClass("spawnedMPIcluster")
setMethod("xvalLoop", signature(cluster = "spawnedMPIcluster"),
  ## use the function returned below to evaluate
  ## the central cross-validation loop in xval
  function(cluster, ...) {
    clusterExportEnv <- function(cl, env = .GlobalEnv) {
      unpackEnv <- function(env) {
        for (name in ls(env)) assign(name, get(name, env), .GlobalEnv)
        NULL
      }
      clusterCall(cl, unpackEnv, env)
    }
    function(X, FUN, ...) { # this gets returned to xval
      ## send all visible variables from the parent (i.e., xval) frame
      clusterExportEnv(cluster, parent.frame(1))
      parLapply(cluster, X, FUN, ...)
    }
  })
# ... and use the cluster like this...
cl <- makeCluster(2, "MPI")
clusterEvalQ(cl, library(MLInterfaces))
lk1 <- xval(smallG, "ALL.AML", knnB, xvalMethod="LOO", group=as.integer(0), cluster = cl)
table(lk1, smallG$ALL.AML)
## End(Not run)
container for information specifying a cross-validated machine learning exercise
xvalSpec(type, niter = 0,
    partitionFunc = function(data, classLab, iternum) {
        (seq_len(nrow(data)))[-iternum]
    },
    fsFun = function(formula, data) formula)
type: a string: "LOO" indicating leave-one-out cross-validation, "LOG" indicating leave-out-group, or "NOTEST" indicating that the entire dataset is used in a single training run.
niter: numeric specification of the number of cross-validation iterations to use. Ignored if type is "LOO".
partitionFunc: function, with parameters data (bound to a data.frame), clab (bound to a character string), and iternum (bound to a numeric index into the sequence 1:niter). The function returns the indices of the records to be used for training at iteration iternum; see balKfold.xvspec for a balanced generator.
fsFun: function, with parameters formula, data. The function must return a formula suitable for defining a model on the basis of the main input data. A candidate fsFun is given in the example for the fsHistory function.
If type == "LOO", no other parameters are inspected. If type == "LOG", a value for partitionFunc must be supplied. We recommend using balKfold.xvspec(K). The values of niter and K in this usage must be the same. This redundancy will be removed in a future upgrade.

If the parallel package is attached and the symbol mc_fork is loaded, cross-validation will be distributed to cores using mclapply.
An instance of classifierOutput, with a special structure. The RObject return slot is populated with a list of niter cross-validation results. Each element of this list is itself a list with two elements: test.idx (the indices of the test set for the associated cross-validation iteration) and mlans (the classifierOutput generated at each iteration). Thus there are classifierOutput instances nested within the main classifierOutput returned when an xvalSpec is used.
Vince Carey <[email protected]>
library("MASS")
data(crabs)
set.seed(1234)
#
# demonstrate cross validation
#
nn1cv = MLearn(sp~CW+RW, data=crabs, nnetI,
    xvalSpec("LOG", 5, balKfold.xvspec(5)), size=3, decay=.01)
nn1cv
confuMat(nn1cv)
names(RObject(nn1cv)[[1]])
RObject(RObject(nn1cv)[[1]]$mlans)