Title: | A framework for cross-validated classification problems, with applications to differential variability and differential distribution testing |
---|---|
Description: | The software formalises a framework for classification and survival model evaluation in R. There are four stages: data transformation, feature selection, model training, and prediction. The requirements of variable types and variable order are fixed, but specialised variables for functions can also be provided. The framework is wrapped in a driver loop that reproducibly carries out a number of cross-validation schemes. Functions for differential mean, differential variability, and differential distribution are included. Additional functions may be developed by the user, by creating an interface to the framework. |
Authors: | Dario Strbenac [aut, cre], Ellis Patrick [aut], Sourish Iyengar [aut], Harry Robertson [aut], Andy Tran [aut], John Ormerod [aut], Graham Mann [aut], Jean Yang [aut] |
Maintainer: | Dario Strbenac <[email protected]> |
License: | GPL-3 |
Version: | 3.11.4 |
Built: | 2025-01-06 05:23:17 UTC |
Source: | https://github.com/bioc/ClassifyR |
The data set consists of a matrix of abundances of the 2000 most variable gene expression measurements for 190 samples and a factor vector of classes for those samples.
measurements has a row for each sample and a column for each gene. classes is a factor vector with values No and Yes, indicating whether a particular person has asthma.
A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data, Scientific Reports, 2018. Webpage: http://www.nature.com/articles/s41598-018-27189-4
Prints a list of keywords to use with crossValidate
available(what = c("classifier", "selectionMethod", "multiViewMethod"))
what |
Default: |
Dario Strbenac
available()
These functions tabulate or plot various aspects of precision pathways, such as accuracies and costs.
calcCostsAndPerformance(precisionPathways, costs = NULL)

## S3 method for class 'PrecisionPathways'
summary(object, weights = c(accuracy = 0.5, cost = 0.5), ...)

bubblePlot(precisionPathways, ...)

## S3 method for class 'PrecisionPathways'
bubblePlot(precisionPathways, pathwayColours = NULL, ...)

flowchart(precisionPathways, ...)

## S3 method for class 'PrecisionPathways'
flowchart(
  precisionPathways,
  pathway,
  nodeColours = c(assay = "#86C57C", class1 = "#ACCEE0", class2 = "#F47F72"),
  ...
)

strataPlot(precisionPathways, ...)

## S3 method for class 'PrecisionPathways'
strataPlot(
  precisionPathways,
  pathway,
  classColours = c(class1 = "#4DAF4A", class2 = "#984EA3"),
  ...
)
precisionPathways |
A pathway of class |
costs |
A named vector of assays with the cost of each one. |
object |
A set of pathways of class |
weights |
A numeric vector of length two specifying how to weight the predictive accuracy and the cost during ranking. Must sum to 1. |
... |
Not used but just following the S3 requirement of the generic template. |
pathwayColours |
A named vector of colours with names being the names of pathways. If none is specified, a default colour scheme will automatically be chosen. |
pathway |
A character vector of length 1 specifying which pathway to plot, e.g. "clinical-mRNA". |
nodeColours |
A named vector of colours with names being |
classColours |
A named vector of colours with names being |
If calcExternalPerformance is used, such as when having a vector of known classes and a vector of predicted classes determined outside of the ClassifyR package, a single metric value is calculated. If calcCVperformance is used, it annotates the results of calling crossValidate, runTests or runTest with one of the user-specified performance measures.
## S4 method for signature 'factor,factor'
calcExternalPerformance(
  actualOutcome,
  predictedOutcome,
  performanceTypes = "auto"
)

## S4 method for signature 'Surv,numeric'
calcExternalPerformance(
  actualOutcome,
  predictedOutcome,
  performanceTypes = "auto"
)

## S4 method for signature 'factor,tabular'
calcExternalPerformance(
  actualOutcome,
  predictedOutcome,
  performanceTypes = "auto"
)

## S4 method for signature 'ClassifyResult'
calcCVperformance(result, performanceTypes = "auto")

performanceTable(
  resultsList,
  performanceTypes = "auto",
  aggregate = c("median", "mean")
)
actualOutcome |
A factor vector or survival information specifying each sample's known outcome. |
predictedOutcome |
A factor vector or survival information of the same length as |
performanceTypes |
Default:
|
result |
An object of class |
resultsList |
A list of modelling results. Each element must be of type |
aggregate |
Default: |
All metrics except Matthews Correlation Coefficient are suitable for evaluating classification scenarios with more than two classes and are reimplementations of those available from Intel DAAL.
If crossValidate, runTests or runTest was run in resampling mode, one performance measure is produced for every resampling. Otherwise, if the leave-k-out mode was used, then the predictions are concatenated, and one performance measure is calculated for all classifications.
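The difference between the two modes can be sketched in base R. This is a simplified illustration rather than ClassifyR's implementation; the `predictions` data frame and its column names are hypothetical.

```r
# Hypothetical predictions table: two permutations of the same four samples.
predictions <- data.frame(
  permutation = rep(1:2, each = 4),
  actual      = factor(c("No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes")),
  predicted   = factor(c("No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"))
)

# TRUE wherever the predicted class differs from the known class.
wrong <- predictions$actual != predictions$predicted

# Resampling mode: one error rate per resampling (here, per permutation).
errorPerResampling <- tapply(wrong, predictions$permutation, mean)

# Leave-k-out mode: all predictions are pooled and a single value is computed.
errorPooled <- mean(wrong)
```

In this toy case the per-permutation error rates are 0.25 and 0.5, while pooling all eight predictions gives a single error rate of 0.375.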
"Balanced Error"
calculates the balanced error rate and is better suited to class-imbalanced data sets than the ordinary error rate specified by "Error". "Sample Error" calculates the error rate of each sample individually. This may help to identify which samples are contributing the most to the overall error rate and check them for confounding factors. Precision, recall and F1 score have micro and macro summary versions. The macro versions are preferable because the metric will not have a good score if there is substantial class imbalance and the classifier predicts all samples as belonging to the majority class.
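To make the effect of class imbalance concrete, the following base R sketch computes the ordinary and balanced error rates, and micro and macro recall, for a classifier that assigns every sample to the majority class. It is an illustration only and does not call ClassifyR; the toy data and variable names are assumptions. Recall is shown rather than precision because precision is undefined for a class that is never predicted.

```r
# Imbalanced toy data: 8 samples of class A, 2 of class B.
actual    <- factor(c(rep("A", 8), rep("B", 2)), levels = c("A", "B"))
# A classifier that always predicts the majority class.
predicted <- factor(rep("A", 10), levels = c("A", "B"))

confusion <- table(actual = actual, predicted = predicted)

# Ordinary error rate: fraction of all samples misclassified.
errorRate <- mean(actual != predicted)

# Balanced error rate: average of the per-class error rates.
perClassError <- 1 - diag(confusion) / rowSums(confusion)
balancedError <- mean(perClassError)

# Recall per class, summarised as macro (unweighted mean over classes)
# and micro (pooled over all samples).
recallPerClass <- diag(confusion) / rowSums(confusion)
macroRecall <- mean(recallPerClass)
microRecall <- sum(diag(confusion)) / sum(confusion)
```

Here the ordinary error rate is 0.2 and micro recall is 0.8, both of which look acceptable, whereas the balanced error rate and macro recall are both 0.5, exposing that the minority class is never predicted correctly.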
If calcCVperformance was run, an updated ClassifyResult object, with new metric values in the performance slot. If calcExternalPerformance was run, the performance metric value itself.
Dario Strbenac
predictTable <- DataFrame(sample = paste("A", 1:10, sep = ''),
                          class = factor(sample(LETTERS[1:2], 50, replace = TRUE)))
actual <- factor(sample(LETTERS[1:2], 10, replace = TRUE))
result <- ClassifyResult(DataFrame(characteristic = "Data Set", value = "Example"),
                         paste("A", 1:10, sep = ''), paste("Gene", 1:50),
                         list(paste("Gene", 1:50), paste("Gene", 1:50)),
                         list(paste("Gene", 1:5), paste("Gene", 1:10)),
                         list(function(oracle){}), NULL, predictTable, actual)
result <- calcCVperformance(result)
performance(result)
Contains a list of models, table of actual sample classes and predicted classes, the identifiers of features selected for each fold of each permutation or each hold-out classification, and performance metrics such as error rates. This class is not intended to be created by the user. It is created by crossValidate, runTests or runTest.
ClassifyResult(characteristics, originalNames, originalFeatures,
rankedFeatures, chosenFeatures, models, tunedParameters, predictions, actualOutcome, importance = NULL, modellingParams = NULL, finalModel = NULL)
characteristics
A DataFrame describing the characteristics of the classification done. First column must be named "characteristic" and second column must be named "value". If using wrapper functions for feature selection and classifiers in this package, the function names will automatically be generated and therefore it is not necessary to specify them.
originalNames
All sample names.
originalFeatures
All feature names. Character vector
or DataFrame
with one row for each feature if the data set has multiple kinds
of measurements on the same set of samples.
chosenFeatures
Features selected at each fold. Character vector or a data frame if data set has multiple kinds of measurements on the same set of samples.
models
All of the models fitted to the training data.
tunedParameters
Names of tuning parameters and the value chosen of each parameter.
predictions
A data frame containing sample IDs, predicted class or risk and information about the cross-validation iteration in which the prediction was made.
actualOutcome
The known class or survival data of each sample.
importance
The changes in model performance for each selected variable when it is excluded.
modellingParams
Stores the object used for defining the model building to enable future reuse.
finalModel
A model built using all of the samples for future use. For any tuning parameters, the most popular value of the parameter in cross-validation is used. May be missing if some cross-validated fittings failed. Could be of any class, depending on the R package used to fit the model.
result is a ClassifyResult object.
show(result): Prints a short summary of what result contains.
result is a ClassifyResult object.
sampleNames(result): Returns a vector of sample names present in the data set.
actualOutcome(result): Returns the known outcome of each sample.
models(result): A list of the models fitted for each training.
finalModel(result): A deployable model fitted on all of the data for use on future data.
chosenFeatureNames(result): A list of the features selected for each training.
predictions(result): Returns a DataFrame which has columns with test sample, cross-validation and prediction information.
performance(result): Returns a list of performance measures. This is empty until calcCVperformance has been used.
tunedParameters(result): Returns a list of tuned parameter values. If cross-validation is used, this list will be large, as it stores chosen values for every iteration.
totalPredictions(result): A single number representing the total number of predictions made during the cross-validation procedure.
Dario Strbenac
#if(require(sparsediscrim))
#{
  data(asthma)
  classified <- crossValidate(measurements, classes, nRepeats = 5)
  class(classified)
#}
A function to perform fast or standard Cox proportional hazard model tests.
colCoxTests(measurements, outcome, option = c("fast", "slow"), ...)
measurements |
matrix with variables as columns. |
outcome |
matrix with first column as time and second column as event. |
option |
Default: |
... |
Not currently used. |
CrossValParams object
data(asthma)
time <- rpois(nrow(measurements), 100)
status <- sample(c(0,1), nrow(measurements), replace = TRUE)
outcome <- cbind(time, status)
output <- colCoxTests(measurements, outcome, "fast")
This function has been designed to give a heatmap output of the crissCrossValidate function.
crissCrossPlot(crissCrossResult, includeValues = FALSE)
crissCrossResult |
The output of the crissCrossValidate function. |
includeValues |
If TRUE, then the values of the matrix will be included in the plot. |
Harry Robertson
This function has been designed to perform cross-validation and model prediction on datasets in a pairwise manner.
crissCrossValidate(
  measurements,
  outcomes,
  nFeatures = 20,
  selectionMethod = "auto",
  selectionOptimisation = "Resubstitution",
  trainType = c("modelTrain", "modelTest"),
  performanceType = "auto",
  doRandomFeatures = FALSE,
  classifier = "auto",
  nFolds = 5,
  nRepeats = 20,
  nCores = 1,
  verbose = 0
)
measurements |
A |
outcomes |
A |
nFeatures |
The number of features to be used for modelling. |
selectionMethod |
Default: |
selectionOptimisation |
A character of "Resubstitution", "Nested CV" or "none" specifying the approach used to optimise nFeatures. |
trainType |
Default: |
performanceType |
Default: |
doRandomFeatures |
Default: |
classifier |
Default: |
nFolds |
A numeric specifying the number of folds to use for cross-validation. |
nRepeats |
A numeric specifying the number of repeats or permutations to use for cross-validation. |
nCores |
A numeric specifying the number of cores used if the user wants to use parallelisation. |
verbose |
Default: 0. A number between 0 and 3 for the amount of progress messages to give. A higher number will produce more messages as more lower-level functions print messages. |
A list with elements "real" for the matrix of pairwise performance metrics using real feature selection, "random" if doRandomFeatures is TRUE for metrics of random selection and "params" for a list of parameters used during the execution of this function.
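The shape of a pairwise performance matrix can be sketched in base R as follows. This is only an illustration of the idea of training on each data set and predicting on every other one; it does not use crissCrossValidate, and the toy data sets, the threshold classifier, and the choice of rows for training and columns for testing are all assumptions. For simplicity the diagonal here is a resubstitution accuracy rather than a cross-validated one.

```r
# Three hypothetical data sets, each with one feature `x` and a class label.
dataSets <- list(
  cohortA = data.frame(x = c(1, 2, 8, 9),   class = c("No", "No", "Yes", "Yes")),
  cohortB = data.frame(x = c(2, 3, 7, 10),  class = c("No", "No", "Yes", "Yes")),
  cohortC = data.frame(x = c(5, 6, 11, 12), class = c("No", "No", "Yes", "Yes"))
)

# Toy classifier: a threshold halfway between the two class means of the training data.
trainThreshold <- function(train) mean(tapply(train$x, train$class, mean))

# Accuracy of a threshold on a test data set.
accuracy <- function(threshold, test) {
  predicted <- ifelse(test$x > threshold, "Yes", "No")
  mean(predicted == test$class)
}

# Rows are the training data set, columns are the test data set.
pairwise <- sapply(dataSets, function(test)
  sapply(dataSets, function(train) accuracy(trainThreshold(train), test)))
```

A model trained on cohortA and tested on cohortC scores 0.75 in this toy example, which is the kind of between-data-set drop that a heatmap of the matrix makes visible.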
Harry Robertson
This function has been designed to facilitate the comparison of classification methods using cross-validation, particularly when there are multiple assays per biological unit. A selection of typical comparisons is implemented. The train function is a convenience method for training on one data set and likewise predict for predicting on an independent validation data set.
## S4 method for signature 'DataFrame'
crossValidate(
  measurements, outcome,
  nFeatures = 20,
  selectionMethod = "auto",
  selectionOptimisation = "Resubstitution",
  performanceType = "auto",
  classifier = "auto",
  autoTune = FALSE,
  multiViewMethod = "none",
  assayCombinations = "all",
  nFolds = 5, nRepeats = 20, nCores = 1,
  characteristicsLabel = NULL,
  extraParams = NULL,
  verbose = 0
)

## S4 method for signature 'MultiAssayExperimentOrList'
crossValidate(
  measurements, outcome,
  nFeatures = 20,
  selectionMethod = "auto",
  selectionOptimisation = "Resubstitution",
  performanceType = "auto",
  classifier = "auto",
  autoTune = FALSE,
  multiViewMethod = "none",
  assayCombinations = "all",
  nFolds = 5, nRepeats = 20, nCores = 1,
  characteristicsLabel = NULL,
  extraParams = NULL,
  verbose = 0
)

## S4 method for signature 'data.frame'
crossValidate(
  measurements, outcome,
  nFeatures = 20,
  selectionMethod = "auto",
  selectionOptimisation = "Resubstitution",
  performanceType = "auto",
  classifier = "auto",
  autoTune = FALSE,
  multiViewMethod = "none",
  assayCombinations = "all",
  nFolds = 5, nRepeats = 20, nCores = 1,
  characteristicsLabel = NULL,
  extraParams = NULL,
  verbose = 0
)

## S4 method for signature 'matrix'
crossValidate(
  measurements, outcome,
  nFeatures = 20,
  selectionMethod = "auto",
  selectionOptimisation = "Resubstitution",
  performanceType = "auto",
  classifier = "auto",
  autoTune = FALSE,
  multiViewMethod = "none",
  assayCombinations = "all",
  nFolds = 5, nRepeats = 20, nCores = 1,
  characteristicsLabel = NULL,
  extraParams = NULL,
  verbose = 0
)

## S3 method for class 'matrix'
train(x, outcomeTrain, ...)

## S3 method for class 'data.frame'
train(x, outcomeTrain, ...)

## S3 method for class 'DataFrame'
train(
  x, outcomeTrain,
  selectionMethod = "auto",
  nFeatures = 20,
  classifier = "auto",
  autoTune = FALSE,
  performanceType = "auto",
  multiViewMethod = "none",
  assayIDs = "all",
  extraParams = NULL,
  verbose = 0,
  ...
)

## S3 method for class 'list'
train(x, outcomeTrain, ...)

## S3 method for class 'MultiAssayExperiment'
train(x, outcome, ...)

## S3 method for class 'trainedByClassifyR'
predict(object, newData, outcome, ...)
measurements |
Either a |
outcome |
A vector of class labels of class |
... |
For |
nFeatures |
The number of features to be used for classification. If this is a single number, the same number of features will be used for all comparisons
or assays. If a numeric vector these will be optimised over using |
selectionMethod |
Default: |
selectionOptimisation |
A character of "Resubstitution", "Nested CV" or "none" specifying the approach used to optimise |
performanceType |
Performance metric to optimise if classifier has any tuning parameters. |
classifier |
Default: |
autoTune |
Default: |
multiViewMethod |
Default: |
assayCombinations |
A character vector or list of character vectors proposing the assays or, in the case of a list, combination of assays to use
with each element being a vector of assays to combine. Special value |
nFolds |
A numeric specifying the number of folds to use for cross-validation. |
nRepeats |
A numeric specifying the number of repeats or permutations to use for cross-validation. |
nCores |
A numeric specifying the number of cores used if the user wants to use parallelisation. |
characteristicsLabel |
A character specifying an additional label for the cross-validation run. |
extraParams |
A list of parameters that will be used to overwrite default settings of transformation, selection, or model-building functions or
parameters which will be passed into the data cleaning function. The names of the list must be one of |
verbose |
Default: 0. A number between 0 and 3 for the amount of progress messages to give. A higher number will produce more messages as more lower-level functions print messages. |
x |
Same as |
outcomeTrain |
For the |
assayIDs |
A character vector for assays to train with. Special value |
object |
A fitted model or a list of such models. |
newData |
For the |
classifier can be a keyword for any of the implemented approaches as shown by available().
selectionMethod can be a keyword for any of the implemented approaches as shown by available("selectionMethod").
multiViewMethod can be a keyword for any of the implemented approaches as shown by available("multiViewMethod").
An object of class ClassifyResult
data(asthma)

# Compare randomForest and SVM classifiers.
result <- crossValidate(measurements, classes, classifier = c("randomForest", "SVM"))
performancePlot(result)

# Compare performance of different assays.
# First make a toy example assay with multiple data types. We'll randomly assign different features to be clinical, gene or protein.
# set.seed(51773)
# measurements <- DataFrame(measurements, check.names = FALSE)
# mcols(measurements)$assay <- c(rep("clinical",20),sample(c("gene", "protein"), ncol(measurements)-20, replace = TRUE))
# mcols(measurements)$feature <- colnames(measurements)

# We'll use different nFeatures for each assay. We'll also use repeated cross-validation with 5 repeats for speed in the example.
# set.seed(51773)
#result <- crossValidate(measurements, classes, nFeatures = c(clinical = 5, gene = 20, protein = 30), classifier = "randomForest", nRepeats = 5)
# performancePlot(result)

# Merge different assays. But we will only do this for two combinations. If assayCombinations is not specified it would attempt all combinations.
# set.seed(51773)
# resultMerge <- crossValidate(measurements, classes, assayCombinations = list(c("clinical", "protein"), c("clinical", "gene")), multiViewMethod = "merge", nRepeats = 5)
# performancePlot(resultMerge)
# performancePlot(c(result, resultMerge))
Collects and checks necessary parameters required for cross-validation by runTests.
CrossValParams(
  samplesSplits = c("Permute k-Fold", "Permute Percentage Split", "Leave-k-Out", "k-Fold"),
  permutations = 100,
  percentTest = 25,
  folds = 5,
  leave = 2,
  tuneMode = c("Resubstitution", "Nested CV", "none"),
  adaptiveResamplingDelta = NULL,
  parallelParams = bpparam()
)
samplesSplits |
Default: "Permute k-Fold". A character value specifying what kind of sample splitting to do. |
permutations |
Default: 100. Number of times to permute the
data set before it is split into training and test sets. Only relevant if
|
percentTest |
The percentage of the data
set to assign to the test set, with the remainder of the samples belonging
to the training set. Only relevant if |
folds |
The number of approximately equal-sized folds to partition
the samples into. Only relevant if |
leave |
The number of samples to leave out of the training set. All possible
combinations of this many samples are generated and each is used as a test set. Only relevant if |
tuneMode |
Default: Resubstitution. The scheme to use for selecting any tuning parameters. |
adaptiveResamplingDelta |
Default: |
parallelParams |
An instance of |
Dario Strbenac
CrossValParams() # Default is 100 permutations and 5 folds of each.
snow <- SnowParam(workers = 2, RNGseed = 999)
CrossValParams("Leave-k-Out", leave = 2, parallelParams = snow)
# Fully reproducible Leave-2-out cross-validation on 2 cores,
# even if feature selection or classifier use random sampling.
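How many test sets each splitting scheme implies can be worked out in base R. The sketch below is an illustration under stated assumptions, not ClassifyR's own partitioning code: the sample count is invented, and the rounding used for the percentage split is a guess at a reasonable convention.

```r
nSamples <- 20

# Leave-k-Out: every combination of `leave` samples is used once as a test set.
leave <- 2
testSets <- combn(nSamples, leave)   # one column per test set
nTestSets <- ncol(testSets)          # equals choose(nSamples, leave)

# k-Fold: samples are partitioned into approximately equal-sized folds.
folds <- 5
set.seed(1)
foldOfSample <- sample(rep(seq_len(folds), length.out = nSamples))
foldSizes <- table(foldOfSample)

# Permute Percentage Split: a fixed percentage of samples forms the test set.
# Rounding to the nearest whole sample is an assumption made for this sketch.
percentTest <- 25
nTest <- round(nSamples * percentTest / 100)
```

With 20 samples, leave-2-out already needs 190 model fits, which is why the number of combinations is worth checking with choose() before using this scheme on a larger data set.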
There are two modes. For aggregating feature selection results, the function counts the number of times each feature was selected in all cross-validations. For aggregating predictive results, the accuracy or C-index for each sample is visualised. This is useful in identifying samples that are difficult to predict well.
result |
An object of class |
... |
Further parameters, such as |
dataType |
Default: |
plotType |
Whether to draw a probability density curve or a histogram. |
summaryType |
If feature selection, whether to summarise as a proportion or count. |
plot |
Whether to draw a plot of the frequency of selection or error rate. |
xMax |
Maximum data value to show in plot. |
fontSizes |
A vector of length 3. The first number is the size of the title. The second number is the size of the axes titles. The third number is the size of the axes values. |
ordering |
Default: |
If dataType is "features", a vector as long as the number of features that were chosen at least once containing the number of times the feature was chosen in cross validations or the proportion of times chosen. If dataType is "samples", a vector as long as the number of samples, containing the cross-validation error rate of the sample. If plot is TRUE, then a plot is also made on the current graphics device.
Dario Strbenac
#if(require(sparsediscrim))
#{
  data(asthma)
  result <- crossValidate(measurements, classes, nRepeats = 5)
  featureDistribution <- distribution(result, "features", summaryType = "count",
                                      plotType = "histogram", binwidth = 1)
  print(head(featureDistribution))
#}
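The feature-selection mode amounts to tallying how often each feature appears across cross-validation iterations. A minimal base R sketch of that tally, using a hypothetical list of chosen features in place of a real ClassifyResult, might look like this:

```r
# Hypothetical features chosen in four cross-validation iterations.
chosen <- list(
  c("GeneA", "GeneB", "GeneC"),
  c("GeneA", "GeneC"),
  c("GeneA", "GeneD"),
  c("GeneB", "GeneA")
)

# Count: number of iterations in which each feature was selected.
selectionCount <- table(unlist(chosen))

# Proportion: the count divided by the number of iterations.
selectionProportion <- selectionCount / length(chosen)
```

Only features chosen at least once appear in the result, which matches the description of the returned vector above.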
Interactions between pairs of features (typically a protein-protein interaction, commonly abbreviated as PPI, database) are restructured into a named list. The name of each element of the list is a feature and the element contains all features which have an interaction with it.
edgesToHubNetworks(edges, minCardinality = 5)
edges |
A two-column |
minCardinality |
An integer specifying the minimum number of features to be associated with a hub feature for it to be present in the result. |
An object of type FeatureSetCollection
.
Dario Strbenac
VAN: an R package for identifying biologically perturbed networks via differential variability analysis, Vivek Jayaswal, Sarah-Jane Schramm, Graham J Mann, Marc R Wilkins and Yee Hwa Yang, 2013, BMC Research Notes, Volume 6 Article 430, https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-6-430.
interactor <- c("MITF", "MITF", "MITF", "MITF", "MITF", "MITF",
                "KRAS", "KRAS", "KRAS", "KRAS", "KRAS", "KRAS",
                "PD-1")
otherInteractor <- c("HINT1", "LEF1", "PSMD14", "PIAS3", "UBE2I", "PATZ1",
                     "ARAF", "CALM1", "CALM2", "CALM3", "RAF1", "HNRNPC",
                     "PD-L1")
edges <- data.frame(interactor, otherInteractor, stringsAsFactors = FALSE)
edgesToHubNetworks(edges, minCardinality = 4)
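The restructuring described above can be approximated in base R. The sketch below is not the package's implementation: it assumes that interactions are undirected, so each edge is listed in both directions before grouping, and it returns a plain named list rather than a FeatureSetCollection.

```r
interactor <- c("MITF", "MITF", "MITF", "KRAS", "KRAS", "PD-1")
otherInteractor <- c("HINT1", "LEF1", "PIAS3", "ARAF", "RAF1", "PD-L1")
edges <- data.frame(interactor, otherInteractor, stringsAsFactors = FALSE)

# Treat each interaction as undirected: list every edge in both directions.
bothWays <- data.frame(
  hub     = c(edges$interactor, edges$otherInteractor),
  partner = c(edges$otherInteractor, edges$interactor),
  stringsAsFactors = FALSE
)

# Named list: each element holds all partners of the feature that names it.
hubs <- split(bothWays$partner, bothWays$hub)

# Keep only hubs with at least minCardinality partners.
minCardinality <- 2
hubs <- hubs[lengths(hubs) >= minCardinality]
```

With this toy input only KRAS and MITF survive the filter, because every other feature has a single interaction partner.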
This container is the required storage format for a collection of sets. Typically, the elements of a set will either be a set of proteins (i.e. character vector) which perform a particular biological process or a set of binary interactions (i.e. two-column matrix of feature identifiers).
FeatureSetCollection(sets)
sets
A named list. The names of the list describe the sets and the elements of the list specify the features which comprise the sets.
featureSets is a FeatureSetCollection object.
show(featureSets): Prints a short summary of what featureSets contains.
length(featureSets): Prints how many sets of features there are.
The FeatureSetCollection may be subsetted to a smaller set of elements or a single set may be extracted as a vector.
featureSets is a FeatureSetCollection object.
featureSets[i:j]: Reduces the object to a subset of the feature sets between elements i and j of the collection.
featureSets[[i]]: Extract the feature set identified by i. i may be a numeric index or the character name of a feature set.
Dario Strbenac
ontology <- list(c("SESN1", "PRDX1", "PRDX2", "PRDX3", "PRDX4", "PRDX5", "PRDX6", "LRRK2", "PARK7"),
                 c("ATP7A", "CCS", "NQO1", "PARK7", "SOD1", "SOD2", "SOD3", "SZT2", "TNF"),
                 c("AARS", "AIMP2", "CARS", "GARS", "KARS", "NARS", "NARS2", "LARS2", "NARS", "NARS2", "RGN", "UBA7"),
                 c("CRY1", "CRY2", "ONP1SW", "OPN4", "RGR"),
                 c("ESRRG", "RARA", "RARB", "RARG", "RXRA", "RXRB", "RXRG"),
                 c("CD36", "CD47", "F2", "SDC4"),
                 c("BUD31", "PARK7", "RWDD1", "TAF1")
                 )
names(ontology) <- c("Peroxiredoxin Activity", "Superoxide Dismutase Activity", "Ligase Activity",
                     "Photoreceptor Activity", "Retinoic Acid Receptor Activity",
                     "Thrombospondin Receptor Activity", "Regulation of Androgen Receptor Activity")
featureSets <- FeatureSetCollection(ontology)
featureSets
featureSets[3:5]
featureSets[["Photoreceptor Activity"]]

subNetworks <- list(MAPK = matrix(c("NRAS", "NRAS", "NRAS", "BRAF", "MEK",
                                    "ARAF", "BRAF", "CRAF", "MEK", "ERK"), ncol = 2),
                    P53 = matrix(c("ATM", "ATR", "ATR", "P53",
                                   "CHK2", "CHK1", "P53", "MDM2"), ncol = 2)
                    )
networkSets <- FeatureSetCollection(subNetworks)
networkSets
Represents each feature set by the mean or median measurement of all features belonging to it.
## S4 method for signature 'matrix'
featureSetSummary(
  measurements,
  location = c("median", "mean"),
  featureSets,
  minimumOverlapPercent = 80,
  verbose = 3
)

## S4 method for signature 'DataFrame'
featureSetSummary(
  measurements,
  location = c("median", "mean"),
  featureSets,
  minimumOverlapPercent = 80,
  verbose = 3
)

## S4 method for signature 'MultiAssayExperiment'
featureSetSummary(
  measurements,
  target = NULL,
  location = c("median", "mean"),
  featureSets,
  minimumOverlapPercent = 80,
  verbose = 3
)
measurements |
Either a |
location |
Default: The median. The type of location to summarise a set of features belonging to a feature set by. |
featureSets |
An object of type |
minimumOverlapPercent |
The minimum percentage of overlapping features
between the data set and a feature set defined in |
verbose |
Default: 3. A number between 0 and 3 for the amount of progress messages to give. This function only prints progress messages if the value is 3. |
target |
If the input is a |
This feature transformation method is unusual because the mean or median feature of a feature set for one sample may be different to another sample, whereas most other feature transformation methods do not result in different features being compared between samples during classification.
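The idea of summarising feature sets per sample can be sketched in base R. This is only an illustration under stated assumptions: `summariseSets` is a hypothetical helper, not the package's implementation, and it assumes that the overlap percentage means the share of a set's features that are present as columns of the data.

```r
# Hypothetical helper, not part of ClassifyR: summarise each feature set
# by the per-sample median (or mean) of its member features.
summariseSets <- function(measurements, sets, location = c("median", "mean"),
                          minimumOverlapPercent = 80) {
  location <- match.arg(location)
  summaryFunction <- if (location == "median") median else mean
  # Assumption: overlap is the percentage of a set's features found in the data.
  overlapPercent <- sapply(sets, function(set) 100 * mean(set %in% colnames(measurements)))
  sets <- sets[overlapPercent >= minimumOverlapPercent]
  # One column per retained set, one row per sample.
  sapply(sets, function(set) {
    present <- intersect(set, colnames(measurements))
    apply(measurements[, present, drop = FALSE], 1, summaryFunction)
  })
}

set.seed(1)
genes <- matrix(rnorm(60, 8, 1), nrow = 6,
                dimnames = list(paste("Patient", 1:6), paste("Gene", 1:10)))
sets <- list(Adhesion = c("Gene 1", "Gene 2", "Gene 3"),
             `Cell Cycle` = c("Gene 8", "Gene 9", "Gene 10"),
             Missing = c("Gene 1", "Gene 20", "Gene 30")) # Only a third of its genes are measured.
setSummary <- summariseSets(genes, sets)
dim(setSummary) # Samples by retained feature sets.
```

The number of rows is unchanged and only the feature dimension shrinks. Because a median picks an actual measurement, the feature that represents a set may differ from one sample to the next.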
The same class of variable as the input variable measurements is, with the individual features summarised to feature sets. The number of samples remains unchanged, so only one dimension of measurements is altered.
Dario Strbenac
Network-based biomarkers enhance classical approaches to prognostic gene expression signatures, Rebecca L Barter, Sarah-Jane Schramm, Graham J Mann and Yee Hwa Yang, 2014, BMC Systems Biology, Volume 8 Supplement 4 Article S5, https://bmcsystbiol.biomedcentral.com/articles/10.1186/1752-0509-8-S4-S5.
sets <- list(Adhesion = c("Gene 1", "Gene 2", "Gene 3"),
             `Cell Cycle` = c("Gene 8", "Gene 9", "Gene 10"))
featureSets <- FeatureSetCollection(sets)

# Adhesion genes have a median gene difference between classes.
genesMatrix <- matrix(c(rnorm(5, 9, 0.3), rnorm(5, 7, 0.3), rnorm(5, 8, 0.3),
                        rnorm(5, 6, 0.3), rnorm(10, 7, 0.3), rnorm(70, 5, 0.1)),
                      nrow = 10)
rownames(genesMatrix) <- paste("Patient", 1:10)
colnames(genesMatrix) <- paste("Gene", 1:10)
classes <- factor(rep(c("Poor", "Good"), each = 5)) # But not used for transformation.

featureSetSummary(genesMatrix, featureSets = featureSets)
A collection of 45783 pairs of protein gene symbols, as determined by The Human Reference Protein Interactome Mapping Project. Self-interactions have been removed.
interactors is a Pairs object containing each pair of interacting proteins.
A Reference Map of the Human Binary Protein Interactome, Nature, 2020. Webpage: http://www.interactome-atlas.org/download
This conversion is useful for creating a meta-feature table for classifier training and prediction based on sub-networks that were selected based on their differential correlation between classes.
## S4 method for signature 'matrix'
interactorDifferences(measurements, ...)

## S4 method for signature 'DataFrame'
interactorDifferences(
  measurements,
  featurePairs = NULL,
  absolute = FALSE,
  verbose = 3
)

## S4 method for signature 'MultiAssayExperiment'
interactorDifferences(measurements, useFeatures = "all", ...)
measurements |
Either a |
... |
Variables not used by the |
featurePairs |
An object of type |
absolute |
If TRUE, then the absolute values of the differences are returned. |
verbose |
Default: 3. A number between 0 and 3 for the amount of progress messages to give. This function only prints progress messages if the value is 3. |
useFeatures |
If |
The pairs of features known to interact with each other are specified by featurePairs.
An object of class DataFrame with one column for each interactor pair difference and one row for each sample. Additionally, mcols(resultTable) provides a DataFrame with a column named "original" containing the name of the sub-network each meta-feature belongs to.
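The shape of the result can be illustrated with a small base R sketch. `pairDifferences` is a hypothetical helper and the sign convention (first member minus second member) is an assumption made for illustration; the package may compute the difference in the other direction.

```r
# Hypothetical helper, not ClassifyR's implementation.
# Assumption: each meta-feature is the first member minus the second member of a pair.
pairDifferences <- function(measurements, first, second, absolute = FALSE) {
  differences <- measurements[, first, drop = FALSE] - measurements[, second, drop = FALSE]
  if (absolute) differences <- abs(differences)
  colnames(differences) <- paste(first, second, sep = " - ")
  differences
}

measurements <- matrix(c(5, 6, 7,  # Feature A.
                         3, 3, 3,  # Feature B.
                         9, 8, 7), # Feature C.
                       nrow = 3, dimnames = list(paste("Patient", 1:3), c("A", "B", "C")))
differences <- pairDifferences(measurements, first = c("A", "A"), second = c("B", "C"))
differences # One row per sample, one column per interactor pair.
absoluteDifferences <- pairDifferences(measurements, c("A", "A"), c("B", "C"), absolute = TRUE)
```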
Dario Strbenac
Dynamic modularity in protein interaction networks predicts breast cancer outcome, Ian W Taylor, Rune Linding, David Warde-Farley, Yongmei Liu, Catia Pesquita, Daniel Faria, Shelley Bull, Tony Pawson, Quaid Morris and Jeffrey L Wrana, 2009, Nature Biotechnology, Volume 27 Issue 2, https://www.nature.com/articles/nbt.1522.
pairs <- Pairs(rep(c('A', 'G'), each = 3), c('B', 'C', 'D', 'H', 'I', 'J'))

# Consistent differences for interactors of A.
measurements <- matrix(c(5.7, 10.1, 6.9, 7.7, 8.8, 9.1, 11.2, 6.4, 7.0, 5.5,
                         3.6, 7.6, 4.0, 4.4, 5.8, 6.2, 8.1, 3.7, 4.4, 2.1,
                         8.5, 13.0, 9.9, 10.0, 10.3, 11.9, 13.8, 9.9, 10.7, 8.5,
                         8.1, 10.6, 7.4, 10.7, 10.8, 11.1, 13.3, 9.7, 11.0, 9.1,
                         round(rnorm(60, 8, 0.3), 1)), nrow = 10)
rownames(measurements) <- paste("Patient", 1:10)
colnames(measurements) <- LETTERS[1:10]

interactorDifferences(measurements, pairs)
470 patients with eight features.
clinical: A DataFrame containing clinical data.
Dynamics of Breast Cancer Relapse Reveal Late-recurring ER-positive Genomic Subgroups, Nature, 2019. Webpage: https://www.nature.com/articles/s43018-020-0026-6
Collects and checks necessary parameters required for data modelling. Apart from data transformation that needs to be done within cross-validation (e.g. subtracting each observation from training set mean), feature selection, model training and prediction, this container also stores a setting for class imbalance rebalancing.
ModellingParams(
  balancing = c("downsample", "upsample", "none"),
  transformParams = NULL,
  selectParams = SelectParams("t-test"),
  trainParams = TrainParams("DLDA"),
  predictParams = PredictParams("DLDA"),
  doImportance = FALSE
)
balancing |
Default: |
transformParams |
Parameters used for feature transformation inside of C.V.
specified by a |
selectParams |
Parameters used during feature selection specified
by a |
trainParams |
Parameters for model training specified by a |
predictParams |
Parameters for prediction specified by a |
doImportance |
Default: |
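As a rough illustration of what rebalancing by down-sampling means, the following base R sketch draws the same number of samples from every class. `downsampleIndices` is a hypothetical helper written for this example; the routine used inside the package may differ in its details.

```r
# Hypothetical illustration of down-sampling; ClassifyR's internal routine may differ.
downsampleIndices <- function(classes) {
  smallest <- min(table(classes))
  # Draw as many samples from every class as the smallest class has.
  unlist(lapply(split(seq_along(classes), classes), function(indices) {
    indices[sample.int(length(indices), smallest)]
  }), use.names = FALSE)
}

set.seed(42)
classes <- factor(rep(c("Healthy", "Cancer"), c(30, 10)))
keep <- downsampleIndices(classes)
table(classes[keep]) # Both classes now have the size of the smaller class.
```

Up-sampling would instead sample the smaller classes with replacement until they reach the size of the largest class.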
Dario Strbenac
#if(require(sparsediscrim))
#{
  ModellingParams() # Default is differences in means selection and DLDA.
  ModellingParams(selectParams = NULL, # No feature selection before training.
                  trainParams = TrainParams("randomForest"),
                  predictParams = PredictParams("randomForest"))
#}
Draws a graphical summary of a particular performance measure for a list of classifications.
## S4 method for signature 'ClassifyResult'
performancePlot(results, ...)

## S4 method for signature 'list'
performancePlot(
  results,
  metric = "auto",
  characteristicsList = list(x = "auto"),
  aggregate = character(),
  coloursList = list(),
  alpha = 1,
  orderingList = list(),
  densityStyle = c("box", "violin"),
  yLimits = NULL,
  fontSizes = c(24, 16, 12, 12),
  title = NULL,
  margin = grid::unit(c(1, 1, 1, 1), "lines"),
  rotate90 = FALSE,
  showLegend = TRUE
)
results |
A list of |
... |
Not used by end user. |
metric |
Default: |
characteristicsList |
A named list of characteristics. Each element's
name must be one of |
aggregate |
A character vector of the levels of
|
coloursList |
A named list of plot aspects and colours for the aspects.
No elements are mandatory. If specified, each list element's name must be
either |
alpha |
Default: 1. A number between 0 and 1 specifying the transparency level of any fill. |
orderingList |
An optional named list. Any of the variables specified
to |
densityStyle |
Default: "box". Either |
yLimits |
The minimum and maximum value of the performance metric to plot. |
fontSizes |
A vector of length 4. The first number is the size of the
title. The second number is the size of the axes titles. The third number
is the size of the axes values. The fourth number is the font size of the
titles of grouped plots, if any are produced. In other words, when
|
title |
An overall title for the plot. |
margin |
The margin to have around the plot. |
rotate90 |
Logical. If |
showLegend |
If |
If there are multiple values for a performance measure in a single result object, it is plotted as a box plot or violin plot, depending on densityStyle, unless aggregate is TRUE, in which case all predictions in a single result object are considered simultaneously, so that only one performance number is calculated, and a barchart is plotted.
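The difference between the two situations can be illustrated in base R with plain accuracy as a stand-in metric. The data and variable names below are made up for this sketch and no package functions are used.

```r
# Illustrative data only: two permutations, each predicting all ten samples once.
set.seed(7)
actual <- factor(rep(c("Healthy", "Cancer"), each = 5))
names(actual) <- LETTERS[1:10]
predictions <- data.frame(sample = rep(LETTERS[1:10], 2),
                          permutation = rep(1:2, each = 10),
                          class = sample(c("Healthy", "Cancer"), 20, replace = TRUE),
                          stringsAsFactors = FALSE)
correct <- predictions$class == as.character(actual[predictions$sample])

# Several values per result: one accuracy per permutation, suited to a distribution plot.
perPermutation <- tapply(correct, predictions$permutation, mean)
# Aggregated: all predictions considered at once, giving one number suited to a bar.
pooled <- mean(correct)
```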
An object of class ggplot and a plot on the current graphics device, if plot is TRUE.
Dario Strbenac
predicted <- DataFrame(sample = sample(LETTERS[1:10], 80, replace = TRUE),
                       permutation = rep(1:2, each = 40),
                       class = factor(rep(c("Healthy", "Cancer"), 40)))
actual <- factor(rep(c("Healthy", "Cancer"), each = 5))
result1 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name",
                                                       "Cross-validation"),
                                    value = c("Example", "t-test", "Differential Expression",
                                              "2 Permutations, 2 Folds")),
                          LETTERS[1:10], paste("Gene", 1:100),
                          list(paste("Gene", 1:100), paste("Gene", c(10:1, 11:100)),
                               paste("Gene", 1:100), paste("Gene", 1:100)),
                          list(paste("Gene", 1:3), paste("Gene", c(2, 5, 6)),
                               paste("Gene", 1:4), paste("Gene", 5:8)),
                          list(function(oracle){}), NULL, predicted, actual)
result1 <- calcCVperformance(result1, "Macro F1")

predicted <- DataFrame(sample = sample(LETTERS[1:10], 80, replace = TRUE),
                       permutation = rep(1:2, each = 40),
                       class = factor(rep(c("Healthy", "Cancer"), 40)))
result2 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name",
                                                       "Cross-validation"),
                                    value = c("Example", "Bartlett Test", "Differential Variability",
                                              "2 Permutations, 2 Folds")),
                          LETTERS[1:10], paste("Gene", 1:100),
                          list(paste("Gene", 1:100), paste("Gene", c(10:1, 11:100)),
                               paste("Gene", 1:100), paste("Gene", 1:100)),
                          list(c(1:3), c(4:6), c(1, 6, 7, 9), c(5:8)),
                          list(function(oracle){}), NULL, predicted, actual)
result2 <- calcCVperformance(result2, "Macro F1")

performancePlot(list(result1, result2), metric = "Macro F1", title = "Comparison")
Allows the visualisation of measurements in the data set. If useFeatures is of type Pairs, then a parallel plot is automatically drawn. If it's a single categorical variable, then a bar chart is automatically drawn.
## S4 method for signature 'matrix'
plotFeatureClasses(measurements, ...)

## S4 method for signature 'DataFrame'
plotFeatureClasses(
  measurements,
  classes,
  useFeatures,
  groupBy = NULL,
  groupingName = NULL,
  whichNumericFeaturePlots = c("both", "density", "stripchart"),
  measurementLimits = NULL,
  lineWidth = 1,
  dotBinWidth = 1,
  xAxisLabel = NULL,
  yAxisLabels = c("Density", "Classes"),
  showXtickLabels = TRUE,
  showYtickLabels = TRUE,
  xLabelPositions = "auto",
  yLabelPositions = "auto",
  fontSizes = c(24, 16, 12, 12, 12),
  colours = c("#3F48CC", "#880015"),
  showAssayName = TRUE
)

## S4 method for signature 'MultiAssayExperiment'
plotFeatureClasses(
  measurements,
  useFeatures,
  classesColumn,
  groupBy = NULL,
  groupingName = NULL,
  showAssayName = TRUE,
  ...
)
measurements |
A |
... |
Variables not used by the three top-level methods but passed to the internal method which generates the plot(s). |
classes |
Either a vector of class labels of class |
useFeatures |
If |
groupBy |
If |
groupingName |
A label for the grouping variable to be used in plots. |
whichNumericFeaturePlots |
If the feature is a single feature and has
numeric measurements, this option specifies which types of plot(s) to draw.
The default value is |
measurementLimits |
The minimum and maximum expression values to plot.
Default: |
lineWidth |
Numeric value that alters the line thickness for density plots. Default: 1. |
dotBinWidth |
Numeric value that alters the diameter of dots in the strip chart. Default: 1. |
xAxisLabel |
The axis label for the plot's horizontal axis. Default:
|
yAxisLabels |
A character vector of length 1 or 2. If the feature's
measurements are numeric an |
showXtickLabels |
Logical. Default: |
showYtickLabels |
Logical. Default: |
xLabelPositions |
Either |
yLabelPositions |
Either |
fontSizes |
A vector of length 5. The first number is the size of the title. The second number is the size of the axes titles. The third number is the size of the axes values. The fourth number is the size of the legends' titles. The fifth number is the font size of the legend labels. |
colours |
The colours to plot data of each class in. The length of this vector must be as long as the distinct number of classes in the data set. |
showAssayName |
Logical. Default: |
classesColumn |
If |
Plots are created on the current graphics device and a list of plot objects is invisibly returned. The classes of the plot object are determined based on the type of data plotted and the number of plots per feature generated. If the plotted variable is discrete or if the variable is numeric and one plot type was specified, the list element is an object of class ggplot. Otherwise, if the variable is numeric and both the density and stripchart plot types were made, the list element is an object of class TableGrob.
Setting lineWidth and dotBinWidth to the same value doesn't result in the density plot and the strip chart having elements of the same size. Some manual experimentation is required to get similarly sized plot elements.
Dario Strbenac
# First 25 samples and first 5 genes are mixtures of two normals. Last 25 samples are
# one normal.
genesMatrix <- sapply(1:15, function(geneColumn) c(rnorm(5, 5, 1)))
genesMatrix <- cbind(genesMatrix, sapply(1:10, function(geneColumn) c(rnorm(5, 15, 1))))
genesMatrix <- cbind(genesMatrix, sapply(1:25, function(geneColumn) c(rnorm(5, 9, 2))))
genesMatrix <- rbind(genesMatrix, sapply(1:50, function(geneColumn) rnorm(95, 9, 3)))
genesMatrix <- t(genesMatrix)
rownames(genesMatrix) <- paste("Sample", 1:50)
colnames(genesMatrix) <- paste("Gene", 1:100)
classes <- factor(rep(c("Poor", "Good"), each = 25), levels = c("Good", "Poor"))
plotFeatureClasses(genesMatrix, classes, useFeatures = "Gene 4",
                   xAxisLabel = bquote(log[2]*'(expression)'), dotBinWidth = 0.5)

infectionResults <- c(rep(c("No", "Yes"), c(20, 5)), rep(c("No", "Yes"), c(5, 20)))
genders <- factor(rep(c("Male", "Female"), each = 10, length.out = 50))
clinicalData <- DataFrame(Gender = genders, Sugar = runif(50, 4, 10),
                          Infection = factor(infectionResults, levels = c("No", "Yes")),
                          row.names = rownames(genesMatrix))
plotFeatureClasses(clinicalData, classes, useFeatures = "Infection")
plotFeatureClasses(clinicalData, classes, useFeatures = "Infection", groupBy = "Gender")

genesMatrix <- t(genesMatrix) # MultiAssayExperiment needs features in rows.
dataContainer <- MultiAssayExperiment(list(RNA = genesMatrix),
                                      colData = cbind(clinicalData, class = classes))
targetFeatures <- DataFrame(assay = "RNA", feature = "Gene 50")
plotFeatureClasses(dataContainer, useFeatures = targetFeatures, classesColumn = "class",
                   groupBy = c("clinical", "Gender"), # Table name, feature name.
                   xAxisLabel = bquote(log[2]*'(expression)'), dotBinWidth = 0.5)
Precision pathways allows the evaluation of various permutations of multiomics or multiview data. Samples are predicted by a particular assay if they were consistently predicted as a particular class during cross-validation. Otherwise, they are passed onto subsequent assays/tiers for prediction. Balanced accuracy is used to evaluate overall prediction performance and sample-specific accuracy for individual-level evaluation.
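The tiering idea can be sketched in base R under clearly stated assumptions. `assignTiers` and `balancedAccuracy` are hypothetical helpers, the consistency measure (share of repeats agreeing with the majority class) is an assumption rather than the package's confidence definition, and letting the final assay predict all remaining samples is also an assumption.

```r
# Hypothetical sketch of tiered prediction; not the ClassifyR implementation.
# Each matrix holds predicted classes: one row per sample, one column per repeat.
assignTiers <- function(predictionsByAssay, consistencyCutoff = 0.9) {
  samples <- rownames(predictionsByAssay[[1]])
  tier <- setNames(rep(NA_character_, length(samples)), samples)
  predicted <- tier
  lastAssay <- names(predictionsByAssay)[length(predictionsByAssay)]
  for (assay in names(predictionsByAssay)) {
    remaining <- names(tier)[is.na(tier)]
    for (sampleID in remaining) {
      counts <- table(predictionsByAssay[[assay]][sampleID, ])
      # Assumed consistency: share of repeats agreeing with the majority class.
      # Assumed rule: the last assay predicts every sample that is still unassigned.
      if (max(counts) / sum(counts) >= consistencyCutoff || assay == lastAssay) {
        tier[sampleID] <- assay
        predicted[sampleID] <- names(counts)[which.max(counts)]
      }
    }
  }
  data.frame(sample = samples, tier = unname(tier), predicted = unname(predicted),
             stringsAsFactors = FALSE)
}

# Balanced accuracy: the mean of the per-class recalls.
balancedAccuracy <- function(actual, predicted) {
  mean(sapply(unique(actual), function(cls) mean(predicted[actual == cls] == cls)))
}

patients <- paste("Patient", 1:4)
clinical <- matrix(c("Yes", "Yes", "Yes", "Yes",
                     "No",  "No",  "No",  "No",
                     "Yes", "No",  "Yes", "No",
                     "No",  "Yes", "Yes", "No"),
                   nrow = 4, byrow = TRUE, dimnames = list(patients, NULL))
RNA <- matrix(c("Yes", "Yes", "Yes", "Yes",
                "No",  "No",  "No",  "No",
                "Yes", "Yes", "Yes", "Yes",
                "No",  "No",  "No",  "Yes"),
              nrow = 4, byrow = TRUE, dimnames = list(patients, NULL))
pathway <- assignTiers(list(clinical = clinical, RNA = RNA))
overallAccuracy <- balancedAccuracy(c("Yes", "No", "Yes", "No"), pathway$predicted)
```

In this toy pathway the first two patients are predicted consistently by the clinical tier, while the other two are passed on to the RNA tier.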
## S4 method for signature 'MultiAssayExperimentOrList'
precisionPathwaysTrain(
  measurements,
  class,
  useFeatures = NULL,
  maxMissingProp = 0,
  topNvariance = NULL,
  fixedAssays = "clinical",
  confidenceCutoff = 0.8,
  minAssaySamples = 10,
  nFeatures = 20,
  selectionMethod = setNames(c("none", rep("t-test", length(measurements))), c("clinical", names(measurements))),
  classifier = setNames(c("elasticNetGLM", rep("randomForest", length(measurements))), c("clinical", names(measurements))),
  nFolds = 5,
  nRepeats = 20,
  nCores = 1
)

## S4 method for signature 'PrecisionPathways,MultiAssayExperimentOrList'
precisionPathwaysPredict(pathways, measurements, class)
measurements |
Either a |
class |
If a |
useFeatures |
Default: |
maxMissingProp |
Default: 0.0. A proportion less than 1 which is the maximum tolerated proportion of missingness for a feature to be retained for modelling. |
topNvariance |
Default: NULL. An integer number of most variable features per assay to subset to. Assays with fewer features won't be reduced in size. |
fixedAssays |
A character vector of assay names specifying any assays which must be at the beginning of the pathway. |
confidenceCutoff |
The minimum confidence of predictions for a sample to be predicted by a particular assay. If a sample was predicted to belong to a particular class a proportion |
minAssaySamples |
An integer specifying the minimum number of samples a tier may have. If a subsequent tier would have fewer than this number of samples, the samples are incorporated into the current tier. |
nFeatures |
Default: 20. The number of features to consider during feature selection, if feature selection is done. |
selectionMethod |
A named character vector of feature selection methods to use for the assays, one for each. The names must correspond to names of |
classifier |
A named character vector of modelling methods to use for the assays, one for each. The names must correspond to names of |
nFolds |
A numeric specifying the number of folds to use for cross-validation. |
nRepeats |
A numeric specifying the number of repeats or permutations to use for cross-validation. |
nCores |
A numeric specifying the number of cores used if the user wants to use parallelisation. |
pathways |
A set of pathways created by |
An object of class PrecisionPathways which is basically a named list that other plotting and tabulating functions can use.
# To be determined.
Collects the function to be used for making predictions and any associated parameters.
The function specified must return either a factor vector of class predictions, or a numeric vector of scores for the second class, according to the levels of the class vector of the input data set, or a data frame which has two columns named class and score.
PredictParams(predictor, characteristics = DataFrame(), intermediate = character(0), ...)
Creates a PredictParams object which stores the function which will do the class prediction, if required, and parameters that the function will use. If the training function also makes predictions, this must be set to NULL.
predictor
A character keyword referring to a registered classifier. See available for valid keywords.
characteristics
A DataFrame describing the characteristics of the predictor function used. First column must be named "characteristic" and second column must be named "value".
intermediate
Character vector. Names of any variables created in prior stages in runTest that need to be passed to the prediction function.
...
Other arguments that predictor may use.
predictParams is a PredictParams object.
show(predictParams): Prints a short summary of what predictParams contains.
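The return format expected of a prediction function (a data frame with columns named class and score, where the score refers to the second class level) can be illustrated with a small base R sketch. `predictLogistic` is a hypothetical function built on `glm` for this example only; it is not a classifier registered with the package.

```r
# Hypothetical predictor for a logistic regression fitted with glm();
# an illustration of the return format only, not a registered ClassifyR classifier.
predictLogistic <- function(model, newData) {
  # For a factor response, glm() models the probability of the second level.
  scores <- predict(model, newdata = newData, type = "response")
  classLevels <- levels(model$model[[1]])
  data.frame(class = factor(ifelse(scores > 0.5, classLevels[2], classLevels[1]),
                            levels = classLevels),
             score = unname(scores))
}

set.seed(3)
training <- data.frame(class = factor(rep(c("Good", "Poor"), each = 20)),
                       gene = c(rnorm(20, 5), rnorm(20, 7)))
model <- glm(class ~ gene, data = training, family = binomial)
predictions <- predictLogistic(model, data.frame(gene = c(4, 8)))
predictions # Columns named class and score; score is for the level "Poor".
```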
Dario Strbenac
# For prediction by trained object created by DLDA training function.
predictParams <- PredictParams("DLDA")
Input data could be of matrix, MultiAssayExperiment, or DataFrame format and this function will prepare a DataFrame of features and a vector of outcomes and help to exclude nuisance features such as dates or unique sample identifiers from subsequent modelling.
## S4 method for signature 'matrix'
prepareData(measurements, outcome, ...)

## S4 method for signature 'data.frame'
prepareData(measurements, outcome, ...)

## S4 method for signature 'DataFrame'
prepareData(
  measurements,
  outcome,
  useFeatures = NULL,
  maxMissingProp = 0,
  topNvariance = NULL
)

## S4 method for signature 'MultiAssayExperiment'
prepareData(measurements, outcomeColumns = NULL, useFeatures = NULL, ...)

## S4 method for signature 'list'
prepareData(measurements, outcome = NULL, useFeatures = NULL, ...)
measurements |
Either a |
... |
Variables not used by the |
outcome |
Either a factor vector of classes, a |
useFeatures |
Default: |
maxMissingProp |
Default: 0.0. A proportion less than 1 which is the maximum tolerated proportion of missingness for a feature to be retained for modelling. |
topNvariance |
Default: NULL. If |
outcomeColumns |
If |
A list of length two. The first element is a DataFrame of features and the second element is the outcomes to use for modelling.
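The two filtering options, maxMissingProp and topNvariance, can be pictured with a base R sketch. `filterFeatures` is a hypothetical helper for a numeric matrix with samples in rows; it only illustrates the idea and is not the code the package runs.

```r
# Hypothetical base R illustration of the two filtering steps; not ClassifyR's code.
filterFeatures <- function(measurements, maxMissingProp = 0, topNvariance = NULL) {
  # Drop features whose proportion of missing values exceeds the tolerance.
  missingProp <- colMeans(is.na(measurements))
  measurements <- measurements[, missingProp <= maxMissingProp, drop = FALSE]
  # Optionally keep only the most variable features, most variable first.
  if (!is.null(topNvariance) && ncol(measurements) > topNvariance) {
    variances <- apply(measurements, 2, var, na.rm = TRUE)
    mostVariable <- order(variances, decreasing = TRUE)[seq_len(topNvariance)]
    measurements <- measurements[, mostVariable, drop = FALSE]
  }
  measurements
}

set.seed(5)
measurements <- matrix(rnorm(40), nrow = 10, dimnames = list(NULL, paste("Gene", 1:4)))
measurements[, "Gene 2"] <- measurements[, "Gene 2"] * 10 # Highly variable.
measurements[1:3, "Gene 4"] <- NA                        # 30% missing.
filtered <- filterFeatures(measurements, maxMissingProp = 0.2, topNvariance = 2)
colnames(filtered)
```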
Dario Strbenac
Pair-wise overlaps can be done for two types of analyses. Firstly, each cross-validation iteration can be considered within a single classification. This explores the feature ranking stability. Secondly, the overlap may be considered between different classification results. This approach compares the feature ranking commonality between different results. Two types of commonality are possible to analyse. One summary is the average pair-wise overlap between all possible pairs of results. The second kind of summary is the pair-wise overlap of each level of the comparison factor that is not the reference level against the reference level. The overlaps are converted to percentages and plotted as lineplots.
## S4 method for signature 'ClassifyResult'
rankingPlot(results, ...)

## S4 method for signature 'list'
rankingPlot(
  results,
  topRanked = seq(10, 100, 10),
  comparison = "within",
  referenceLevel = NULL,
  characteristicsList = list(),
  orderingList = list(),
  sizesList = list(lineWidth = 1, pointSize = 2, legendLinesPointsSize = 1, fonts = c(24, 16, 12, 12, 12, 16)),
  lineColours = NULL,
  xLabelPositions = seq(10, 100, 10),
  yMax = 100,
  title = if (comparison[1] == "within") "Feature Ranking Stability" else "Feature Ranking Commonality",
  yLabel = if (is.null(referenceLevel)) "Average Common Features (%)" else paste("Average Common Features with", referenceLevel, "(%)"),
  margin = grid::unit(c(1, 1, 1, 1), "lines"),
  showLegend = TRUE,
  parallelParams = bpparam()
)
results |
A list of |
... |
Not used by end user. |
topRanked |
A sequence of thresholds of number of the best features to use for overlapping. |
comparison |
Default: |
referenceLevel |
The level of the comparison factor to use as the
reference to compare each non-reference level to. If |
characteristicsList |
A named list of characteristics. The name must be
one of |
orderingList |
An optional named list. Any of the variables specified
to |
sizesList |
Default: |
lineColours |
A vector of colours for different levels of the line
colouring parameter, if one is specified by
|
xLabelPositions |
Locations where to put labels on the x-axis. |
yMax |
The maximum value of the percentage to plot. |
title |
An overall title for the plot. |
yLabel |
Label to be used for the y-axis of overlap percentages. |
margin |
The margin to have around the plot. |
showLegend |
If |
parallelParams |
An object of class |
If comparison is "within", then the feature selection overlaps are compared within a particular analysis. The result will inform how stable the selections are between different iterations of cross-validation for a particular analysis. Otherwise, the comparison is between different cross-validation runs, and this gives an indication of how common the selected features are across different classifications.
Calculating all pair-wise set overlaps for a large cross-validation result can be time-consuming. This stage can be done on multiple CPUs by providing the relevant options to parallelParams.
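As a rough illustration of the quantity on the y-axis, the overlap of two rankings at each topRanked threshold can be computed in base R. This is only a sketch, assuming that the percentage is the number of features shared by the top k of each ranking divided by k. The helper name topOverlapPercent is hypothetical and is not part of ClassifyR.

```r
# Hypothetical helper (not part of ClassifyR): percentage of features shared
# by the top-k entries of two rankings, for each threshold k.
topOverlapPercent <- function(rankingA, rankingB, topRanked = seq(10, 100, 10))
{
  sapply(topRanked, function(k)
  {
    common <- intersect(head(rankingA, k), head(rankingB, k))
    length(common) / k * 100
  })
}

features <- paste0("F", 1:100)
ranking1 <- features
# Second ranking: features ranked 6 to 15 are reversed and moved to the front.
ranking2 <- features[c(15:6, 1:5, 16:100)]

thresholds <- c(5, 10, 20)
overlaps <- topOverlapPercent(ranking1, ranking2, topRanked = thresholds)
names(overlaps) <- thresholds
overlaps
```

Averaging such percentages over all pairs of cross-validation iterations would give one point per threshold, which is the kind of line the plot draws.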
An object of class ggplot and a plot on the current graphics device, if plot is TRUE.
Dario Strbenac
predicted <- DataFrame(sample = sample(10, 100, replace = TRUE),
                       permutation = rep(1:2, each = 50),
                       class = rep(c("Healthy", "Cancer"), each = 50))
actual <- factor(rep(c("Healthy", "Cancer"), each = 5))
allFeatures <- sapply(1:100, function(index) paste(sample(LETTERS, 3), collapse = ''))
rankList <- list(allFeatures[1:100], allFeatures[c(15:6, 1:5, 16:100)],
                 allFeatures[c(1:9, 11, 10, 12:100)], allFeatures[c(1:50, 61:100, 60:51)])
result1 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validation"),
                                    value = c("Melanoma", "t-test", "Diagonal LDA", "2 Permutations, 2 Folds")),
                          LETTERS[1:10], allFeatures, rankList,
                          list(rankList[[1]][1:15], rankList[[2]][1:15],
                               rankList[[3]][1:10], rankList[[4]][1:10]),
                          list(function(oracle){}), NULL, predicted, actual)

predicted[, "class"] <- sample(predicted[, "class"])
rankList <- list(allFeatures[1:100], allFeatures[c(sample(20), 21:100)],
                 allFeatures[c(1:9, 11, 10, 12:100)], allFeatures[c(1:50, 60:51, 61:100)])
result2 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validations"),
                                    value = c("Melanoma", "t-test", "Random Forest", "2 Permutations, 2 Folds")),
                          LETTERS[1:10], allFeatures, rankList,
                          list(rankList[[1]][1:15], rankList[[2]][1:15],
                               rankList[[3]][1:10], rankList[[4]][1:10]),
                          list(function(oracle){}), NULL, predicted, actual)

rankingPlot(list(result1, result2), characteristicsList = list(pointType = "Classifier Name"))
Creates one ROC plot or multiple ROC plots for a list of ClassifyResult objects. One plot is created if the data set has two classes and multiple plots are created if the data set has three or more classes.
## S4 method for signature 'ClassifyResult'
ROCplot(results, ...)

## S4 method for signature 'list'
ROCplot(
  results,
  mode = c("merge", "average"),
  interval = 95,
  comparison = "auto",
  lineColours = "auto",
  lineWidth = 1,
  fontSizes = c(24, 16, 12, 12, 12),
  labelPositions = seq(0, 1, 0.2),
  plotTitle = "ROC",
  legendTitle = NULL,
  xLabel = "False Positive Rate",
  yLabel = "True Positive Rate",
  showAUC = TRUE
)
results |
A list of |
... |
Parameters not used by the |
mode |
Default: |
interval |
Default: 95 (percent). The percent confidence interval to
draw around the averaged ROC curve, if mode is |
comparison |
Default: |
lineColours |
Default: |
lineWidth |
A single number controlling the thickness of lines drawn. |
fontSizes |
A vector of length 5. The first number is the size of the title. The second number is the size of the axes titles and AUC text, if it is not part of the legend. The third number is the size of the axes values. The fourth number is the size of the legends' titles. The fifth number is the font size of the legend labels. |
labelPositions |
Default: 0.0, 0.2, 0.4, 0.6, 0.8, 1.0. Locations where to put labels on the x and y axes. |
plotTitle |
An overall title for the plot. |
legendTitle |
A default name is used if the value is |
xLabel |
Label to be used for the x-axis of false positive rate. |
yLabel |
Label to be used for the y-axis of true positive rate. |
showAUC |
Logical. If |
The scores stored in the results should be higher if the sample is more likely to be from the class which the score is associated with. The score for each class must be in a column which has a column name equal to the class name.
For cross-validated classification, all predictions from all iterations are considered simultaneously, to calculate one curve per classification.
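To make the merged curve concrete, here is a minimal base R sketch of how one ROC curve and its area can be derived from scores pooled over all iterations. The scores and class labels are made up for illustration, and the threshold sweep and trapezoidal area are a generic textbook construction, not necessarily the calculation ROCplot performs internally.

```r
# Made-up "Cancer" scores pooled over two folds; higher means more likely Cancer.
scores <- c(0.11, 0.32, 0.87, 0.80, 0.55, 0.44, 0.67, 0.20)
actual <- factor(c("Healthy", "Healthy", "Cancer", "Cancer",
                   "Healthy", "Healthy", "Cancer", "Cancer"),
                 levels = c("Healthy", "Cancer"))

# Sweep thresholds from strictest to most lenient.
# A sample is called Cancer if its score is at least the threshold.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
rocPoints <- t(sapply(thresholds, function(threshold)
{
  calledCancer <- scores >= threshold
  c(FPR = sum(calledCancer & actual == "Healthy") / sum(actual == "Healthy"),
    TPR = sum(calledCancer & actual == "Cancer") / sum(actual == "Cancer"))
}))

# Area under the curve by the trapezoidal rule.
AUC <- sum(diff(rocPoints[, "FPR"]) *
           (head(rocPoints[, "TPR"], -1) + tail(rocPoints[, "TPR"], -1)) / 2)
AUC
```

Because every prediction from every fold enters the same threshold sweep, the result is a single curve for the whole classification rather than one curve per fold.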
An object of class ggplot and a plot on the current graphics device, if plot is TRUE.
Dario Strbenac
predicted <- do.call(rbind, list(DataFrame(data.frame(sample = LETTERS[seq(1, 20, 2)],
                                 Healthy = c(0.89, 0.68, 0.53, 0.76, 0.13, 0.20, 0.60, 0.25, 0.10, 0.30),
                                 Cancer = c(0.11, 0.32, 0.47, 0.24, 0.87, 0.80, 0.40, 0.75, 0.90, 0.70),
                                 fold = 1)),
                                 DataFrame(sample = LETTERS[seq(2, 20, 2)],
                                 Healthy = c(0.45, 0.56, 0.33, 0.56, 0.65, 0.33, 0.20, 0.60, 0.40, 0.80),
                                 Cancer = c(0.55, 0.44, 0.67, 0.44, 0.35, 0.67, 0.80, 0.40, 0.60, 0.20),
                                 fold = 2)))
actual <- factor(c(rep("Healthy", 10), rep("Cancer", 10)), levels = c("Healthy", "Cancer"))
result1 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validation"),
                                    value = c("Melanoma", "t-test", "Random Forest", "2-fold")),
                          LETTERS[1:20], paste("Gene", LETTERS[1:10]),
                          list(paste("Gene", LETTERS[1:10]), paste("Gene", LETTERS[c(5:1, 6:10)])),
                          list(paste("Gene", LETTERS[1:3]), paste("Gene", LETTERS[1:5])),
                          list(function(oracle){}), NULL, predicted, actual)

predicted[c(2, 6), "Healthy"] <- c(0.40, 0.60)
predicted[c(2, 6), "Cancer"] <- c(0.60, 0.40)
result2 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validation"),
                                    value = c("Melanoma", "Bartlett Test", "Differential Variability", "2-fold")),
                          LETTERS[1:20], paste("Gene", LETTERS[1:10]),
                          list(paste("Gene", LETTERS[1:10]), paste("Gene", LETTERS[c(5:1, 6:10)])),
                          list(paste("Gene", LETTERS[1:3]), paste("Gene", LETTERS[1:5])),
                          list(function(oracle){}), NULL, predicted, actual)

ROCplot(list(result1, result2), plotTitle = "Cancer ROC")
For a data set of features and samples, the classification process is run. It consists of data transformation, feature selection, classifier training and testing.
## S4 method for signature 'matrix'
runTest(measurementsTrain, outcomeTrain, measurementsTest, outcomeTest, ...)

## S4 method for signature 'DataFrame'
runTest(
  measurementsTrain,
  outcomeTrain,
  measurementsTest,
  outcomeTest,
  crossValParams = CrossValParams(),
  modellingParams = ModellingParams(),
  characteristics = S4Vectors::DataFrame(),
  ...,
  verbose = 1,
  .iteration = NULL
)

## S4 method for signature 'MultiAssayExperiment'
runTest(measurementsTrain, measurementsTest, outcomeColumns, ...)
measurementsTrain |
Either a |
... |
Variables not used by the |
outcomeTrain |
Either a factor vector of classes, a |
measurementsTest |
Same data type as |
outcomeTest |
Same data type as |
crossValParams |
An object of class |
modellingParams |
An object of class |
characteristics |
A |
verbose |
Default: 1. A number between 0 and 3 for the amount of progress messages to give. A higher number will produce more messages as more lower-level functions print messages. |
.iteration |
Not to be set by a user. This value is used to keep track
of the cross-validation iteration, if called by |
outcomeColumns |
If |
This function only performs one classification and prediction. See runTests for a driver function that enables a number of different cross-validation schemes to be applied and uses this function to perform each iteration.
If called directly by the user rather than being used internally by runTests, a ClassifyResult object. Otherwise, a list of different aspects of the result, which is passed back to runTests.
Dario Strbenac
#if(require(sparsediscrim))
#{
  data(asthma)
  tuneList <- list(nFeatures = seq(5, 25, 5), performanceType = "Balanced Error")
  selectParams <- SelectParams("limma", tuneParams = tuneList)
  modellingParams <- ModellingParams(selectParams = selectParams)
  trainIndices <- seq(1, nrow(measurements), 2)
  testIndices <- seq(2, nrow(measurements), 2)

  runTest(measurements[trainIndices, ], classes[trainIndices],
          measurements[testIndices, ], classes[testIndices],
          modellingParams = modellingParams)
#}
Enables classification schemes such as ordinary 10-fold, 100 permutations 5-fold, and leave-one-out cross-validation. Processing in parallel is possible by leveraging the package BiocParallel.
## S4 method for signature 'matrix'
runTests(measurements, outcome, ...)

## S4 method for signature 'DataFrame'
runTests(
  measurements,
  outcome,
  crossValParams = CrossValParams(),
  modellingParams = ModellingParams(),
  characteristics = S4Vectors::DataFrame(),
  ...,
  verbose = 1
)

## S4 method for signature 'MultiAssayExperiment'
runTests(measurements, outcome, ...)
measurements |
Either a |
... |
Variables not used by the |
outcome |
Either a factor vector of classes, a |
crossValParams |
An object of class |
modellingParams |
An object of class |
characteristics |
A |
verbose |
Default: 1. A number between 0 and 3 for the amount of progress messages to give. A higher number will produce more messages as more lower-level functions print messages. |
An object of class ClassifyResult.
Dario Strbenac
#if(require(sparsediscrim))
#{
  data(asthma)

  CVparams <- CrossValParams(permutations = 5)
  tuneList <- list(nFeatures = seq(5, 25, 5), performanceType = "Balanced Error")
  selectParams <- SelectParams("t-test", tuneParams = tuneList)
  modellingParams <- ModellingParams(selectParams = selectParams)
  runTests(measurements, classes, CVparams, modellingParams,
           DataFrame(characteristic = c("Assay Name", "Classifier Name"),
                     value = c("Asthma", "Different Means"))
           )
#}
A grid of coloured tiles is drawn. There is one column for each sample and one row for each cross-validation result.
## S4 method for signature 'ClassifyResult'
samplesMetricMap(results, ...)

## S4 method for signature 'list'
samplesMetricMap(
  results,
  comparison = "auto",
  metric = "auto",
  featureValues = NULL,
  featureName = NULL,
  metricColours = list(c("#FFFFFF", "#CFD1F2", "#9FA3E5", "#6F75D8", "#3F48CC"),
    c("#FFFFFF", "#E1BFC4", "#C37F8A", "#A53F4F", "#880015")),
  classColours = c("#3F48CC", "#880015"),
  groupColours = c("darkgreen", "yellow2"),
  fontSizes = c(24, 16, 12, 12, 12),
  mapHeight = 4,
  title = "auto",
  showLegends = TRUE,
  xAxisLabel = "Sample Name",
  showXtickLabels = TRUE,
  yAxisLabel = "Analysis",
  showYtickLabels = TRUE,
  legendSize = grid::unit(1, "lines")
)

## S4 method for signature 'matrix'
samplesMetricMap(
  results,
  classes,
  metric = c("Sample Error", "Sample Accuracy"),
  featureValues = NULL,
  featureName = NULL,
  metricColours = list(c("#3F48CC", "#6F75D8", "#9FA3E5", "#CFD1F2", "#FFFFFF"),
    c("#880015", "#A53F4F", "#C37F8A", "#E1BFC4", "#FFFFFF")),
  classColours = c("#3F48CC", "#880015"),
  groupColours = c("darkgreen", "yellow2"),
  fontSizes = c(24, 16, 12, 12, 12),
  mapHeight = 4,
  title = "Error Comparison",
  showLegends = TRUE,
  xAxisLabel = "Sample Name",
  showXtickLabels = TRUE,
  yAxisLabel = "Analysis",
  showYtickLabels = TRUE,
  legendSize = grid::unit(1, "lines")
)
results |
A list of |
... |
Parameters not used by the |
comparison |
Default: |
metric |
Default: |
featureValues |
If not NULL, can be a named factor or named numeric vector specifying some variable of interest to plot above the heatmap. |
featureName |
A label describing the information in
|
metricColours |
If the outcome is categorical, a list of vectors of colours for metric levels for each class. If the outcome is numeric, such as a risk score, then a single vector of colours for the metric levels for all samples. |
classColours |
Either a vector of colours for class levels if both classes should have same colour, or a list of length 2, with each component being a vector of the same length. The vector has the colour gradient for each class. |
groupColours |
A vector of colours for group levels. Only useful if
|
fontSizes |
A vector of length 5. The first number is the size of the title. The second number is the size of the axes titles. The third number is the size of the axes values. The fourth number is the size of the legends' titles. The fifth number is the font size of the legend labels. |
mapHeight |
Height of the map, relative to the height of the class colour bar. |
title |
The title to place above the plot. |
showLegends |
Logical. If FALSE, the legend is not drawn. |
xAxisLabel |
The name plotted for the x-axis. NULL suppresses label. |
showXtickLabels |
Logical. If FALSE, the x-axis labels are hidden. |
yAxisLabel |
The name plotted for the y-axis. NULL suppresses label. |
showYtickLabels |
Logical. If FALSE, the y-axis labels are hidden. |
legendSize |
The size of the boxes in the legends. |
classes |
If |
The names of results determine the row names that will be in the plot. The length of metricColours determines how many bins the metric values will be discretised to.
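The link between the number of colours and the number of bins can be sketched in base R. The example below assumes equal-width bins over the range 0 to 1 and made-up per-sample accuracies; the binning that samplesMetricMap actually applies may differ, so treat this only as an illustration of the idea.

```r
# One colour gradient of length 5, so the metric is cut into 5 bins.
metricColours <- c("#FFFFFF", "#CFD1F2", "#9FA3E5", "#6F75D8", "#3F48CC")

# Made-up per-sample accuracies, named by sample.
sampleAccuracy <- c(A = 0.05, B = 0.30, C = 0.50, D = 0.75, E = 1.00)

# Assumption: equal-width bins spanning 0 to 1.
nBins <- length(metricColours)
binned <- cut(sampleAccuracy, breaks = seq(0, 1, length.out = nBins + 1),
              include.lowest = TRUE)

# Each sample's tile takes the colour of the bin that its metric value falls in.
tileColours <- metricColours[as.integer(binned)]
names(tileColours) <- names(sampleAccuracy)
tileColours
```

Supplying a longer colour vector would therefore give a finer-grained map, and a shorter one a coarser map.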
A grob is returned that can be drawn on a graphics device.
Dario Strbenac
predicted <- DataFrame(sample = LETTERS[sample(10, 100, replace = TRUE)],
                       class = rep(c("Healthy", "Cancer"), each = 50))
actual <- factor(rep(c("Healthy", "Cancer"), each = 5), levels = c("Healthy", "Cancer"))
features <- sapply(1:100, function(index) paste(sample(LETTERS, 3), collapse = ''))
result1 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validation"),
                                    value = c("Example", "t-test", "Differential Expression", "2 Permutations, 2 Folds")),
                          LETTERS[1:10], features, list(1:100), list(sample(10, 10)),
                          list(function(oracle){}), NULL, predicted, actual)
predicted[, "class"] <- sample(predicted[, "class"])
result2 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validation"),
                                    value = c("Example", "Bartlett Test", "Differential Variability", "2 Permutations, 2 Folds")),
                          LETTERS[1:10], features, list(1:100), list(sample(10, 10)),
                          list(function(oracle){}), NULL, predicted, actual)
result1 <- calcCVperformance(result1)
result2 <- calcCVperformance(result2)
groups <- factor(rep(c("Male", "Female"), length.out = 10))
names(groups) <- LETTERS[1:10]
cholesterol <- c(4.0, 5.5, 3.9, 4.9, 5.7, 7.1, 7.9, 8.0, 8.5, 7.2)
names(cholesterol) <- LETTERS[1:10]

wholePlot <- samplesMetricMap(list(Gene = result1, Protein = result2))
wholePlot <- samplesMetricMap(list(Gene = result1, Protein = result2),
                              featureValues = groups, featureName = "Gender")
wholePlot <- samplesMetricMap(list(Gene = result1, Protein = result2),
                              featureValues = cholesterol, featureName = "Cholesterol")
samplesSplits creates two lists of lists. The first has training samples and the second has test samples, for a range of different cross-validation schemes.
splitsTestInfo creates a table for tracking the permutation, fold number, or subset of each set of test samples. It is useful for column-binding to the predictions, once they are unlisted into a vector.
samplesSplits(
  samplesSplits = c("k-Fold", "Permute k-Fold", "Permute Percentage Split", "Leave-k-Out"),
  permutations = 100,
  folds = 5,
  percentTest = 25,
  leave = 2,
  outcome
)

splitsTestInfo(
  samplesSplits = c("k-Fold", "Permute k-Fold", "Permute Percentage Split", "Leave-k-Out"),
  permutations = 100,
  folds = 5,
  percentTest = 25,
  leave = 2,
  splitsList
)
samplesSplits |
Default: |
permutations |
Default: |
folds |
Default: |
percentTest |
Default: |
leave |
Default: |
outcome |
A |
splitsList |
The return value of the function |
For samplesSplits, two lists of the same length. The first is the training partitions and the second is the test partitions.
For splitsTestInfo, a table with a subset of columns "permutation", "fold" and "subset", depending on the cross-validation scheme specified.
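The shape of the returned partitions can be pictured with a small base R sketch of plain k-fold splitting. This is not the package's implementation: it assigns samples to folds at random without considering class balance, and the variable names are chosen only for illustration.

```r
set.seed(1) # Only so that the illustration is repeatable.
classes <- factor(rep(c('A', 'B'), c(15, 5)))
folds <- 5

# Randomly assign every sample to one of the folds, as evenly as possible.
foldOfSample <- sample(rep(seq_len(folds), length.out = length(classes)))

# One test partition per fold; the matching training partition is all other samples.
testPartitions <- lapply(seq_len(folds), function(fold) which(foldOfSample == fold))
trainPartitions <- lapply(testPartitions, function(testSamples)
                          setdiff(seq_along(classes), testSamples))

# Two lists of the same length, as described for the return value.
list(train = trainPartitions, test = testPartitions)
```

Each sample appears in exactly one test partition, and each training partition is the complement of its test partition.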
classes <- factor(rep(c('A', 'B'), c(15, 5)))
splitsList <- samplesSplits(permutations = 1, outcome = classes)
splitsList
splitsTestInfo(permutations = 1, splitsList = splitsList)
Pair-wise overlaps can be done for two types of analyses. Firstly, each cross-validation iteration can be considered within a single classification. This explores the feature selection stability. Secondly, the overlap may be considered between different classification results. This approach compares the feature selection commonality between different selection methods. Two types of commonality are possible to analyse. One summary is the average pair-wise overlap between all levels of the comparison factor and the other summary is the pair-wise overlap of each level of the comparison factor that is not the reference level against the reference level. The overlaps are converted to percentages and plotted as lineplots.
## S4 method for signature 'ClassifyResult'
selectionPlot(results, ...)

## S4 method for signature 'list'
selectionPlot(
  results,
  comparison = "within",
  referenceLevel = NULL,
  characteristicsList = list(x = "auto"),
  coloursList = list(),
  alpha = 1,
  orderingList = list(),
  binsList = list(),
  yMax = 100,
  densityStyle = c("box", "violin"),
  fontSizes = c(24, 16, 12, 16),
  title = if (comparison == "within") "Feature Selection Stability" else
    if (comparison == "size") "Feature Selection Size" else
    if (comparison == "importance") "Variable Importance" else
    "Feature Selection Commonality",
  yLabel = if (is.null(referenceLevel) && !comparison %in% c("size", "importance"))
    "Common Features (%)" else if (comparison == "size") "Set Size" else
    if (comparison == "importance") tail(names(results[[1]]@importance), 1) else
    paste("Common Features with", referenceLevel, "(%)"),
  margin = grid::unit(c(1, 1, 1, 1), "lines"),
  rotate90 = FALSE,
  showLegend = TRUE,
  parallelParams = bpparam()
)
results |
A list of |
... |
Not used by end user. |
comparison |
Default: |
referenceLevel |
The level of the comparison factor to use as the
reference to compare each non-reference level to. If |
characteristicsList |
A named list of characteristics. Each element's
name must be one of |
coloursList |
A named list of plot aspects and colours for the aspects.
No elements are mandatory. If specified, each list element's name must be
either |
alpha |
Default: 1. A number between 0 and 1 specifying the transparency level of any fill. |
orderingList |
An optional named list. Any of the variables specified
to |
binsList |
Used only if |
yMax |
Used only if |
densityStyle |
Default: "box". Either |
fontSizes |
A vector of length 4. The first number is the size of the
title. The second number is the size of the axes titles. The third number
is the size of the axes values. The fourth number is the font size of the
titles of grouped plots, if any are produced. In other words, when
|
title |
An overall title for the plot. By default, specifies whether stability or commonality is shown. |
yLabel |
Label to be used for the y-axis of overlap percentages. By default, specifies whether stability or commonality is shown. |
margin |
The margin to have around the plot. |
rotate90 |
Logical. If |
showLegend |
If |
parallelParams |
An object of class |
Additionally, a heatmap of selection size frequencies can be made by specifying size as the comparison to make.
Lastly, a plot showing the distribution of performance metric changes when features are excluded from training can be made if variable importance calculation was turned on during cross-validation.
If comparison is "within", then the feature selection overlaps are compared within a particular analysis. The result will inform how stable the selections are between different iterations of cross-validation for a particular analysis. Otherwise, the comparison is between different cross-validation runs, and this gives an indication of how common the selected features are across different classifications.
Calculating all pair-wise set overlaps can be time-consuming. This stage can be done on multiple CPUs by providing the relevant options to parallelParams. The percentage is calculated as the intersection of two sets of features divided by the union of the sets, multiplied by 100.
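The percentage described above can be reproduced in a few lines of base R. The helper name overlapPercent and the gene names are hypothetical and used only for illustration; the package computes the overlaps internally.

```r
# Hypothetical helper: size of the intersection divided by size of the union, times 100.
overlapPercent <- function(setA, setB)
{
  length(intersect(setA, setB)) / length(union(setA, setB)) * 100
}

# Made-up selections from three cross-validation iterations.
selections <- list(c("GeneA", "GeneB", "GeneC", "GeneD"),
                   c("GeneC", "GeneD", "GeneE", "GeneF"),
                   c("GeneA", "GeneB", "GeneC", "GeneD"))

# All pair-wise comparisons: (1, 2), (1, 3) and (2, 3).
pairs <- combn(length(selections), 2)
pairOverlaps <- apply(pairs, 2, function(pair)
                      overlapPercent(selections[[pair[1]]], selections[[pair[2]]]))
pairOverlaps
```

The number of pairs grows quadratically with the number of iterations, which is why spreading this stage over several CPUs can help.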
For the feature selection size mode, binsList is used to create bins which include the lowest value for the first bin, and the highest value for the last bin using cut.
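The effect of including the lowest value can be seen directly with cut from base R. The set sizes below are made up, and the breaks mirror the setSizes element used in the example for this function; how selectionPlot calls cut internally is assumed rather than confirmed here.

```r
# Made-up selection sizes, including values on the outermost break points.
setSizes <- c(0, 5, 10, 10, 15, 25)
sizeBreaks <- seq(0, 25, 5)

# By default, intervals are open on the left, so a value equal to the lowest break is lost.
withoutLowest <- cut(setSizes, breaks = sizeBreaks)

# With include.lowest = TRUE, the first bin becomes closed on both sides.
withLowest <- cut(setSizes, breaks = sizeBreaks, include.lowest = TRUE)

table(withLowest)
```

With the default settings, a selection size of 0 would be turned into NA and silently dropped from the frequency table, so closing the first bin keeps every value counted.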
An object of class ggplot and a plot on the current graphics device, if plot is TRUE.
Dario Strbenac
predicted <- DataFrame(sample = sample(10, 100, replace = TRUE),
                       class = rep(c("Healthy", "Cancer"), each = 50))
actual <- factor(rep(c("Healthy", "Cancer"), each = 5))
allFeatures <- sapply(1:100, function(index) paste(sample(LETTERS, 3), collapse = ''))
rankList <- list(allFeatures[1:100], allFeatures[c(5:1, 6:100)],
                 allFeatures[c(1:9, 11, 10, 12:100)], allFeatures[c(1:50, 60:51, 61:100)])
result1 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validations"),
                                    value = c("Melanoma", "t-test", "Random Forest", "2 Permutations, 2 Folds")),
                          LETTERS[1:10], allFeatures, rankList,
                          list(rankList[[1]][1:15], rankList[[2]][1:15],
                               rankList[[3]][1:10], rankList[[4]][1:10]),
                          list(function(oracle){}), NULL, predicted, actual)

predicted[, "class"] <- sample(predicted[, "class"])
rankList <- list(allFeatures[1:100], allFeatures[c(sample(20), 21:100)],
                 allFeatures[c(1:9, 11, 10, 12:100)], allFeatures[c(1:50, 60:51, 61:100)])
result2 <- ClassifyResult(DataFrame(characteristic = c("Data Set", "Selection Name", "Classifier Name", "Cross-validation"),
                                    value = c("Melanoma", "t-test", "Diagonal LDA", "2 Permutations, 2 Folds")),
                          LETTERS[1:10], allFeatures, rankList,
                          list(rankList[[1]][1:15], rankList[[2]][1:25],
                               rankList[[3]][1:10], rankList[[4]][1:10]),
                          list(function(oracle){}), NULL, predicted, actual)

cList <- list(x = "Classifier Name", fillColour = "Classifier Name")
selectionPlot(list(result1, result2), characteristicsList = cList)

cList <- list(x = "Classifier Name", fillColour = "size")
selectionPlot(list(result1, result2), comparison = "size",
              characteristicsList = cList,
              binsList = list(frequencies = seq(0, 100, 10), setSizes = seq(0, 25, 5))
              )
Collects and checks necessary parameters required for feature selection. Either one function is specified or a list of functions to perform ensemble feature selection. The empty constructor is provided for convenience.
SelectParams(featureRanking, characteristics = DataFrame(), minPresence = 1,
             intermediate = character(0), subsetToSelections = TRUE,
             tuneParams = list(nFeatures = seq(10, 100, 10), performanceType = "Balanced Accuracy"), ...)
Creates a SelectParams object which stores the function(s) which will do the selection and the parameters that the function(s) will use.
featureRanking
A character keyword referring to a registered feature ranking function. See available for valid keywords.
characteristics
A DataFrame describing the characteristics of feature selection to be done. First column must be named "characteristic" and second column must be named "value". If using wrapper functions for feature selection in this package, the feature selection name will automatically be generated and therefore it is not necessary to specify it.
minPresence
If a list of functions was provided, the minimum number of them that must have selected a feature for it to be used in classification. 1 is equivalent to a set union, and a number equal to the length of featureRanking is equivalent to a set intersection.
intermediate
Character vector. Names of any variables created in prior stages by runTest that need to be passed to a feature selection function.
subsetToSelections
Whether to subset the data table(s), after feature selection has been done.
tuneParams
A list specifying tuning parameters required during feature selection. The names of the list are the names of the parameters and the vectors are the values of the parameters to try. All possible combinations are generated. Two elements named nFeatures and performanceType are mandatory; they define how many top-ranked features to try and the performance metric used to choose among them.
...
Other named parameters which will be used by the selection function. If featureRanking was a list of functions, this must be a list of lists, as long as featureRanking.
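The union and intersection behaviour described for minPresence can be illustrated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the helper keepByPresence and the feature names are hypothetical.

```r
# Hypothetical selections made by three feature selection methods.
selections <- list(
  methodA = c("GENE1", "GENE2", "GENE3"),
  methodB = c("GENE2", "GENE3", "GENE4"),
  methodC = c("GENE3", "GENE5")
)

# Illustrative helper (not a ClassifyR function): keep the features that
# were chosen by at least minPresence of the methods.
keepByPresence <- function(selections, minPresence = 1) {
  counts <- table(unlist(lapply(selections, unique)))
  names(counts)[as.vector(counts) >= minPresence]
}

unionSet <- keepByPresence(selections, minPresence = 1)
majoritySet <- keepByPresence(selections, minPresence = 2)
intersectionSet <- keepByPresence(selections, minPresence = length(selections))
```

With a threshold of 1 every feature chosen by any method is kept, and with a threshold equal to the number of methods only the features chosen by all of them remain.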
selectParams is a SelectParams object.
show(selectParams): Prints a short summary of what selectParams contains.
Dario Strbenac
#if(require(sparsediscrim))
#{
  SelectParams("KS")

  # Ensemble feature selection.
  SelectParams(list("Bartlett", "Levene"))
#}
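The statement that all possible combinations of tuneParams values are generated can be pictured with base R's expand.grid. This is only a sketch of the idea, assuming a grid is formed this way; the extra parameter alpha is hypothetical and is not a documented ClassifyR option.

```r
# Default tuning values taken from the SelectParams usage.
tuneParams <- list(nFeatures = seq(10, 100, 10),
                   performanceType = "Balanced Accuracy")

# One row per combination of candidate values.
tuneGrid <- expand.grid(tuneParams, stringsAsFactors = FALSE)

# Adding a second multi-valued (hypothetical) parameter multiplies
# the number of combinations that would have to be evaluated.
tuneParams$alpha <- c(0.01, 0.05)
biggerGrid <- expand.grid(tuneParams, stringsAsFactors = FALSE)

nrow(tuneGrid)
nrow(biggerGrid)
```

Because the grid grows multiplicatively, keeping each vector of candidate values short limits the amount of computation done inside every cross-validation iteration.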
Collects and checks necessary parameters required for classifier training. The empty constructor is provided for convenience.
TrainParams(classifier, balancing = c("downsample", "upsample", "none"), characteristics = DataFrame(),
intermediate = character(0), tuneParams = NULL, getFeatures = NULL, ...)
Creates a TrainParams object which stores the function which will do the classifier building and the parameters that the function will use.
classifier
A character keyword referring to a registered classifier. See available for valid keywords.
balancing
Default: "downsample". A keyword specifying how to handle class imbalance for data sets with categorical outcome. Valid values are "downsample", "upsample" and "none".
characteristics
A DataFrame describing the characteristics of the classifier used. First column must be named "characteristic" and second column must be named "value". If using wrapper functions for classifiers in this package, a classifier name will automatically be generated and therefore it is not necessary to specify it.
intermediate
Character vector. Names of any variables created in prior stages by runTest that need to be passed to classifier.
tuneParams
A list specifying tuning parameters required during classifier training. The names of the list are the names of the parameters and the vectors are the values of the parameters to try. All possible combinations are generated.
getFeatures
A function may be specified that extracts the selected features from the trained model. This is relevant if using a classifier that does feature selection within training (e.g. random forest). The function must return a list of two vectors. The first vector contains the ranked features (or empty if the training algorithm doesn't produce rankings) and the second vector contains the selected features.
...
Other named parameters which will be used by the classifier.
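The three balancing keywords can be illustrated with a small base R sketch. It shows one plausible way of equalising class sizes and is not the package's implementation: the helper balanceIndices and the example classes are hypothetical.

```r
set.seed(1)
# Made-up imbalanced outcome: 8 healthy samples and 3 cancer samples.
classes <- factor(rep(c("Healthy", "Cancer"), times = c(8, 3)))

# Illustrative helper (not a ClassifyR function): returns the sample
# indices that would be used for training under each balancing keyword.
balanceIndices <- function(classes, balancing = c("downsample", "upsample", "none")) {
  balancing <- match.arg(balancing)
  if (balancing == "none") return(seq_along(classes))
  byClass <- split(seq_along(classes), classes)
  sizes <- lengths(byClass)
  target <- if (balancing == "downsample") min(sizes) else max(sizes)
  unlist(lapply(byClass, function(indices) {
    if (length(indices) == target) return(indices)
    # Sample with replacement only when a class has to grow.
    indices[sample.int(length(indices), target, replace = length(indices) < target)]
  }), use.names = FALSE)
}

table(classes[balanceIndices(classes, "downsample")])
table(classes[balanceIndices(classes, "upsample")])
```

Downsampling discards samples from the larger class, whereas upsampling repeats samples from the smaller class, so the choice trades lost information against duplicated observations.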
trainParams is a TrainParams object.
show(trainParams): Prints a short summary of what trainParams contains.
Dario Strbenac
#if(require(sparsediscrim))
trainParams <- TrainParams("DLDA")
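The return value expected from a getFeatures function, a list of two vectors with ranked features first and selected features second, can be sketched as follows. The model object and its importance element are assumptions made for illustration; real trained models store this information differently.

```r
# Hypothetical trained model: a list holding per-feature importance scores.
model <- list(importance = c(geneA = 0.40, geneB = 0.00, geneC = 0.25, geneD = 0.10))

# Illustrative extractor: returns a list of two vectors, the ranked
# features first and the selected features second.
extractFeatures <- function(model) {
  scores <- sort(model$importance, decreasing = TRUE)
  ranked <- names(scores)
  # Assumption: a feature counts as selected if its importance is above zero.
  selected <- ranked[unname(scores > 0)]
  list(ranked, selected)
}

features <- extractFeatures(model)
```

A function of this shape could then be supplied as getFeatures = extractFeatures when constructing a TrainParams object. For a training algorithm that produces no rankings, the first element would be an empty vector.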
Collects and checks necessary parameters required for transformation within CV.
TransformParams(transform, characteristics = DataFrame(), intermediate = character(0), ...)
Creates a TransformParams object which stores the function which will do the transformation and the parameters that the function will use.
transform
A character keyword referring to a registered transformation function. See available for valid keywords.
characteristics
A DataFrame describing the characteristics of data transformation to be done. First column must be named "characteristic" and second column must be named "value". If using wrapper functions for data transformation in this package, the data transformation name will automatically be generated and therefore it is not necessary to specify it.
intermediate
Character vector. Names of any variables created in prior stages by runTest that need to be passed to the transformation function.
...
Other named parameters which will be used by the transformation function.
transformParams is a TransformParams object.
show(transformParams): Prints a short summary of what transformParams contains.
Dario Strbenac
# Subtract all values from training set median, to obtain absolute deviations.
transformParams <- TransformParams("diffLoc", location = "median")
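The transformation named in the example, absolute deviations from the training set median, can be written out in base R. This is a sketch of the arithmetic only, assuming rows are samples and columns are features; the data are made up and the code is not the package's diffLoc implementation.

```r
# Made-up data: rows are samples, columns are features.
training <- matrix(c(1, 2, 3, 10,
                     5, 5, 8, 9), ncol = 2,
                   dimnames = list(NULL, c("geneA", "geneB")))
testing <- matrix(c(4, 7), ncol = 2,
                  dimnames = list(NULL, c("geneA", "geneB")))

# The location is estimated from the training samples only, so no
# information from the test samples enters the transformation.
trainingMedians <- apply(training, 2, median)

# sweep subtracts each feature's training median; abs gives the deviations.
trainingDeviations <- abs(sweep(training, 2, trainingMedians))
testingDeviations <- abs(sweep(testing, 2, trainingMedians))
```

After this step a feature whose spread differs between classes shows a difference in mean deviation, which is why such a transformation is useful ahead of a classifier aimed at differential variability.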