Title: | Selecting the number of mutational signatures through cross-validation |
---|---|
Description: | An unsupervised cross-validation method to select the optimal number of mutational signatures. A data set of mutational counts is split into training and validation data. Signatures are estimated in the training data and then used to predict the mutations in the validation data. |
Authors: | DongHyuk Lee [aut], Bin Zhu [aut], Bill Wheeler [cre] |
Maintainer: | Bill Wheeler |
License: | GPL-2 |
Version: | 1.9.0 |
Built: | 2024-10-31 05:38:11 UTC |
Source: | https://github.com/bioc/SUITOR |
To select the number of mutational signatures through cross-validation.
SUITOR (Selecting the nUmber of mutatIonal signaTures thrOugh cRoss-validation) is an unsupervised cross-validation method that requires few assumptions and no numerical approximations to select the optimal number of signatures without overfitting the data. The full dataset of mutation counts is split into a training set and a validation set. For a given number of signatures, the signatures are estimated in the training set and then used to predict the mutations in the validation set. Multiple candidate numbers of signatures are considered, and the number whose predictions most closely match the mutations in the validation set is selected.
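The training/validation split described above can be illustrated with a small base R sketch. This is only one plausible way to hold out data (assigning individual cells of the count matrix to folds); the toy matrix, the name `k_fold`, and the cell-wise masking are assumptions made for illustration and are not taken from the package source.

```r
set.seed(1)

# Toy count matrix: 6 mutation types (rows) by 5 samples (columns).
counts <- matrix(rpois(30, lambda = 10), nrow = 6, ncol = 5)

k_fold <- 3  # hypothetical number of folds for this sketch

# Assign every cell to one of k_fold folds, as evenly as possible.
fold_id <- matrix(sample(rep_len(seq_len(k_fold), length(counts))),
                  nrow = nrow(counts), ncol = ncol(counts))

# For fold 1: cells in the fold are hidden (NA) in the training matrix
# and are the only cells kept in the validation matrix.
train <- counts
train[fold_id == 1] <- NA
valid <- counts
valid[fold_id != 1] <- NA

table(fold_id)
```

Signatures would then be estimated from `train` and scored on the cells kept in `valid`, repeating the procedure for each fold and each candidate number of signatures.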
The two main functions in this package are suitor and suitorExtractWH.
Donghyuk Lee and Bin Zhu
Lee, D., Wang, D., Yang, X., Shi, J., Landi, M., Zhu, B. (2021) SUITOR: selecting the number of mutational signatures through cross-validation. bioRxiv, doi: https://doi.org/10.1101/2021.07.28.454269.
Compute summary results and the optimal rank from the matrix containing all results.
getSummary(obj, NC, NR=96)
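obj |
Matrix containing all results (all.results) in the return list from suitor |
NC |
The number of columns in the input data passed to suitor |
NR |
The number of rows in the input data passed to suitor. The default is 96. |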
The input matrix obj must have column 1 as the rank, column 2 as the value of k in 1:k.fold, column 4 as the training errors, and column 5 as the testing errors.
A list containing the objects:
rank: The optimal rank.
all.results: Matrix containing training and testing errors for all values of seeds, ranks, and folds. NA values appear for runs in which the EM algorithm did not converge.
summary: Data frame of summarized results for each possible rank, created from all.results. The MSErr column is defined as sqrt((fold1 + ... + foldK) / (nrow(data) * ncol(data))).
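As a rough illustration of the MSErr definition, the following base R sketch summarizes a toy matrix laid out like the input described above. The numbers are invented, the content of column 3 is an assumption (it is treated here as a start index), and each rank/fold pair has a single run, so this is not the package's own summary code.

```r
# Toy stand-in for all.results: ranks 1-3, 2 folds, one start each.
# Column 3 is a placeholder (assumed here to be a start/seed index).
all_res <- cbind(
  rank  = rep(1:3, each = 2),
  k     = rep(1:2, times = 3),
  start = 1,
  train = c(900, 880, 500, 520, 450, 440),
  test  = c(950, 930, 560, 540, 600, 620)
)
NR <- 96; NC <- 300

# Sum the fold errors within each rank, then scale and take the square root.
test_sum  <- tapply(all_res[, 5], all_res[, 1], sum)
train_sum <- tapply(all_res[, 4], all_res[, 1], sum)
ms_test   <- sqrt(test_sum  / (NR * NC))
ms_train  <- sqrt(train_sum / (NR * NC))

# The selected rank minimizes the validation (testing) error.
best_rank <- as.integer(names(ms_test)[which.min(ms_test)])
best_rank
```

In this toy example the training error keeps falling as the rank grows, while the testing error is smallest at rank 2, which is therefore the rank selected.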
    data(SimData, package="SUITOR")
    data(results, package="SUITOR")
    ret <- getSummary(results$all.results, ncol(SimData))
    ret$summary
    ret$rank
A data frame with columns Rank, Type, and MSErr
    data(plotData, package="SUITOR")
    plotData
Plot train and test errors
plotErrors(x)
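x |
Data frame of summarized results with columns Rank, Type, and MSErr, such as the summary object returned by suitor or getSummary |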
The optimal rank is the rank at which the test error attains its minimum; it appears as a red dot on the graph.
Returns NULL.
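The selection rule can be sketched in base R on a toy data frame with the same three columns. The error values and the Type labels "Train" and "Test" are assumptions for illustration, and the plot uses base graphics rather than the package's own plotting code.

```r
# Toy data frame in the layout described for plotData.
# The Type labels "Train" and "Test" are assumptions for this sketch.
err <- data.frame(
  Rank  = rep(1:4, times = 2),
  Type  = rep(c("Train", "Test"), each = 4),
  MSErr = c(0.40, 0.25, 0.20, 0.18,   # training error keeps decreasing
            0.42, 0.28, 0.26, 0.31)   # testing error turns back up
)

test_err <- err[err$Type == "Test", ]
opt_rank <- test_err$Rank[which.min(test_err$MSErr)]

# Base-graphics version of an error plot with the optimum marked in red.
plot(MSErr ~ Rank, data = test_err, type = "b", ylim = range(err$MSErr),
     ylab = "MSErr")
lines(MSErr ~ Rank, data = err[err$Type == "Train", ], type = "b", lty = 2)
points(opt_rank, min(test_err$MSErr), col = "red", pch = 19)
```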
    data(plotData, package="SUITOR")
    plotErrors(plotData)
An object returned from the suitor function for examples
    data(results, package="SUITOR")
    results
Example input data and results
Contains an example input data object of size 96 by 300. It is generated by rpois with mean WH, where W (96 by 8) holds the profiles of 8 signatures (SBS 4, 6, 7a, 9, 17b, 22, 26, 39) obtained from https://cancer.sanger.ac.uk/cosmic/signatures/SBS, and H (8 by 300) contains rounded integers generated from a uniform distribution between 0 and 100, with some randomly selected cells set to zero.
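The generating process can be sketched in base R as follows. Random profiles stand in for the COSMIC signatures, and the 10% share of zeroed cells is an arbitrary assumption, so the result resembles SimData in shape only.

```r
set.seed(42)
n_types <- 96; n_sig <- 8; n_samples <- 300

# Stand-in signature profiles: random non-negative columns that sum to 1.
# (SimData uses COSMIC SBS profiles instead.)
W <- matrix(runif(n_types * n_sig), n_types, n_sig)
W <- sweep(W, 2, colSums(W), "/")

# Activities: rounded uniform draws on [0, 100], some cells set to zero.
H <- matrix(round(runif(n_sig * n_samples, 0, 100)), n_sig, n_samples)
zero_cells <- sample(length(H), size = 0.1 * length(H))  # 10% is arbitrary
H[zero_cells] <- 0

# Poisson counts with mean WH.
sim <- matrix(rpois(n_types * n_samples, lambda = W %*% H),
              n_types, n_samples)
dim(sim)
```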
    data(SimData, package="SUITOR")
    # Display a subset of data objects
    SimData[1:5, 1:5]
Selecting the number of mutational signatures through cross-validation
suitor(data, op=NULL)
data |
Data frame or matrix of mutation counts. This object must contain non-negative values. |
op |
List of options (see details). The default is NULL. |
The algorithm finds the optimal rank by applying k-fold cross validation.
Options list op:
Name | Description | Default Value |
---|---|---|
em.eps | EM algorithm stopping tolerance | 1e-5 |
get.summary | 0 or 1 to create summary results | 1 |
k.fold | Number of folds | 10 |
max.iter | Maximum number of iterations in EM algorithm | 2000 |
max.rank | Maximum rank | 10 |
min.rank | Minimum rank | 1 |
min.value | Minimum value of matrix before factorizing | 1e-4 |
BPPARAM | See BiocParallelParam | NULL |
n.starts | Number of starting points | 30 |
plot | 0 or 1 to produce an error plot | 1 |
print | 0 or 1 to print info | 1 |
kfold.vec | Vector of values in 1:k.fold when running on a cluster | NULL |
Parallel computing
The BiocParallel package is used for parallel computing. If BPPARAM = NULL, then BPPARAM will be set to SerialParam.
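A minimal sketch of passing a parallel backend through the options list is shown below. It assumes that both BiocParallel and SUITOR are installed, and the reduced values for max.rank, k.fold, and n.starts are chosen only to shorten the run; MulticoreParam relies on forking and is not available on Windows.

```r
# Sketch: request two workers instead of the default serial backend.
# The calls are guarded so the block is skipped if a package is missing.
if (requireNamespace("BiocParallel", quietly = TRUE) &&
    requireNamespace("SUITOR", quietly = TRUE)) {
  data(SimData, package = "SUITOR")
  op <- list(BPPARAM  = BiocParallel::MulticoreParam(workers = 2),
             max.rank = 3,   # smaller than the default of 10
             k.fold   = 5,   # smaller than the default of 10
             n.starts = 4)   # smaller than the default of 30
  ret <- SUITOR::suitor(SimData, op = op)
  ret$rank
}
```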
Utilizing a cluster
When running on a cluster, the option get.summary should be set to 0. For the fastest-running jobs, set the options min.rank = max.rank, kfold.vec to a single integer in 1:k.fold, and n.starts to 1.
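One way to organize such cluster jobs is sketched below: a grid with one job per rank and fold, and a helper that builds the options list for each job. The names `jobs` and `job_options` are hypothetical, and combining the per-job all.results matrices with rbind before calling getSummary is an assumption about the workflow rather than something stated above.

```r
# Hypothetical job grid: one job per (rank, fold) pair.
k_fold <- 10
jobs <- expand.grid(rank = 1:10, fold = seq_len(k_fold))

# Options for job i, following the settings recommended above.
job_options <- function(i) {
  list(min.rank = jobs$rank[i], max.rank = jobs$rank[i],
       k.fold = k_fold, kfold.vec = jobs$fold[i],
       n.starts = 1, get.summary = 0, plot = 0)
}

if (requireNamespace("SUITOR", quietly = TRUE)) {
  data(SimData, package = "SUITOR")
  # On a cluster each call would be a separate job; two are run here.
  parts <- lapply(1:2, function(i)
    SUITOR::suitor(SimData, op = job_options(i))$all.results)
  combined <- do.call(rbind, parts)
  # Once every job has been combined, getSummary(combined, ncol(SimData))
  # would produce the summary table and the optimal rank.
}
```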
A list containing the objects:
rank: The optimal rank.
all.results: Matrix containing training and testing errors for all values of seeds, ranks, and folds.
summary: Data frame of summarized results for each possible rank, created from all.results. The MSErr column is defined as sqrt((fold1 + ... + foldK) / (nrow(data) * ncol(data))).
    data(SimData, package="SUITOR")
    # Using the default options will take several minutes to run
    ret <- suitor(SimData)
Extract the matrix of activities (exposures) and matrix of signatures
suitorExtractWH(data, rank, op=NULL)
data |
Data frame or matrix of mutation counts. This object must contain non-negative values. |
rank |
Integer > 0 |
op |
List of options (see details). The default is NULL. |
Options list op:
Name | Description | Default Value |
---|---|---|
min.value | Minimum value of matrix before factorizing | 1e-4 |
BPPARAM | See BiocParallelParam | NULL |
n.starts | Number of starting points | 30 |
print | 0 or 1 to print info | 1 |
Parallel computing
The BiocParallel package is used for parallel computing. If BPPARAM = NULL, then BPPARAM will be set to SerialParam.
A list containing the objects:
H: Matrix of activities (exposures)
W: Matrix of signatures
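The returned matrices might be inspected as in the sketch below, which assumes SUITOR is installed. The orientation of W (mutation types by rank) and H (rank by samples) is an assumption based on the SimData description, so the product is formed only if the dimensions are compatible; n.starts is reduced from its default to shorten the run.

```r
if (requireNamespace("SUITOR", quietly = TRUE)) {
  data(SimData, package = "SUITOR")
  # Rank 8 matches the number of signatures used to simulate SimData.
  fit <- SUITOR::suitorExtractWH(SimData, 8, op = list(n.starts = 5))
  print(dim(fit$W))  # assumed: mutation types by rank
  print(dim(fit$H))  # assumed: rank by samples
  # If the orientations hold, W %*% H approximates the input counts.
  if (ncol(fit$W) == nrow(fit$H)) {
    fitted_counts <- fit$W %*% fit$H
  }
}
```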
    data(SimData, package="SUITOR")
    suitorExtractWH(SimData, 2)