| Title: | Wavelet-based Feature Extraction for Copy-number Alteration Data |
|---|---|
| Description: | Provides tools for simulating copy-number alteration (CNA) profiles, applying a non-decimated Haar wavelet transform to genomic signals, and extracting wavelet-derived features for use in supervised learning. Multiple machine learning methods including lasso and elastic-net regularisation, random forest, partial least squares, neural networks and k-nearest neighbours are implemented to train predictive models from genomic feature vectors. The workflow enables end-to-end analysis from CNA simulation to feature extraction and classification. |
| Authors: | Maharani Ahsani Ummi [aut, cre], Arief Gusnanto [aut] |
| Maintainer: | Maharani Ahsani Ummi <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1.0 |
| Built: | 2026-06-05 06:44:29 UTC |
| Source: | https://github.com/bioc/wavFeatExt |
Performs circular binary segmentation (CBS) using the DNAcopy package on a numeric vector representing a single copy-number profile.
CBS(obj, chr = rep(1L, length(obj)), verbose = FALSE)CBS(obj, chr = rep(1L, length(obj)), verbose = FALSE)
obj |
Numeric vector containing the copy-number (or log-ratio) values along the genome for a single sample. |
chr |
Vector of the same length as |
verbose |
Logical. If |
This function is a thin wrapper around CNA
and segment from the DNAcopy package.
The input vector is first converted to a "CNA" object using
artificial map locations 1:length(obj). Circular binary
segmentation is then applied, and the estimated segment means are
expanded back to the original resolution to form a segmented
profile of the same length as the input.
A numeric vector of length length(obj) containing
the segmented copy-number (or log-ratio) values, where each
element equals the estimated segment mean for the corresponding
genomic position.
Maharani Ummi [email protected]
Olshen, A. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4), 557–572.
## Generate noisy copy-number data under normal error set.seed(10) cn <- rnorm(100, mean = c(rep(2, 50), rep(1, 50)), sd = 0.4) ## Segmentation using CBS cn.cbs <- CBS(cn) ## Plot raw and segmented profiles plot(cn, type = "p", xlab = "Genomic position", ylab = "Copy-number") lines(cn.cbs, col = 2, lwd = 2)## Generate noisy copy-number data under normal error set.seed(10) cn <- rnorm(100, mean = c(rep(2, 50), rep(1, 50)), sd = 0.4) ## Segmentation using CBS cn.cbs <- CBS(cn) ## Plot raw and segmented profiles plot(cn, type = "p", xlab = "Genomic position", ylab = "Copy-number") lines(cn.cbs, col = 2, lwd = 2)
Performs classification using several machine learning methods on feature sets derived from principal component analysis (PCA) and independent component analysis (ICA), as well as on the original (or segmented) data matrix. Supported methods include Lasso, elastic net, random forest, neural network, partial least squares, and k-nearest neighbours.
classifyPcaIca( data, y, pca, ica, method = c("lasso", "elnet", "RF", "NN", "PLS", "KNN"), k = 5, ite = length(data) )classifyPcaIca( data, y, pca, ica, method = c("lasso", "elnet", "RF", "NN", "PLS", "KNN"), k = 5, ite = length(data) )
data |
A list of numeric matrices, each with observations in rows
and variables in columns, or an object with equivalent structure
(e.g. the output from |
y |
Response vector for the classification task, of length equal
to the number of rows in |
pca |
Result of applying |
ica |
Result of applying |
method |
Character string specifying the classification method to use.
One of |
k |
Number of folds for k-fold cross-validation. Default is 5.
The number of observations must be divisible by |
ite |
Number of cross-validation replications. Default is
|
For each replication, the function constructs a k-fold cross-validation
partition of the observations in data[[1]]. For every feature set
considered (each cumulative PCA feature set, each cumulative ICA feature
set, and the original/segmented data matrix), the chosen classification
method is trained on the training folds and evaluated on the held-out fold.
The cross-validated misclassification error (CE) and area under the ROC
curve (AUC) are computed for each fold and then averaged across folds.
This procedure is repeated for ite replications, and the resulting
averages are stored in matrices for CE and AUC.
The columns of the output matrices correspond to the feature sets used:
the first length(pca[[1]]) columns to PCA-based feature sets,
the next length(ica[[1]]) columns to ICA-based feature sets,
and the final column to the original (or segmented) data matrix
(labelled "seg").
A list with the following components:
Numeric matrix of cross-validated misclassification errors.
Rows correspond to replications (of length ite) and columns
to feature sets.
Numeric matrix of cross-validated areas under the ROC curve.
The dimensions and column names match those of CE.
Character string giving the classification method used.
Maharani Ahsani Ummi and Arief Gusnanto
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.
simulateCNA,
getPca,
getIca,
classifyWavFeatExt
set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) pca <- getPca(sim.dat, k = 4) ica <- getIca(sim.dat, k = 4) y <- factor(rep(c("Group1", "Group2"), each = 10)) res <- classifyPcaIca( sim.dat, y, pca, ica, method = "KNN", k = 5, ite = 1 ) resset.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) pca <- getPca(sim.dat, k = 4) ica <- getIca(sim.dat, k = 4) y <- factor(rep(c("Group1", "Group2"), each = 10)) res <- classifyPcaIca( sim.dat, y, pca, ica, method = "KNN", k = 5, ite = 1 ) res
Performs classification using several machine learning methods on features extracted via non-decimated Haar wavelet transformation, namely detail and scaling coefficients, as well as on the original (or segmented) data matrix. Optionally, all wavelet coefficients can be combined into a single feature set.
classifyWavFeatExt( data, y, det, sca, method = c("lasso", "elnet", "RF", "NN", "PLS", "KNN"), k = 5, ite = length(data), all = FALSE, verbose = FALSE )classifyWavFeatExt( data, y, det, sca, method = c("lasso", "elnet", "RF", "NN", "PLS", "KNN"), k = 5, ite = length(data), all = FALSE, verbose = FALSE )
data |
A non-empty list of numeric matrices with equal number of rows (rows = observations, columns = variables). Each element of the list corresponds to one data set / replicate. |
y |
Response vector for the classification task, of length equal to
the number of rows in |
det |
Detail coefficients of |
sca |
Scaling coefficients of |
method |
Character string specifying the classification method to use.
One of |
k |
Number of folds for k-fold cross-validation (default |
ite |
Number of cross-validation replications (default
|
all |
Logical; if |
verbose |
Logical; if |
For each replication, the function randomly assigns observations to
k folds of equal size. For every feature set considered, the
chosen classification method is trained on the training folds and
evaluated on the held-out fold.
The misclassification error (CE) is computed as the proportion of incorrect predicted classes on the test fold. In addition, the area under the ROC curve (AUC) is computed from the predicted class probabilities using pROC.
Fold-wise CE and AUC are averaged across the k folds to obtain
one CE and one AUC per feature set and replication. This is repeated
for ite replications, producing two matrices (CE and AUC).
Feature sets and column names:
If all = FALSE: columns correspond to detail scales
(D1, D2, ..., Dq), scaling scales (S1, S2, ..., Sp),
and the original/segmented matrix ("seg").
If all = TRUE: columns are "ALL" (all wavelet
coefficients combined) and "seg".
An object of class "wavFeatExtClassifier", a list with:
Numeric matrix of cross-validated misclassification errors.
Numeric matrix of cross-validated AUC values.
Character string giving the classification method used.
Logical indicating whether all = TRUE was used.
Maharani Ahsani Ummi and Arief Gusnanto
Nason, G. P. (2008). Wavelet Methods in Statistics with R. Springer.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
simulateCNA,
wavFeatExt,
classifyPcaIca,
nhwt
set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) det.coef <- wavFeatExt(sim.dat, type = "detail") sca.coef <- wavFeatExt(sim.dat, type = "scaling") y <- factor(rep(c("Group1", "Group2"), each = 10)) res <- classifyWavFeatExt( sim.dat, y, det.coef, sca.coef, method = "KNN", k = 5, ite = 1 ) resset.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) det.coef <- wavFeatExt(sim.dat, type = "detail") sca.coef <- wavFeatExt(sim.dat, type = "scaling") y <- factor(rep(c("Group1", "Group2"), each = 10)) res <- classifyWavFeatExt( sim.dat, y, det.coef, sca.coef, method = "KNN", k = 5, ite = 1 ) res
Applies independent component analysis (ICA) to each matrix in a collection of data sets and returns cumulative independent components for use in downstream classification or feature extraction.
getIca(x, k, ...)getIca(x, k, ...)
x |
Either a numeric matrix, a list of numeric matrices (with rows
corresponding to observations and columns to variables), or the output
from |
k |
Integer specifying the number of independent components to extract from each matrix. Must be at least 2. |
... |
Additional arguments passed to |
For each matrix in x, the function calls
icafast to perform independent component analysis,
requesting k independent components. Let denote the
resulting matrix of independent components (rows = observations,
columns = components).
The function then constructs a list of k - 1 matrices per data set,
where the -th element (for ) contains the
first independent components, i.e. S[, 1:j]. This
cumulative structure mirrors the behaviour of getPca and
provides feature sets with increasing dimensionality.
When x is a single matrix or data frame, it is internally wrapped
into a list of length one, and the result is still a list of lists for
consistency.
A list of length equal to the number of matrices in x.
For each input matrix, an element is returned which is itself a list
of length k - 1. The -th element (for )
of the inner list is a numeric matrix containing the first
independent components for that data set.
The cumulative component representation (2, 3, ..., k
components) is designed to be used in combination with
classifyPcaIca for comparing classification performance
across different feature dimensions.
Maharani Ahsani Ummi and Arief Gusnanto
Hyv\"arinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks 13(4–5), 411–430.
getPca,
classifyPcaIca,
simulateCNA,
icafast
set.seed(10) ## Small simulated data for fast execution sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) ## Obtain ICA-based features ica <- getIca(sim.dat, k = 5) length(ica) length(ica[[1]]) sapply(ica[[1]], dim)set.seed(10) ## Small simulated data for fast execution sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) ## Obtain ICA-based features ica <- getIca(sim.dat, k = 5) length(ica) length(ica[[1]]) sapply(ica[[1]], dim)
Applies principal component analysis (PCA) to each matrix in a collection of data sets and returns cumulative principal components for use in downstream classification or feature extraction.
getPca(x, k, ...)getPca(x, k, ...)
x |
Either a numeric matrix, a list of numeric matrices (with rows
corresponding to observations and columns to variables), or the output
from |
k |
Integer specifying the number of principal components to extract from each matrix. Must be at least 2. |
... |
Additional arguments passed to |
For each matrix in x, the function calls
prcomp to compute principal component scores.
Let denote the resulting score matrix (rows = observations,
columns = principal components).
The function then constructs a list of k - 1 matrices per data
set, where the -th element (for ) contains
the first principal components, i.e. Z[, 1:j]. This
cumulative structure mirrors the behaviour of getIca and
provides feature sets with increasing dimensionality.
A list of length equal to the number of matrices in x.
For each input matrix, an element is returned which is itself a list
of length k - 1. The -th element (for )
of the inner list is a numeric matrix containing the first
principal components for that data set.
The cumulative component representation is designed to be used in
combination with classifyPcaIca for comparing
classification performance across different feature dimensions.
Maharani Ahsani Ummi and Arief Gusnanto
Jolliffe, I. T. (2002). Principal Component Analysis. 2nd edition. Springer.
getIca,
classifyPcaIca,
simulateCNA,
prcomp
set.seed(10) ## Small simulated data for fast execution sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) ## Obtain PCA-based features pca <- getPca(sim.dat, k = 5) length(pca) length(pca[[1]]) sapply(pca[[1]], dim)set.seed(10) ## Small simulated data for fast execution sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) ## Obtain PCA-based features pca <- getPca(sim.dat, k = 5) length(pca) length(pca[[1]]) sapply(pca[[1]], dim)
Performs a non-decimated Haar wavelet transform (stationary wavelet transform) on one or more one-dimensional signals. This function is kept for compatibility with older code and is not used in the main workflow of the package.
nhwt(data, type = c("detail", "scaling"))nhwt(data, type = c("detail", "scaling"))
data |
Numeric vector, matrix, or data frame containing the data to be decomposed. If a matrix or data frame is supplied, rows are interpreted as observations and columns as ordered locations. |
type |
Character string indicating which coefficients to extract
from the transform. Use |
The function applies a non-decimated (stationary) Haar wavelet transform to each row of the input. If the number of columns is not a power of two, the series is extended to the next power of two by constant padding with value 1 on both sides before computing the transform. After the transform, the resulting coefficients are re-aligned and cropped back to the original length.
The underlying transform is computed using wd
with family = "DaubExPhase", filter.number = 1, and
type = "station". For each scale, either detail or scaling
coefficients can be extracted via accessD or
accessC, respectively.
A list of length , where
and is the number of columns in data. Each element is a
numeric matrix of dimension nrow(data) x n containing the detail
or scaling coefficients at a given scale.
This function is considered legacy and is not used by the main feature extraction and classification functions in the package. It is provided for backward compatibility with older analysis pipelines.
Maharani Ahsani Ummi and Arief Gusnanto
Nason, G. P. (2008). Wavelet Methods in Statistics with R. Springer.
wavFeatExt,
plot.nhwt,
wd,
accessD,
accessC
## Simple example: non-decimated Haar wavelet coefficients ## for a piecewise constant signal obj <- c(rep(1, 10), rep(3, 20)) ## Detail coefficients at all scales obj.nhwt.det <- nhwt(obj, type = "detail") length(obj.nhwt.det) sapply(obj.nhwt.det, dim)## Simple example: non-decimated Haar wavelet coefficients ## for a piecewise constant signal obj <- c(rep(1, 10), rep(3, 20)) ## Detail coefficients at all scales obj.nhwt.det <- nhwt(obj, type = "detail") length(obj.nhwt.det) sapply(obj.nhwt.det, dim)
Plot the non-decimated Haar wavelet transform (NHWT) coefficients
produced by nhwt() at one or more resolution scales.
The function can display either detail or scaling coefficients, and
supports global or per-level rescaling for visual comparison.
## S3 method for class 'nhwt' plot( x, coef = c("detail", "scaling"), type = c("global", "by.level"), scale = "all", ... )## S3 method for class 'nhwt' plot( x, coef = c("detail", "scaling"), type = c("global", "by.level"), scale = "all", ... )
x |
The output from |
coef |
Character string indicating the type of coefficients to be
plotted. One of |
type |
Character string indicating how the coefficients are rescaled
for plotting when
|
scale |
Either |
... |
Further graphical arguments (ignored). |
The function produces a vertical stacked line plot where each resolution scale is shown on a separate horizontal band. Within a scale, each coefficient is drawn as a vertical segment whose height is proportional to its (rescaled) value.
If obj.nhwt originates from nhwt() applied to
a data matrix with multiple observations, only the coefficients
for the first observation (row) are used. This behaviour is intended
for visualising a single time series or copy-number profile.
This function is called for its side effects of producing a plot.
It returns invisible(NULL).
This plotting function is primarily intended for exploratory visual inspection of non-decimated Haar wavelet coefficients, for example when assessing multi-scale structure in simulated copy number alteration (CNA) profiles.
Maharani Ahsani Ummi
Nason, G. P. (2008). Wavelet Methods in Statistics with R. Springer.
## Simple example: single synthetic profile set.seed(1) obj <- c(rep(1, 10), rep(3, 20)) ## Non-decimated Haar wavelet transform (detail coefficients) obj.nhwt.det <- nhwt(obj, type = "detail") ## Plot NHWT detail coefficients (all scales, global scaling) plot(obj.nhwt.det, coef = "detail", type = "global") ## Scaling coefficients obj.nhwt.sca <- nhwt(obj, type = "scaling") plot(obj.nhwt.sca, coef = "scaling", type = "global") ## Plot a single scale (scale 1) for detail and scaling coefficients plot(obj.nhwt.det, coef = "detail", scale = 1) plot(obj.nhwt.sca, coef = "scaling", scale = 1)## Simple example: single synthetic profile set.seed(1) obj <- c(rep(1, 10), rep(3, 20)) ## Non-decimated Haar wavelet transform (detail coefficients) obj.nhwt.det <- nhwt(obj, type = "detail") ## Plot NHWT detail coefficients (all scales, global scaling) plot(obj.nhwt.det, coef = "detail", type = "global") ## Scaling coefficients obj.nhwt.sca <- nhwt(obj, type = "scaling") plot(obj.nhwt.sca, coef = "scaling", type = "global") ## Plot a single scale (scale 1) for detail and scaling coefficients plot(obj.nhwt.det, coef = "detail", scale = 1) plot(obj.nhwt.sca, coef = "scaling", scale = 1)
classifyPcaIca
Create boxplots of cross-validated performance measures
(classification error or AUC) for the feature sets used in
classifyPcaIca, namely PCA-based, ICA-based,
and original/segmented data features.
## S3 method for class 'pcaIcaClassifier' plot( x, type = c("CE", "AUC"), d.type = c("both", "PCA", "ICA"), adjust = TRUE, ... )## S3 method for class 'pcaIcaClassifier' plot( x, type = c("CE", "AUC"), d.type = c("both", "PCA", "ICA"), adjust = TRUE, ... )
x |
The result object returned by |
type |
Character string specifying which performance measure to plot.
One of |
d.type |
Character string indicating which feature types to include
in the plot: |
adjust |
Logical flag reserved for future extensions of the plotting function. Currently not used. |
... |
Additional graphical parameters passed to
|
This function assumes that classifyPcaIca assigns column
names to its CE and AUC matrices, with prefixes
"PC" for PCA-based features, "I" for ICA-based features,
and the column name "seg" for the original (or segmented) data matrix.
This function is used for its side effect of producing a plot and returns
invisible(NULL).
classifyPcaIca,
classifyWavFeatExt,
simulateCNA,
getPca,
getIca
set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) pca <- getPca(sim.dat, k = 4) ica <- getIca(sim.dat, k = 4) y <- factor(rep(c("Group1", "Group2"), each = 10)) res <- classifyPcaIca( sim.dat, y, pca, ica, method = "KNN", k = 5, ite = 1 ) plot(res, type = "CE") plot(res, type = "AUC", d.type = "PCA")set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) pca <- getPca(sim.dat, k = 4) ica <- getIca(sim.dat, k = 4) y <- factor(rep(c("Group1", "Group2"), each = 10)) res <- classifyPcaIca( sim.dat, y, pca, ica, method = "KNN", k = 5, ite = 1 ) plot(res, type = "CE") plot(res, type = "AUC", d.type = "PCA")
classifyWavFeatExt
Create boxplots of cross-validated performance measures (classification
error or AUC) for wavelet-based feature sets used in
classifyWavFeatExt(), namely detail and scaling coefficients,
and the original/segmented data.
## S3 method for class 'wavFeatExtClassifier' plot(x, type = c("CE", "AUC"), adjust = TRUE, ...)## S3 method for class 'wavFeatExtClassifier' plot(x, type = c("CE", "AUC"), adjust = TRUE, ...)
x |
The result object returned by |
type |
Character string specifying which performance measure to plot.
One of |
adjust |
Logical flag reserved for future extensions of the plotting function. Currently not used. |
... |
Additional graphical parameters passed to
|
It is assumed that classifyWavFeatExt() assigns column names to its CE
and AUC matrices as follows:
Columns beginning with "D" correspond to wavelet detail coefficients.
Columns beginning with "S" correspond to wavelet scaling coefficients.
The column "seg" corresponds to the original (or segmented) data.
The corresponding columns are displayed as boxplots summarising cross-validated performance across replications.
Two horizontal reference lines are added:
Red dashed line: median performance of "seg"
Blue dotted line: best median performance among all features
Invisibly returns NULL.
classifyWavFeatExt(),
wavFeatExt(),
simulateCNA()
set.seed(10) ## Small simulated data for fast execution sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) ## Extract wavelet features det.coef <- wavFeatExt(sim.dat, type = "detail") sca.coef <- wavFeatExt(sim.dat, type = "scaling") ## Binary response y <- factor(rep(c("Group1", "Group2"), each = 10)) ## Classification (fast method) res <- classifyWavFeatExt( sim.dat, y, det.coef, sca.coef, method = "KNN", k = 5, ite = 1 ) ## Plot results plot(res, type = "CE") plot(res, type = "AUC")set.seed(10) ## Small simulated data for fast execution sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) ## Extract wavelet features det.coef <- wavFeatExt(sim.dat, type = "detail") sca.coef <- wavFeatExt(sim.dat, type = "scaling") ## Binary response y <- factor(rep(c("Group1", "Group2"), each = 10)) ## Classification (fast method) res <- classifyWavFeatExt( sim.dat, y, det.coef, sca.coef, method = "KNN", k = 5, ite = 1 ) ## Plot results plot(res, type = "CE") plot(res, type = "AUC")
Applies circular binary segmentation (CBS) to each row of a data matrix.
This function is a convenience wrapper around CBS for
performing segmentation per observation.
seg(test.noise, denoise = "CBS", verbose = FALSE)seg(test.noise, denoise = "CBS", verbose = FALSE)
test.noise |
Numeric matrix or data frame containing the noisy data to be segmented. Rows correspond to observations (e.g. samples or patients) and columns to genomic locations or predictor variables. |
denoise |
Character string specifying the segmentation method.
Currently only |
verbose |
Logical. If |
For each row of test.noise, the function calls CBS
to perform circular binary segmentation using the DNAcopy
infrastructure. The result is a matrix of segmented values with the
same dimensions as the input, where each row represents the segmented
copy-number (or intensity) profile for that observation.
The denoise argument is retained for future extensions, where
alternative segmentation methods might be implemented.
A numeric matrix of the same dimension as test.noise,
containing the segmented profiles for all observations. Each row
corresponds to one observation and each column to one position.
Maharani Ummi [email protected]
Olshen, A. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5(4), 557–572.
set.seed(10) cn <- replicate( 10, rnorm(100, mean = c(rep(2, 50), rep(1, 50)), sd = 0.4) ) cn <- t(cn) cn.seg <- seg(cn, denoise = "CBS", verbose = FALSE) dim(cn.seg) par(mfrow = c(1, 2)) plot(cn[1, ], type = "p", main = "Observation 1", xlab = "Position", ylab = "Copy-number") lines(cn.seg[1, ], col = 2, lwd = 2) plot(cn[2, ], type = "p", main = "Observation 2", xlab = "Position", ylab = "Copy-number") lines(cn.seg[2, ], col = 2, lwd = 2)set.seed(10) cn <- replicate( 10, rnorm(100, mean = c(rep(2, 50), rep(1, 50)), sd = 0.4) ) cn <- t(cn) cn.seg <- seg(cn, denoise = "CBS", verbose = FALSE) dim(cn.seg) par(mfrow = c(1, 2)) plot(cn[1, ], type = "p", main = "Observation 1", xlab = "Position", ylab = "Copy-number") lines(cn.seg[1, ], col = 2, lwd = 2) plot(cn[2, ], type = "p", main = "Observation 2", xlab = "Position", ylab = "Copy-number") lines(cn.seg[2, ], col = 2, lwd = 2)
Generates simulated copy-number alteration (CNA) data for a two-group setting, under a multivariate normal model with block correlation structure. Mean differences between the two groups are introduced in selected blocks, and each simulated profile is subsequently segmented using circular binary segmentation (CBS).
simulateCNA( n.obs = 100, p = 1024, n.sim = 1, effect.diff = 1, n.block = 32, true.rho = 0.9, block.cor = 0.4, block.diff = "A", true.mu = NULL, verbose = FALSE )simulateCNA( n.obs = 100, p = 1024, n.sim = 1, effect.diff = 1, n.block = 32, true.rho = 0.9, block.cor = 0.4, block.diff = "A", true.mu = NULL, verbose = FALSE )
n.obs |
Number of observations in each simulated data set. Must be even, so that the first half belong to group 1 and the second half to group 2. |
p |
Number of variables (genomic locations) in each simulated data set. It is recommended to use a power of two for compatibility with subsequent wavelet-based analyses. |
n.sim |
Number of data sets to generate. Each data set has
|
effect.diff |
Magnitude of the mean difference between the two groups in the blocks where differences are present. |
n.block |
Number of correlation blocks along the genome. Must be a
divisor of |
true.rho |
Correlation between genomic regions within a block. |
block.cor |
Correlation between adjacent blocks. A value of zero implies that blocks are independent from each other. |
block.diff |
Pattern of blocks that carry mean differences between
the two groups. Currently one of |
true.mu |
Optional numeric vector of length |
verbose |
Logical. If |
The function first constructs a block-structured covariance matrix
of dimension p x p, with within-block correlations
given by true.rho and between-block correlations given by
block.cor. Copy-number data are then simulated from a multivariate
normal distribution with mean vector true.mu and covariance
.
To create a two-group setting, a mean shift vector is added to the first
half of the observations (group 1), while the second half (group 2)
remains at the baseline mean. The pattern of mean shifts depends on
block.diff.
Finally, for each simulated data set, circular binary segmentation (CBS)
is applied to every observation (row) via seg and
CBS. The returned data therefore correspond to segmented
CNA profiles rather than raw multivariate normal values.
A list of length n.sim. Each element is a numeric matrix of
dimension n.obs x p, containing the segmented CNA profiles for
one simulated data set.
Arief Gusnanto
seg,
CBS,
wavFeatExt,
classifyPcaIca,
classifyWavFeatExt
set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) length(sim.dat) dim(sim.dat[[1]]) plot(sim.dat[[1]][1, ], type = "l", xlab = "Genomic location", ylab = "Segmented CNA", main = "Simulated segmented CNA profile")set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) length(sim.dat) dim(sim.dat[[1]]) plot(sim.dat[[1]][1, ], type = "l", xlab = "Genomic location", ylab = "Segmented CNA", main = "Simulated segmented CNA profile")
Performs feature extraction from copy-number (or other one-dimensional) profiles using the non-decimated (stationary) Haar wavelet transform. The function computes detail or scaling coefficients at multiple scales for each observation and organises them into a structured list.
wavFeatExt(data, type = c("detail", "scaling"))wavFeatExt(data, type = c("detail", "scaling"))
data |
Either a numeric matrix, a list of numeric matrices (with rows
corresponding to observations and columns to variables or genomic
locations), or the output from |
type |
Character string indicating which type of coefficients to
extract from the non-decimated wavelet transform. One of |
For each matrix in data, the function applies a non-decimated
Haar wavelet transform (via nhwt) to each row. This yields,
for each scale, a matrix of wavelet coefficients of the same dimension
as the original data matrix.
The input data is first normalised to a list of matrices. If a
single matrix is provided, it is internally wrapped into a list of length
one. When the input is the output of simulateCNA, each
element corresponds to one simulated data set.
The resulting object is a list of length equal to the number of matrices in the input. Each element of this list is itself a list of length equal to the number of available scales. At each scale, a numeric matrix contains the wavelet coefficients (detail or scaling) for all observations.
A list of length equal to the number of matrices in data.
For each data matrix, an element is returned which is a list of length
, where is the number of scales determined by the
wavelet transform. Each sub-list element is a numeric matrix of the
same dimension as the original data matrix.
Maharani Ahsani Ummi and Arief Gusnanto
Nason, G. P. (2008). Wavelet Methods in Statistics with R. Springer.
nhwt,
simulateCNA,
classifyWavFeatExt,
classifyPcaIca
set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) det.coef <- wavFeatExt(sim.dat, type = "detail") sca.coef <- wavFeatExt(sim.dat, type = "scaling") length(det.coef) length(sca.coef)set.seed(10) sim.dat <- simulateCNA( n.obs = 20, p = 32, n.sim = 1, n.block = 8, verbose = FALSE ) det.coef <- wavFeatExt(sim.dat, type = "detail") sca.coef <- wavFeatExt(sim.dat, type = "scaling") length(det.coef) length(sca.coef)