Title: Light-Weight Methods for Normalization and Visualization of Microarray Data using Only Basic R Data Types
Description: Methods for microarray analysis that take basic data types such as matrices and lists of vectors. These methods can be used standalone, be utilized in other packages, or be wrapped up in higher-level classes.
Authors: Henrik Bengtsson [aut, cre, cph], Pierre Neuvial [ctb], Aaron Lun [ctb]
Maintainer: Henrik Bengtsson <[email protected]>
License: GPL (>= 2)
Version: 3.37.0
Built: 2024-11-29 03:22:23 UTC
Source: https://github.com/bioc/aroma.light
To install this package, see https://bioconductor.org/packages/release/bioc/html/aroma.light.html.
For scanner calibration:
see calibrateMultiscan() - scan the same array two or more times to calibrate for scanner effects and extended dynamical range.
To normalize multiple single-channel arrays, all with the same number of probes/spots:
normalizeAffine() - normalizes, on the intensity scale, for differences in offset and scale between channels.
normalizeQuantileRank(), normalizeQuantileSpline() - normalize, on the intensity scale, for differences in empirical distribution between channels.
To normalize multiple single-channel arrays with a varying number of probes/spots:
normalizeQuantileRank(), normalizeQuantileSpline() - normalize, on the intensity scale, for differences in empirical distribution between channels.
To normalize two-channel arrays:
normalizeAffine() - normalizes, on the intensity scale, for differences in offset and scale between channels. This will also correct for intensity-dependent effects on the log scale.
normalizeCurveFit() - classical intensity-dependent normalization, on the log scale, e.g. lowess normalization.
To normalize three or more channels:
normalizeAffine() - normalizes, on the intensity scale, for differences in offset and scale between channels. This will minimize the curvature on the log scale between any two channels.
Several of the normalization methods proposed in [1]-[7] are available in this package.
Whenever using this package, please cite one or more of [1]-[7].
Here is a list of features that would be useful, but which I have too little time to add myself. Contributions are appreciated.
At the moment, nothing.
If you are considering contributing, make sure the feature is not already implemented by downloading the latest "devel" version!
The releases of this package are licensed under GPL version 2 or newer.
NB: Except for the robustSmoothSpline() method, it is alright to distribute the rest of the package under LGPL version 2.1 or newer.
The development code of the package is under a private licence (where applicable), and patches sent to the author fall under the latter license, but will, if incorporated, be released under the "release" license above.
Henrik Bengtsson, Pierre Neuvial, Aaron Lun
Some of the references below can be found at https://www.aroma-project.org/publications/.
[1] H. Bengtsson, Identification and normalization of plate effects in cDNA microarray data, Preprints in Mathematical Sciences, 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2002.
[2] H. Bengtsson, The R.oo package - Object-Oriented Programming with References Using Standard R Code, In Kurt Hornik, Friedrich Leisch and Achim Zeileis, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), March 20-22, Vienna, Austria. http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/
[3] H. Bengtsson, aroma - An R Object-oriented Microarray Analysis environment, Preprints in Mathematical Sciences (manuscript in preparation), Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2004.
[4] H. Bengtsson, J. Vallon-Christersson and G. Jönsson, Calibration and assessment of channel-specific biases in microarray data with extended dynamical range, BMC Bioinformatics, 5:177, 2004.
[5] H. Bengtsson and O. Hössjer, Methodological study of affine transformations of gene expression data with proposed robust non-parametric multi-dimensional normalization method, BMC Bioinformatics, 2006, 7:100.
[6] H. Bengtsson, R. Irizarry, B. Carvalho, and T. Speed, Estimation and assessment of raw copy numbers at the single locus level, Bioinformatics, 2008.
[7] H. Bengtsson, A. Ray, P. Spellman and T.P. Speed, A single-sample method for normalizing and combining full-resolution copy numbers from multiple platforms, labs and analysis methods, Bioinformatics, 2009.
[8] H. Bengtsson, P. Neuvial and T.P. Speed, TumorBoost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays, BMC Bioinformatics, 2010, 11:245. [PMID 20462408]
In this section we give our recommendation on how spotted two-color (or multi-color) microarray data is best calibrated and normalized.
We do not recommend background subtraction in the classical sense, where background is estimated by various image-analysis methods. This means that we consider only foreground signals in the analysis.
We estimate "background" by other means. In what is explained below, only a global background, that is, a global bias, is estimated and removed.
In Bengtsson et al (2004) we give evidence that microarray scanners can introduce a significant bias in data. This bias, which is about 15-25 out of 65535, will introduce intensity dependency in the log-ratios, as explained in Bengtsson & Hössjer (2006).
In Bengtsson et al (2004) we find that this bias is stable across arrays (and a couple of months), but further research is needed in order to tell if this is true over a longer time period.
To calibrate signals for scanner biases, scan the same array at three or more (K >= 3) different PMT settings (preferably in decreasing order). While doing this, do not adjust the laser power settings. Also, do the multiscan without washing, cleaning, or by other means changing the array between subsequent scans. Although not necessary, it is preferred that the array remains in the scanner between subsequent scans. This simplifies the image analysis, since spot identification can be done once if the images align perfectly.
After image analysis, read all K scans for the same array into two matrices, one for the red and one for the green channel, where the K columns correspond to scans and the N rows to the spots. It is enough to use foreground signals.
In order to multiscan calibrate the data, for each channel separately call Xc <- calibrateMultiscan(X), where X is the NxK matrix of signals for one channel across all scans. The calibrated signals are returned in the Nx1 matrix Xc.
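A minimal sketch of this step on simulated data; the signal distribution, offsets, and gains below are hypothetical, chosen only to mimic a K=3 multiscan of one channel:

N <- 1000
x <- rexp(N, rate=1/1000)       # true signals (hypothetical)
a <- c(20, 25, 15)              # scanner offsets (hypothetical)
b <- c(1.0, 0.8, 0.4)           # PMT gains (hypothetical)
X <- sapply(1:3, FUN=function(k) a[k] + b[k]*x + rnorm(N, sd=0.01*(1 + b[k]*x)))
Xc <- calibrateMultiscan(X)     # Nx1 matrix of calibrated signals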
Multiscan calibration may sometimes be skipped, especially if affine normalization is applied immediately after, but we do recommend that every lab checks at least once whether their scanner introduces bias.
If the offsets in a scanner are already estimated from earlier multiscan analyses, or known by other means, they can readily be subtracted from the signals of each channel. If arrays are still multiscanned, it is possible to force the calibration method to fit the model with zero intercept (assuming the scanner offsets have been subtracted) by adding argument center=FALSE.
In Bengtsson & Hössjer (2006), we carry out a detailed study on how biases in each channel introduce so-called intensity-dependent log-ratios, among other systematic artifacts. Data with (additive) bias in each channel is said to be affinely transformed. Data without such bias is said to be linearly (proportionally) transformed. Ideally, observed signals (data) are a linear (proportional) function of the true gene expression levels.
We do not assume proportional observations. The scanner bias is real evidence that assuming linearity is not correct. Affine normalization corrects for affine transformations of the data. Without control spots it is not possible to estimate the bias in each of the channels, but only the relative bias, such that after normalization the effective bias is the same in all channels. This is why we call it normalization and not calibration.
In its simplest form, affine normalization is done by Xn <- normalizeAffine(X), where X is an Nx2 matrix in which the first column holds the foreground signals from the red channel and the second holds the signals from the green channel. If three- or four-channel data is used, these are added the same way. The normalized data is returned as an Nx2 matrix Xn.
To normalize all arrays and all channels at once, one may put all data into one big NxK matrix, where the K columns hold all channels from the first array, then all channels from the second array, and so on. Then Xn <- normalizeAffine(X) will return the across-array and across-channel normalized data in the NxK matrix Xn, where the columns are stored in the same order as in matrix X.
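As a sketch, the two cases above on simulated two-array, two-channel data (all offsets and scale factors are hypothetical):

N <- 1000
x <- rexp(N, rate=1/500)            # true expression levels (hypothetical)
X <- cbind(30 + 1.0*x, 20 + 0.8*x,  # array 1: red, green
           25 + 1.2*x, 15 + 0.6*x)  # array 2: red, green
X <- X + rnorm(length(X), sd=1)     # measurement noise
Xn2ch <- normalizeAffine(X[,1:2])   # one array at a time (red, green)
Xn <- normalizeAffine(X)            # all arrays and channels at once
M1 <- log2(Xn[,1]/Xn[,2])           # log-ratios for array 1; NaN where a
                                    # normalized signal is non-positive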
Having equal effective bias in all channels is a major improvement. First of all, any intensity-dependent bias in the log-ratios is removed for all non-differentially expressed genes. There is still an intensity-dependent bias in the log-ratios for differentially expressed genes, but it is now symmetric around log-ratio zero.
Affine normalization will (by default, and as recommended) normalize all arrays together and at once. This guarantees that all arrays are "on the same scale". Thus, it is not recommended to apply a classical between-array scale normalization afterward. Moreover, the average log-ratio will be zero after an affine normalization.
Note that an affine normalization will only remove curvature in the log-ratios at lower intensities. If a strong intensity-dependent bias at high intensities remains, this is most likely due to saturation effects, such as too high PMT settings or quenching.
Note that for a perfect affine normalization you should expect much higher noise levels in the log-ratios at lower intensities than at higher. It should also be approximately symmetric around zero log-ratio. In other words, a strong fanning effect is a good sign.
Due to different noise levels in red and green channels, different PMT settings in different channels, plus the fact that the minimum signal is zero, "odd shapes" may be seen in the log-ratio vs log-intensity graphs at lower intensities. Typically, these show themselves as non-symmetric in positive and negative log-ratios. Note that you should not see this at higher intensities.
If there is a strong intensity-dependent effect left after the affine normalization, we recommend, for now, that a subsequent curve-fit or quantile normalization is done. Which one, we do not know.
Why negative signals?
By default, 5% of the normalized signals will have a non-positive signal in one or both channels. This is on purpose, although the exact number 5% is chosen by experience. The reason for introducing negative signals is that they are indeed expected. For instance, when measuring a zero gene expression level, there is a chance that the observed value is (should be) negative due to measurement noise. (For this reason it is possible that the scanner manufacturers have introduced scanner bias on purpose to avoid negative signals, which would otherwise all be truncated to zero.)
To adjust the ratio (or number) of negative signals allowed, use for example normalizeAffine(X, constraint=0.01) for 1% negative signals. If set to zero (or "max"), only as much bias is removed such that no negative signals exist afterward. Note that this is also true if there were negative signals beforehand.
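For instance (a sketch, with X as above):

Xn1 <- normalizeAffine(X, constraint=0.01)    # allow 1% non-positive signals
Xn2 <- normalizeAffine(X, constraint="max")   # remove only as much bias as is
                                              # possible without introducing any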
Why not lowess normalization? Curve-fit normalization methods such as lowess normalization are basically designed based on linearity assumptions and will for this reason not correct for channel biases. Curve-fit normalization methods can by definition only be applied to one pair of channels at a time and therefore require a subsequent between-array scale normalization, which is, by the way, very ad hoc.
Why not quantile normalization? Affine normalization can be thought of as a special case of quantile normalization that is more robust than the latter. See Bengtsson & Hössjer (2006) for details. Quantile normalization is probably better to apply than curve-fit normalization methods, but is less robust than affine normalization, especially at extreme (low and high) intensities. For this reason, we recommend using affine normalization first, and if this is not satisfactory, quantile normalization may be applied.
If the channel offsets are zero, already corrected for, or estimated by other means, it is possible to normalize the data robustly by fitting the above affine model without intercept, that is, fitting a truly linear model. This is done by adding argument center=FALSE when calling normalizeAffine().
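Putting the recommendations together, a sketch of the full pipeline for one two-color array scanned K times; Rscans and Gscans are hypothetical NxK matrices of foreground signals:

Rc <- calibrateMultiscan(Rscans)      # Nx1 calibrated red signals
Gc <- calibrateMultiscan(Gscans)      # Nx1 calibrated green signals
Xn <- normalizeAffine(cbind(Rc, Gc))  # affinely normalized two-channel data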
Henrik Bengtsson
Gets the average empirical distribution for a set of samples.
## S3 method for class 'list'
averageQuantile(X, ...)

## S3 method for class 'matrix'
averageQuantile(X, ...)
X: A list of K numeric vectors, or a numeric NxK matrix.
...: Not used.
Returns a numeric vector of length equal to the longest vector in argument X.
Missing values are excluded.
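A small sketch of typical use, with samples of unequal lengths:

X <- list(rnorm(1000), rexp(1500), runif(800))
xTarget <- averageQuantile(X)
length(xTarget)   # 1500, i.e. the length of the longest vector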
Parts adopted from Gordon Smyth (http://www.statsci.org/) in 2002 & 2006. Original code by Ben Bolstad at Statistics Department, University of California.
normalizeQuantileRank(), normalizeQuantileSpline(), quantile.
Reverse affine transformation.
## S3 method for class 'matrix'
backtransformAffine(X, a=NULL, b=NULL, project=FALSE, ...)
X: An NxK matrix to be backtransformed.
a: A scalar, a vector, or a matrix specifying the offset a to be subtracted. If NULL, no offset is subtracted.
b: A scalar, a vector, or a matrix specifying the scale factor b to be divided by. If NULL, no rescaling is done.
project: If FALSE, an NxK matrix is returned (K values per data point are returned). If TRUE, the backtransformed values are projected onto the line (1,...,1) such that an Nx1 matrix is returned (one value per data point).
...: Not used.
The "(X-a)/b
" backtransformed NxK matrix
is returned.
If project
is TRUE
, an Nx1 matrix
is returned, because
all columns are identical anyway.
Missing values remain missing values. If projected, data points that contain missing values are projected without these.
X <- matrix(1:8, nrow=4, ncol=2)
X[2,2] <- NA
print(X)

# Returns a 4x2 matrix
print(backtransformAffine(X, a=c(1,5)))

# Returns a 4x2 matrix
print(backtransformAffine(X, b=c(1,1/2)))

# Returns a 4x2 matrix
print(backtransformAffine(X, a=matrix(1:4,ncol=1)))

# Returns a 4x2 matrix
print(backtransformAffine(X, a=matrix(1:3,ncol=1)))

# Returns a 4x2 matrix
print(backtransformAffine(X, a=matrix(1:2,ncol=1), b=c(1,2)))

# Returns a 4x1 matrix
print(backtransformAffine(X, b=c(1,1/2), project=TRUE))

# If the columns of X are identical, and an identity
# backtransformation is applied and projected, the
# same matrix is returned.
X <- matrix(1:4, nrow=4, ncol=3)
Y <- backtransformAffine(X, b=c(1,1,1), project=TRUE)
print(X)
print(Y)
stopifnot(sum(X[,1]-Y) <= .Machine$double.eps)

# If the columns of X are proportional with factors b, a
# backtransformation with those factors followed by projection
# recovers the first column.
X <- matrix(1:4, nrow=4, ncol=3)
X[,2] <- X[,2]*2; X[,3] <- X[,3]*3
print(X)
Y <- backtransformAffine(X, b=c(1,2,3))
print(Y)
Y <- backtransformAffine(X, b=c(1,2,3), project=TRUE)
print(Y)
stopifnot(sum(X[,1]-Y) <= .Machine$double.eps)
Reverse transformation of principal-curve fit.
## S3 method for class 'matrix'
backtransformPrincipalCurve(X, fit, dimensions=NULL, targetDimension=NULL, ...)

## S3 method for class 'numeric'
backtransformPrincipalCurve(X, ...)
X: An NxK matrix (or, for the numeric method, a vector of length N) containing the data to be backtransformed.
fit: An MxL principal-curve fit object of class principal_curve, as returned by fitPrincipalCurve().
dimensions: An (optional) subset of D dimensions, all in [1,L], to be returned (and backtransformed).
targetDimension: An (optional) index specifying the dimension in [1,L] to be used as the target dimension of the backtransformation.
...: Passed internally to smooth.spline.
Each column in X ("dimension") is backtransformed independently of the others.
The backtransformed NxK (or NxD) matrix.
By default, the backtransform is such that afterward the signals are approximately proportional to the (first) principal curve as fitted by fitPrincipalCurve(). The scale and origin of this principal curve are not uniquely defined. If targetDimension is specified, then the backtransformed signals are approximately proportional to the signals of the target dimension, and the signals in the target dimension are unchanged. Argument dimensions can be used to backtransform a subset of dimensions (K) based on a subset of the fitted dimensions (L).
If K == L, then both X and fit are subsetted. If K != L, then it is assumed that X is already subsetted/expanded and only fit is subsetted.
# Consider the case where K=4 measurements have been done
# for the same underlying signals 'x'.  The different measurements
# have different systematic variation
#
#   y_k = f(x_k) + eps_k; k = 1,...,K.
#
# In this example, we assume non-linear measurement functions
#
#   f(x) = a + b*x + x^c + eps(b*x)
#
# where 'a' is an offset, 'b' a scale factor, and 'c' an exponential.
# We also assume heteroscedastic zero-mean noise with standard
# deviation proportional to the rescaled underlying signal 'x'.
#
# Furthermore, we assume that measurements k=2 and k=3 undergo the
# same transformation, which may illustrate that they come from
# the same batch.  However, when *fitting* the model below we
# will assume they are independent.

# Transforms
a <- c(2, 15, 15, 3)
b <- c(2, 3, 3, 4)
c <- c(1, 2, 2, 1/2)
K <- length(a)

# The true signal
N <- 1000
x <- rexp(N)

# The noise
bX <- outer(b,x)
E <- apply(bX, MARGIN=2, FUN=function(x) rnorm(K, mean=0, sd=0.1*x))

# The transformed signals with noise
Xc <- t(sapply(c, FUN=function(c) x^c))
Y <- a + bX + Xc + E
Y <- t(Y)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Fit principal curve
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Fit principal curve through Y = (y_1, y_2, ..., y_K)
fit <- fitPrincipalCurve(Y)

# Flip direction of 'lambda'?
rho <- cor(fit$lambda, Y[,1], use="complete.obs")
flip <- (rho < 0)
if (flip) {
  fit$lambda <- max(fit$lambda, na.rm=TRUE)-fit$lambda
}
L <- ncol(fit$s)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Backtransform data according to model fit
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Backtransform toward the principal curve (the "common scale")
YN1 <- backtransformPrincipalCurve(Y, fit=fit)
stopifnot(ncol(YN1) == K)

# Backtransform toward the first dimension
YN2 <- backtransformPrincipalCurve(Y, fit=fit, targetDimension=1)
stopifnot(ncol(YN2) == K)

# Backtransform toward the last (fitted) dimension
YN3 <- backtransformPrincipalCurve(Y, fit=fit, targetDimension=L)
stopifnot(ncol(YN3) == K)

# Backtransform toward the first dimension (dimension by dimension)
# Note, this assumes that K == L.
YN4 <- Y
for (cc in 1:L) {
  YN4[,cc] <- backtransformPrincipalCurve(Y, fit=fit,
                                          targetDimension=1, dimensions=cc)
}
stopifnot(identical(YN4, YN2))

# Backtransform a subset toward the first dimension
# Note, this assumes that K == L.
YN5 <- backtransformPrincipalCurve(Y, fit=fit,
                                   targetDimension=1, dimensions=2:3)
stopifnot(identical(YN5, YN2[,2:3]))
stopifnot(ncol(YN5) == 2)

# Extract signals from measurement #2 and backtransform according to
# its model fit.  Signals are standardized to target dimension 1.
y6 <- Y[,2,drop=FALSE]
yN6 <- backtransformPrincipalCurve(y6, fit=fit, dimensions=2,
                                   targetDimension=1)
stopifnot(identical(yN6, YN2[,2,drop=FALSE]))
stopifnot(ncol(yN6) == 1)

# Extract signals from measurement #2 and backtransform according to
# the model fit of measurement #3 (because we believe these two have
# undergone very similar transformations).  Signals are standardized
# to target dimension 1.
y7 <- Y[,2,drop=FALSE]
yN7 <- backtransformPrincipalCurve(y7, fit=fit, dimensions=3,
                                   targetDimension=1)
stopifnot(ncol(yN7) == 1)
stopifnot(cor(yN7, yN6) > 0.9999)
Weighted affine calibration of a multiple re-scanned channel.
## S3 method for class 'matrix'
calibrateMultiscan(X, weights=NULL, typeOfWeights=c("datapoint"),
  method="L1", constraint="diagonal", satSignal=2^16 - 1, ...,
  average=median, deviance=NULL, project=FALSE, .fitOnly=FALSE)
X: An NxK matrix (K >= 2), where the K columns correspond to the scans and the N rows to the spots of one channel.
weights: If NULL, all data points get the same weight; otherwise a numeric vector of length N with data-point weights in [0,1].
typeOfWeights: A character string specifying the type of weights given in argument weights.
method: A character string specifying how the model is fitted; default is "L1" (see below).
constraint: Constraint making the bias parameters identifiable. See fitIWPCA() for more details.
satSignal: Signals equal to or above this threshold are considered saturated.
...: Other arguments passed to fitIWPCA().
average: A function used to average the calibrated signals across scans; by default the median.
deviance: An optional function for estimating the deviance of the averaged signals (returned as attribute deviance).
project: If TRUE, the calibrated data points are projected onto the fitted line such that an Nx1 matrix is returned.
.fitOnly: If TRUE, only the model fit is returned (no signals are calibrated).
Fitting is done by iterated re-weighted principal component analysis (IWPCA).
If average is specified or project is TRUE, an Nx1 matrix is returned, otherwise an NxK matrix is returned. If deviance is specified, a deviance Nx1 matrix is returned as attribute deviance. In addition, the fitted model is returned as attribute modelFit.
Affine multiscan calibration applies also to negative values, which are therefore also calibrated, if they exist.
Saturated signals in any scan are set to NA. Thus, they will not be used to estimate the calibration function, nor will they affect an optional projection. Only observations (rows) in X that contain all finite values are used in the estimation of the calibration functions. Thus, observations can be excluded by setting them to NA.
Each data point/observation, that is, each row in X, which is a vector of length K, can be assigned a weight in [0,1] specifying how much it should affect the fitting of the calibration function. Weights are given by argument weights, which should be a numeric vector of length N. Regardless of weights, all data points are calibrated based on the fitted calibration function.
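For instance, a sketch that downweights the first 100 data points when fitting the calibration function (the indices are hypothetical; the downweighted points are still calibrated afterward):

w <- rep(1, times=nrow(X))
w[1:100] <- 0.1
Xc <- calibrateMultiscan(X, weights=w)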
By default, the model fit of multiscan calibration is done in L1 (method="L1"). This way, outliers affect the parameter estimates less than ordinary least-squares methods do.
When calculating the average calibrated signal from multiple scans, by default the median is used, which further robustifies against outliers.
For further robustness, downweight outliers such as saturated signals, if possible.
Tukey's biweight function is supported, but not used by default, because then a "bandwidth" parameter has to be selected. This can indeed be done automatically by estimating the standard deviation, for instance using MAD. However, since scanner signals have heteroscedastic noise (the standard deviation is approximately proportional to the non-logged signal), Tukey's bandwidth parameter has to be a function of the signal too, cf. loess. We have experimented with this too, but found that it does not significantly improve the robustness compared to L1. Moreover, using Tukey's biweight as is, that is, assuming homoscedastic noise, seems to introduce a (scale-dependent) bias in the estimates of the offset terms.
If the scanner offsets can be assumed to be known, for instance from prior multiscan analyses on the scanner, then it is possible to fit the scanner model with no (zero) offset by specifying argument center=FALSE. Note that you cannot specify the offset; instead, subtract it from all signals before calibrating, e.g. Xc <- calibrateMultiscan(X-e, center=FALSE), where e is the scanner offset (a scalar). You can assert that the model is fitted without offset by stopifnot(all(attr(Xc, "modelFit")$adiag == 0)).
Henrik Bengtsson
[1] H. Bengtsson, J. Vallon-Christersson and G. Jönsson, Calibration and assessment of channel-specific biases in microarray data with extended dynamical range, BMC Bioinformatics, 5:177, 2004.
1. Calibration and Normalization. normalizeAffine().
## Not run:
# For an example, see help(normalizeAffine).
Calls genotypes in a normal sample.
## S3 method for class 'numeric'
callNaiveGenotypes(y, cn=rep(2L, times = length(y)), ..., modelFit=NULL,
  verbose=FALSE)
y: A numeric vector of length J containing allele B fractions.
cn: An optional numeric vector of length J specifying the true total copy number at each locus; by default two (diploid).
...: Additional arguments passed to fitNaiveGenotypes().
modelFit: An optional model fit as returned by fitNaiveGenotypes().
verbose: A logical or a Verbose object.
Returns a numeric vector of length J containing the genotype calls in allele B fraction space, that is, in [0,1], where 1/2 corresponds to a heterozygous call, and 0 and 1 correspond to homozygous A and B, respectively. Non-called genotypes have value NA. A missing value always gives a missing (NA) genotype call. Negative infinity (-Inf) always gives genotype call 0. Positive infinity (+Inf) always gives genotype call 1.
Henrik Bengtsson
Internally, fitNaiveGenotypes() is used to identify the thresholds.
layout(matrix(1:3, ncol=1))
par(mar=c(2,4,4,1)+0.1)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# A bimodal distribution
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
xAA <- rnorm(n=10000, mean=0, sd=0.1)
xBB <- rnorm(n=10000, mean=1, sd=0.1)
x <- c(xAA,xBB)
fit <- findPeaksAndValleys(x)
print(fit)
calls <- callNaiveGenotypes(x, cn=rep(1,length(x)), verbose=-20)
xc <- split(x, calls)
print(table(calls))
xx <- c(list(x),xc)
plotDensity(xx, adjust=1.5, lwd=2, col=seq_along(xx), main="(AA,BB)")
abline(v=fit$x)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# A trimodal distribution with missing values
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
xAB <- rnorm(n=10000, mean=1/2, sd=0.1)
x <- c(xAA,xAB,xBB)
x[sample(length(x), size=0.05*length(x))] <- NA
x[sample(length(x), size=0.01*length(x))] <- -Inf
x[sample(length(x), size=0.01*length(x))] <- +Inf
fit <- findPeaksAndValleys(x)
print(fit)
calls <- callNaiveGenotypes(x)
xc <- split(x, calls)
print(table(calls))
xx <- c(list(x),xc)
plotDensity(xx, adjust=1.5, lwd=2, col=seq_along(xx), main="(AA,AB,BB)")
abline(v=fit$x)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# A trimodal distribution with clear separation
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
xAA <- rnorm(n=10000, mean=0, sd=0.02)
xAB <- rnorm(n=10000, mean=1/2, sd=0.02)
xBB <- rnorm(n=10000, mean=1, sd=0.02)
x <- c(xAA,xAB,xBB)
fit <- findPeaksAndValleys(x)
print(fit)
calls <- callNaiveGenotypes(x)
xc <- split(x, calls)
print(table(calls))
xx <- c(list(x),xc)
plotDensity(xx, adjust=1.5, lwd=2, col=seq_along(xx), main="(AA',AB',BB')")
abline(v=fit$x)
Finds the shortest distance between two lines.
Consider the two lines x(s) = a_x + b_x*s and y(t) = a_y + b_y*t in K-space, where a_x and b_x are the offset and direction vectors (in R^K) that define the line x(s) (s is a scalar), and similarly a_y and b_y define the line y(t). This function finds the point (s,t) for which the distance |x(s) - y(t)| is minimal.
## Default S3 method:
distanceBetweenLines(ax, bx, ay, by, ...)
ax, bx: Offset and direction vector (each of length K) for line x(s).
ay, by: Offset and direction vector (each of length K) for line y(t).
...: Not used.
Returns a list containing:
ax, bx: The given line x(s).
ay, by: The given line y(t).
s, t: The values of s and t for which the distance is minimal.
xs, yt: The points x(s) and y(t) at the minimal distance.
distance: The distance between the lines, i.e. |x(s) - y(t)|.
Henrik Bengtsson
[1] M. Bard and D. Himel, The Minimum Distance Between Two Lines in n-Space, September 2001, Advisor Dennis Merino.
[2] Dan Sunday, Distance between 3D Lines and Segments, Jan 2016, https://www.geomalgorithms.com/algorithms.html.
for (zzz in 0) {

# This example requires plot3d() in R.basic [http://www.braju.com/R/]
if (!require(pkgName <- "R.basic", character.only=TRUE)) break

layout(matrix(1:4, nrow=2, ncol=2, byrow=TRUE))

############################################################
# Lines in two-dimensions
############################################################
x <- list(a=c(1,0), b=c(1,2))
y <- list(a=c(0,2), b=c(1,1))
fit <- distanceBetweenLines(ax=x$a, bx=x$b, ay=y$a, by=y$b)

xlim <- ylim <- c(-1,8)
plot(NA, xlab="", ylab="", xlim=ylim, ylim=ylim)

# Highlight the offset coordinates for both lines
points(t(x$a), pch="+", col="red")
text(t(x$a), label=expression(a[x]), adj=c(-1,0.5))
points(t(y$a), pch="+", col="blue")
text(t(y$a), label=expression(a[y]), adj=c(-1,0.5))

v <- c(-1,1)*10
xv <- list(x=x$a[1]+x$b[1]*v, y=x$a[2]+x$b[2]*v)
yv <- list(x=y$a[1]+y$b[1]*v, y=y$a[2]+y$b[2]*v)
lines(xv, col="red")
lines(yv, col="blue")

points(t(fit$xs), cex=2.0, col="red")
text(t(fit$xs), label=expression(x(s)), adj=c(+2,0.5))
points(t(fit$yt), cex=1.5, col="blue")
text(t(fit$yt), label=expression(y(t)), adj=c(-1,0.5))
print(fit)

############################################################
# Lines in three-dimensions
############################################################
x <- list(a=c(0,0,0), b=c(1,1,1))   # The 'diagonal'
y <- list(a=c(2,1,2), b=c(2,1,3))   # A 'fitted' line
fit <- distanceBetweenLines(ax=x$a, bx=x$b, ay=y$a, by=y$b)

xlim <- ylim <- zlim <- c(-1,3)
dummy <- t(c(1,1,1))*100

# Coordinates for the lines in 3d
v <- seq(-10,10, by=1)
xv <- list(x=x$a[1]+x$b[1]*v, y=x$a[2]+x$b[2]*v, z=x$a[3]+x$b[3]*v)
yv <- list(x=y$a[1]+y$b[1]*v, y=y$a[2]+y$b[2]*v, z=y$a[3]+y$b[3]*v)

for (theta in seq(30,140,length.out=3)) {
  plot3d(dummy, theta=theta, phi=30, xlab="", ylab="", zlab="",
         xlim=ylim, ylim=ylim, zlim=zlim)

  # Highlight the offset coordinates for both lines
  points3d(t(x$a), pch="+", col="red")
  text3d(t(x$a), label=expression(a[x]), adj=c(-1,0.5))
  points3d(t(y$a), pch="+", col="blue")
  text3d(t(y$a), label=expression(a[y]), adj=c(-1,0.5))

  # Draw the lines
  lines3d(xv, col="red")
  lines3d(yv, col="blue")

  # Draw the two points that are closest to each other
  points3d(t(fit$xs), cex=2.0, col="red")
  text3d(t(fit$xs), label=expression(x(s)), adj=c(+2,0.5))
  points3d(t(fit$yt), cex=1.5, col="blue")
  text3d(t(fit$yt), label=expression(y(t)), adj=c(-1,0.5))

  # Draw the distance between the two points
  lines3d(rbind(fit$xs,fit$yt), col="purple", lwd=2)
}
print(fit)

} # for (zzz in 0)
rm(zzz)
Robust fit of linear subspace through multidimensional data.
## S3 method for class 'matrix'
fitIWPCA(X, constraint=c("diagonal", "baseline", "max"), baselineChannel=NULL,
  ..., aShift=rep(0, times = ncol(X)), Xmin=NULL)
X: An NxK matrix where the K columns correspond to the channels and the N rows to the data points.
constraint: A character string specifying the identifiability constraint: "diagonal" (default), "baseline", or "max". If "baseline", argument baselineChannel must be specified.
baselineChannel: Index of the channel toward which all other channels are conformed. This argument is required if constraint == "baseline".
...: Additional arguments accepted by iwpca().
aShift, Xmin: For internal use only.
This method uses iteratively re-weighted principal component analysis (IWPCA) to fit the model

  x_n = a + b*l_n + eps_n,

where x_n, a, b, and eps_n are vectors of length K and l_n is a scalar.
The algorithm is:
For iteration i:
1) Fit a line L through the data cloud using weighted PCA with weights {w_n}. Let r_n = (r_n1, ..., r_nK) be the residuals, i.e. the principal components orthogonal to L.
2) Update the weights as w_n = 1/(||r_n|| + eps), where we have used the residuals of all but the first principal component.
3) Find the point a on L that is closest to the diagonal line D = {(s, ..., s); s real}. Similarly, denote the point on D that is closest to L by t.
Returns a list that contains estimated parameters and algorithm details:
a: A vector of length K with the estimated offset parameters.
b: A vector of length K with the estimated scale parameters.
adiag: If the identifiability constraint is "diagonal", the estimated offset along the diagonal; otherwise zeros.
eigen: A KxK matrix with the eigenvectors of the fit.
converged: TRUE if the algorithm converged, otherwise FALSE.
nbrOfIterations: The number of iterations for the algorithm to converge, or zero if it did not converge.
t0: Internal parameter estimates, which contain no more information than the above listed elements.
t: Always NULL.
Henrik Bengtsson
This is an internal method used by the calibrateMultiscan() and normalizeAffine() methods. Internally, the function iwpca() is used to fit a line through the data cloud and the function distanceBetweenLines() to find the closest point to the diagonal (1,1,...,1).
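A small sketch of calling it directly (normally one uses normalizeAffine() or calibrateMultiscan() instead; the simulated data below are hypothetical):

N <- 500
x <- rexp(N)
X <- cbind(2 + 2*x, 15 + 3*x, 3 + 4*x) + rnorm(3*N, sd=0.1)
fit <- fitIWPCA(X, constraint="diagonal")
str(fit[c("a", "b", "converged")])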
Fit naive genotype model from a normal sample.
## S3 method for class 'numeric'
fitNaiveGenotypes(y, cn=rep(2L, times = length(y)), subsetToFit=NULL,
  flavor=c("density", "fixed"), adjust=1.5, ..., censorAt=c(-0.1, 1.1),
  verbose=FALSE)
y: A numeric vector of length J containing allele B fractions.
cn: An optional numeric vector of length J specifying the true total copy number at each locus; by default two (diploid).
subsetToFit: An optional integer or logical vector specifying which loci to use when fitting the model; if NULL, all loci are used.
flavor: A character string specifying the type of algorithm used.
adjust: A positive double specifying the amount of smoothing of the empirical density estimate.
...: Additional arguments passed to findPeaksAndValleys().
censorAt: A double vector of length two specifying the range beyond which observations are censored before fitting.
verbose: A logical or a Verbose object.
Henrik Bengtsson
To call genotypes, see callNaiveGenotypes(). Internally, findPeaksAndValleys() is used to identify the thresholds.
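A small sketch on simulated allele B fractions, reusing the fit when calling genotypes (the mixture parameters are hypothetical):

y <- c(rnorm(500, mean=0, sd=0.05),
       rnorm(500, mean=1/2, sd=0.05),
       rnorm(500, mean=1, sd=0.05))
fit <- fitNaiveGenotypes(y)
calls <- callNaiveGenotypes(y, modelFit=fit)
print(table(calls))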
Fit a principal curve in K dimensions.
## S3 method for class 'matrix'
fitPrincipalCurve(X, ..., verbose=FALSE)
X: An NxK matrix where the K columns correspond to the dimensions (e.g. channels).
...: Other arguments passed to principal_curve().
verbose: A logical or a Verbose object.
Returns a principal_curve object (which is a list). See principal_curve for more details.
The estimation of the normalization function will only be made based on complete observations, i.e. observations that contain no NA values in any of the channels.
Henrik Bengtsson
[1] Hastie, T. and Stuetzle, W, Principal Curves, JASA, 1989.
[2] H. Bengtsson, A. Ray, P. Spellman and T.P. Speed, A single-sample method for normalizing and combining full-resolution copy numbers from multiple platforms, labs and analysis methods, Bioinformatics, 2009.
backtransformPrincipalCurve(). principal_curve.
# Simulate data from the model y <- a + bx + x^c + eps(bx)
J <- 1000
x <- rexp(J)
a <- c(2,15,3)
b <- c(2,3,4)
c <- c(1,2,1/2)
bx <- outer(b,x)
xc <- t(sapply(c, FUN=function(c) x^c))
eps <- apply(bx, MARGIN=2, FUN=function(x) rnorm(length(b), mean=0, sd=0.1*x))
y <- a + bx + xc + eps
y <- t(y)

# Fit principal curve through (y_1, y_2, y_3)
fit <- fitPrincipalCurve(y, verbose=TRUE)

# Flip direction of 'lambda'?
rho <- cor(fit$lambda, y[,1], use="complete.obs")
flip <- (rho < 0)
if (flip) {
  fit$lambda <- max(fit$lambda, na.rm=TRUE)-fit$lambda
}

# Backtransform (y_1, y_2, y_3) to be proportional to each other
yN <- backtransformPrincipalCurve(y, fit=fit)

# Same backtransformation dimension by dimension
yN2 <- y
for (cc in 1:ncol(y)) {
  yN2[,cc] <- backtransformPrincipalCurve(y, fit=fit, dimensions=cc)
}
stopifnot(identical(yN2, yN))

xlim <- c(0, 1.04*max(x))
ylim <- range(c(y,yN), na.rm=TRUE)

# Pairwise signals vs x before and after transform
layout(matrix(1:4, nrow=2, byrow=TRUE))
par(mar=c(4,4,3,2)+0.1)
for (cc in 1:3) {
  ylab <- substitute(y[c], env=list(c=cc))
  plot(NA, xlim=xlim, ylim=ylim, xlab="x", ylab=ylab)
  abline(h=a[cc], lty=3)
  mtext(side=4, at=a[cc], sprintf("a=%g", a[cc]),
        cex=0.8, las=2, line=0, adj=1.1, padj=-0.2)
  points(x, y[,cc])
  points(x, yN[,cc], col="tomato")
  legend("topleft", col=c("black", "tomato"), pch=19,
         c("original", "transformed"), bty="n")
}
title(main="Pairwise signals vs x before and after transform",
      outer=TRUE, line=-2)

# Pairwise signals before and after transform
layout(matrix(1:4, nrow=2, byrow=TRUE))
par(mar=c(4,4,3,2)+0.1)
for (rr in 3:2) {
  ylab <- substitute(y[c], env=list(c=rr))
  for (cc in 1:2) {
    if (cc == rr) {
      plot.new()
      next
    }
    xlab <- substitute(y[c], env=list(c=cc))
    plot(NA, xlim=ylim, ylim=ylim, xlab=xlab, ylab=ylab)
    abline(a=0, b=1, lty=2)
    points(y[,c(cc,rr)])
    points(yN[,c(cc,rr)], col="tomato")
    legend("topleft", col=c("black", "tomato"), pch=19,
           c("original", "transformed"), bty="n")
  }
}
title(main="Pairwise signals before and after transform",
      outer=TRUE, line=-2)
Fitting a smooth curve through paired (x,y) data.
## S3 method for class 'matrix'
fitXYCurve(X, weights=NULL, typeOfWeights=c("datapoint"),
  method=c("loess", "lowess", "spline", "robustSpline"), bandwidth=NULL,
  satSignal=2^16 - 1, ...)
X: An Nx2 matrix where the two columns hold the x and y signals.
weights: If NULL, all data points get the same weight; otherwise a numeric vector of length N with data-point weights in [0,1].
typeOfWeights: A character string specifying the type of weights given in argument weights.
method: A character string specifying the curve-fit method: "loess" (default), "lowess", "spline", or "robustSpline".
bandwidth: A double specifying the bandwidth (amount of smoothing) of the estimator.
satSignal: Signals equal to or above this threshold will not be used in the fitting.
...: Not used.
A named list structure of class XYCurve.
The estimation of the function will only be made based on complete non-saturated observations, i.e. observations that contain no NA values nor saturated values as defined by satSignal.
Each data point, that is, each row in X, which is a vector of length 2, can be assigned a weight in [0,1] specifying how much it should affect the fitting of the normalization function. Weights are given by argument weights, which should be a numeric vector of length N.
Note that the lowess and the spline method only support zero-one {0,1} weights. For such methods, all weights that are less than a half are set to zero.
For method="loess", the arguments family="symmetric", degree=1, span=3/4, and control=loess.control(trace.hat="approximate", iterations=5, surface="direct") are used.
Henrik Bengtsson
# Simulate data from the model y <- a + bx + x^c + eps(bx)
x <- rexp(1000)
a <- c(2,15)
b <- c(2,1)
c <- c(1,2)
bx <- outer(b,x)
xc <- t(sapply(c, FUN=function(c) x^c))
eps <- apply(bx, MARGIN=2, FUN=function(x) rnorm(length(x), mean=0, sd=0.1*x))
Y <- a + bx + xc + eps
Y <- t(Y)

lim <- c(0,70)
plot(Y, xlim=lim, ylim=lim)

# Fit a smooth curve through a subset of (y_1, y_2)
subset <- sample(nrow(Y), size=0.3*nrow(Y))
fit <- fitXYCurve(Y[subset,], bandwidth=0.2)
lines(fit, col="red", lwd=2)

# Backtransform (y_1, y_2) keeping y_1 unchanged
YN <- backtransformXYCurve(Y, fit=fit)
points(YN, col="blue")
abline(a=0, b=1, col="red", lwd=2)
Fits an R-dimensional hyperplane using iterative re-weighted PCA.
## S3 method for class 'matrix'
iwpca(X, w=NULL, R=1, method=c("symmetric", "bisquare", "tricube", "L1"),
  maxIter=30, acc=1e-04, reps=0.02, fit0=NULL, ...)
X: An N-times-K matrix where N is the number of observations and K the number of dimensions.
w: An N vector of weights for each row (observation) in the data matrix. If NULL, all observations get the same weight.
R: Number of principal components to fit. By default a line is fitted.
method: If a character string, it specifies the function used to calculate the weights from the residuals: "symmetric" (default), "bisquare", "tricube", or "L1". If a function, that function is used to calculate the weights from the residuals.
maxIter: Maximum number of iterations.
acc: The (Euclidean) distance between two subsequent parameter fits for which the algorithm is considered to have converged.
reps: Small value to be added to the residuals before the weights are calculated based on their inverse. This is to avoid infinite weights.
fit0: An optional initial fit.
...: Additional arguments accepted by wpca().
This method uses weighted principal component analysis (WPCA) to fit an R-dimensional hyperplane through the data, with initial internal weights all equal. At each iteration the internal weights are recalculated based on the "residuals". If method=="L1", the internal weights are 1 / sum(abs(r) + reps). This is the same as method=function(r) 1/sum(abs(r)+reps). The "residuals" are the orthogonal Euclidean distances spanned by the principal components R+1, ..., K. In each iteration, before doing WPCA, the internal weights are multiplied by the weights given by argument w, if specified.
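A small sketch of the method argument; the custom function below mirrors the "L1" equivalence stated above (the data are arbitrary):

X <- cbind(rnorm(200), rnorm(200), rnorm(200))
fit1 <- iwpca(X, method="L1")
fit2 <- iwpca(X, method=function(r) 1/sum(abs(r)+0.02))  # same as method="L1"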
Returns the fit (a list) from the last call to wpca(), with the additional elements nbrOfIterations and converged.
Henrik Bengtsson
Internally, wpca() is used for calculating the weighted PCA.
for (zzz in 0) {

# This example requires plot3d() in R.basic [http://www.braju.com/R/]
if (!require(pkgName <- "R.basic", character.only=TRUE)) break

# Simulate data from the model y <- a + bx + eps(bx)
x <- rexp(1000)
a <- c(2,15,3)
b <- c(2,3,4)
bx <- outer(b,x)
eps <- apply(bx, MARGIN=2, FUN=function(x) rnorm(length(x), mean=0, sd=0.1*x))
y <- a + bx + eps
y <- t(y)

# Add some outliers by permuting the dimensions for 1/10 of the observations
idx <- sample(1:nrow(y), size=1/10*nrow(y))
y[idx,] <- y[idx,c(2,3,1)]

# Plot the data with fitted lines at four different view points
opar <- par(mar=c(1,1,1,1)+0.1)
N <- 4
layout(matrix(1:N, nrow=2, byrow=TRUE))
theta <- seq(0,270,length.out=N)
phi <- rep(20, length.out=N)
xlim <- ylim <- zlim <- c(0,45)
persp <- list()
for (kk in seq_along(theta)) {
  # Plot the data
  persp[[kk]] <- plot3d(y, theta=theta[kk], phi=phi[kk],
                        xlim=xlim, ylim=ylim, zlim=zlim)
}

# Weights on the observations
# Example a: Equal weights
w <- NULL
# Example b: More weight on the outliers (uncomment to test)
w <- rep(1, length(x)); w[idx] <- 0.8

# ...and show all iterations too with different colors.
maxIter <- c(seq(1,20,length.out=10),Inf)
col <- topo.colors(length(maxIter))

# Show the fitted value for every iteration
for (ii in seq_along(maxIter)) {
  # Fit a line using IWPCA through data
  fit <- iwpca(y, w=w, maxIter=maxIter[ii], swapDirections=TRUE)

  ymid <- fit$xMean
  d0 <- apply(y, MARGIN=2, FUN=min) - ymid
  d1 <- apply(y, MARGIN=2, FUN=max) - ymid
  b <- fit$vt[1,]
  y0 <- -b * max(abs(d0))
  y1 <-  b * max(abs(d1))
  yline <- matrix(c(y0,y1), nrow=length(b), ncol=2)
  yline <- yline + ymid

  for (kk in seq_along(theta)) {
    # Set pane to draw in
    par(mfg=c((kk-1) %/% 2, (kk-1) %% 2) + 1)
    # Set the viewpoint of the pane
    options(persp.matrix=persp[[kk]])

    # Get the first principal component
    points3d(t(ymid), col=col[ii])
    lines3d(t(yline), col=col[ii])

    # Highlight the last one
    if (ii == length(maxIter))
      lines3d(t(yline), col="red", lwd=3)
  }
}

par(opar)

} # for (zzz in 0)
rm(zzz)
Median polish.
## S3 method for class 'matrix'
medianPolish(X, tol=0.01, maxIter=10L, na.rm=NA, ..., .addExtra=TRUE)
X: An N-times-K matrix.
tol: A positive numeric value used as the convergence criterion; iteration stops when the improvement between two iterations is below this value.
maxIter: Maximum number of iterations.
na.rm: If TRUE (FALSE), NAs are excluded (included) when calculating the medians. If NA, it is assumed that X contains no NA values, which renders a faster implementation.
.addExtra: If TRUE, additional elements such as converged are included in the returned list.
...: Not used.
The implementation of this method gives identical estimates to medpolish, but is about 3-5 times more efficient when there are no NA values.
Returns a named list structure with elements:
overall: The fitted constant term.
row: The fitted row effect.
col: The fitted column effect.
residuals: The residuals.
converged: TRUE if the algorithm converged, otherwise FALSE.
Henrik Bengtsson
# Deaths from sport parachuting; from ABC of EDA, p.224:
deaths <- matrix(c(14,15,14, 7,4,7, 8,2,10, 15,9,10, 0,2,0),
                 ncol=3, byrow=TRUE)
rownames(deaths) <- c("1-24", "25-74", "75-199", "200++", "NA")
colnames(deaths) <- 1973:1975
print(deaths)

mp <- medianPolish(deaths)
mp1 <- medpolish(deaths, trace=FALSE)
print(mp)

ff <- c("overall", "row", "col", "residuals")
stopifnot(all.equal(mp[ff], mp1[ff]))

# Validate decomposition:
stopifnot(all.equal(deaths, mp$overall+outer(mp$row,mp$col,"+")+mp$resid))
Weighted affine normalization between channels and arrays.
This method will remove curvature in the M vs A plots that is due to an affine transformation of the data. In other words, if there are (small or large) biases in the different (red or green) channels, biases that may even be equal, you will get curvature in the M vs A plots, and this type of curvature will be removed by this normalization method.
Moreover, if you normalize all slides at once, this method will also bring the signals onto the same scale, such that the log-ratios for different slides are comparable. Thus, do not normalize the scale of the log-ratios between slides afterward.
It is recommended to normalize as many slides as possible in one run. The result is that log-ratios created between any channels and any slides will contain as little curvature as possible.
Furthermore, since the relative scale between any two channels on any two slides will be one if one normalizes all slides (and channels) at once, it is possible to add to or multiply all channels/arrays by the same constant without introducing curvature. Thus, it is easy to rescale the data afterwards, as demonstrated in the example.
## S3 method for class 'matrix'
normalizeAffine(X, weights=NULL, typeOfWeights=c("datapoint"), method="L1",
  constraint=0.05, satSignal=2^16 - 1, ..., .fitOnly=FALSE)
| Argument | Description |
| --- | --- |
| X | An NxK matrix where the K columns represent the channels and the N rows the data points. |
| weights | If NULL, no weights are used; otherwise a numeric vector of length N with data-point weights in [0,1]. |
| typeOfWeights | A character string specifying the type of weights given in argument weights; only "datapoint" weights are supported. |
| method | A character string specifying how the estimates are robustified. |
| constraint | Constraint making the bias parameters identifiable. See fitIWPCA() for details. |
| satSignal | Signals equal to or above this threshold will not be used in the fitting. |
| ... | Other arguments passed to the internal fitting function. |
| .fitOnly | If TRUE, the model is fitted and returned, but the data is not normalized. |
A line is fitted robustly through the observations using an iterated re-weighted principal component analysis (IWPCA), which minimizes the residuals that are orthogonal to the fitted line. Each observation is down-weighted by the inverse of its absolute residual, that is, the fit is done in L1.
An NxK matrix of the normalized channels. The fitted model is returned as attribute modelFit.
Affine normalization applies equally well to negative values. Thus, contrary to normalization methods applied to log-ratios, such as curve-fit normalization methods, affine normalization will not set these to NA.
Data points that are saturated in one or more channels are not used to estimate the normalization function, but they are normalized.
The estimation of the affine normalization function will only be made based on complete non-saturated observations, i.e. observations that contain no NA values nor saturated values as defined by satSignal.
Each data point/observation, that is, each row in X, which is a vector of length K, can be assigned a weight in [0,1] specifying how much it should affect the fitting of the affine normalization function. Weights are given by argument weights, which should be a numeric vector of length N. Regardless of weights, all data points are normalized based on the fitted normalization function.
By default, the model fit of affine normalization is done in L1 (method="L1"). This way, outliers affect the parameter estimates less than ordinary least-squares methods do. For further robustness, downweight outliers such as saturated signals, if possible. We do not use Tukey's biweight function for reasons similar to those outlined in calibrateMultiscan().
If the channel offsets can be assumed to be known, then it is possible to fit the affine model with no (zero) offset, which formally is a linear (proportional) model, by specifying argument center=FALSE. In order to do this, the channel offsets have to be subtracted from the signals manually before normalizing, e.g. Xa <- t(t(X)-a) where a is a vector of length ncol(X). Then normalize by Xn <- normalizeAffine(Xa, center=FALSE). You can assert that the model is fitted without offset by stopifnot(all(attr(Xn, "modelFit")$adiag == 0)).
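As an illustration only (simulated signals and assumed offsets, not from a real scanner), the zero-offset workflow described above could look like:

X <- matrix(rexp(2000, rate=1/1000), ncol=2)     # simulated two-channel signals
a <- c(24.3, 50.7)                               # assumed known channel offsets
Xa <- t(t(X) - a)                                # subtract the offsets channel by channel
Xn <- normalizeAffine(Xa, center=FALSE)          # fit the linear (no-offset) model
stopifnot(all(attr(Xn, "modelFit")$adiag == 0))  # assert that no offset was fitted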
Henrik Bengtsson
[1] Henrik Bengtsson and Ola Hössjer, Methodological study of affine transformations of gene expression data with proposed robust non-parametric multi-dimensional normalization method, BMC Bioinformatics, 2006, 7:100.
pathname <- system.file("data-ex", "PMT-RGData.dat", package="aroma.light")
rg <- read.table(pathname, header=TRUE, sep="\t")
nbrOfScans <- max(rg$slide)

rg <- as.list(rg)
for (field in c("R", "G"))
  rg[[field]] <- matrix(as.double(rg[[field]]), ncol=nbrOfScans)
rg$slide <- rg$spot <- NULL
rg <- as.matrix(as.data.frame(rg))
colnames(rg) <- rep(c("R", "G"), each=nbrOfScans)

layout(matrix(c(1,2,0,3,4,0,5,6,7), ncol=3, byrow=TRUE))

rgC <- rg
for (channel in c("R", "G")) {
  sidx <- which(colnames(rg) == channel)
  channelColor <- switch(channel, R="red", G="green")

  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  # The raw data
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  plotMvsAPairs(rg[,sidx])
  title(main=paste("Observed", channel))
  box(col=channelColor)

  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  # The calibrated data
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  rgC[,sidx] <- calibrateMultiscan(rg[,sidx], average=NULL)

  plotMvsAPairs(rgC[,sidx])
  title(main=paste("Calibrated", channel))
  box(col=channelColor)
} # for (channel ...)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# The average calibrated data
#
# Note how the red signals are weaker than the green. The reason
# for this can be that the scale factor in the green channel is
# greater than in the red channel, but it can also be that there
# is a remaining relative difference in bias between the green
# and the red channel, a bias that precedes the scanning.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
rgCA <- rg
for (channel in c("R", "G")) {
  sidx <- which(colnames(rg) == channel)
  rgCA[,sidx] <- calibrateMultiscan(rg[,sidx])
}

rgCAavg <- matrix(NA_real_, nrow=nrow(rgCA), ncol=2)
colnames(rgCAavg) <- c("R", "G")
for (channel in c("R", "G")) {
  sidx <- which(colnames(rg) == channel)
  rgCAavg[,channel] <- apply(rgCA[,sidx], MARGIN=1, FUN=median, na.rm=TRUE)
}

# Add some "fake" outliers
outliers <- 1:600
rgCAavg[outliers,"G"] <- 50000

plotMvsA(rgCAavg)
title(main="Average calibrated (AC)")

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Normalize data
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Weight-down outliers when normalizing
weights <- rep(1, nrow(rgCAavg))
weights[outliers] <- 0.001

# Affine normalization of channels
rgCANa <- normalizeAffine(rgCAavg, weights=weights)
# It is always ok to rescale the affine normalized data if it is
# done on (R,G); not on (A,M)! However, this is only needed for
# esthetic purposes.
rgCANa <- rgCANa * 2^1.4
plotMvsA(rgCANa)
title(main="Normalized AC")

# Curve-fit (lowess) normalization
rgCANlw <- normalizeLowess(rgCAavg, weights=weights)
plotMvsA(rgCANlw, col="orange", add=TRUE)

# Curve-fit (loess) normalization
rgCANl <- normalizeLoess(rgCAavg, weights=weights)
plotMvsA(rgCANl, col="red", add=TRUE)

# Curve-fit (robust spline) normalization
rgCANrs <- normalizeRobustSpline(rgCAavg, weights=weights)
plotMvsA(rgCANrs, col="blue", add=TRUE)

legend(x=0, y=16, legend=c("affine", "lowess", "loess", "r. spline"), pch=19,
       col=c("black", "orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")

plotMvsMPairs(cbind(rgCANa, rgCANlw), col="orange", xlab=expression(M[affine]))
title(main="Normalized AC")
plotMvsMPairs(cbind(rgCANa, rgCANl), col="red", add=TRUE)
plotMvsMPairs(cbind(rgCANa, rgCANrs), col="blue", add=TRUE)
abline(a=0, b=1, lty=2)
legend(x=-6, y=6, legend=c("lowess", "loess", "r. spline"), pch=19,
       col=c("orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")
Rescales channel vectors to get the same average.
## S3 method for class 'matrix'
normalizeAverage(x, baseline=1, avg=stats::median, targetAvg=2200, ...)

## S3 method for class 'list'
normalizeAverage(x, baseline=1, avg=stats::median, targetAvg=2200, ...)
| Argument | Description |
| --- | --- |
| x | A numeric NxK matrix (or a list of length K of numeric vectors). |
| baseline | An integer in [1,K] specifying which channel should be the baseline. |
| avg | A function for estimating the average of one channel. |
| targetAvg | The average that each channel should have afterwards. If NULL, the average of the baseline channel is used as the target. |
| ... | Additional arguments passed to the avg function. |
Returns a normalized numeric NxK matrix (or list of length K).
Henrik Bengtsson
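This page has no dedicated example; the following minimal sketch (simulated data, assumed values) illustrates typical usage:

# Three channels with proportional signals but different scales
x <- matrix(rexp(3000, rate=1/500), ncol=3)
x[,2] <- 1.8 * x[,2]
x[,3] <- 0.6 * x[,3]

xN <- normalizeAverage(x)

# Each channel should now have an average (here, the median)
# equal to the target average (2200 by default)
print(apply(xN, MARGIN=2, FUN=median))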
Weighted curve-fit normalization between a pair of channels.
This method estimates a smooth function of the dependency between the log-ratios and the log-intensities of the two channels, and then corrects the log-ratios (only) in order to remove that dependency. The method is also known as intensity-dependent or lowess normalization.

The curve-fit methods are by nature limited to paired-channel data. There exists at least one method that tries to overcome this limitation, namely cyclic lowess [1], which applies the paired curve-fit method iteratively over all pairs of channels/arrays. Cyclic lowess is not implemented here.
We recommend that affine normalization [2] is used instead of curve-fit normalization.
## S3 method for class 'matrix'
normalizeCurveFit(X, weights=NULL, typeOfWeights=c("datapoint"),
  method=c("loess", "lowess", "spline", "robustSpline"), bandwidth=NULL,
  satSignal=2^16 - 1, ...)

## S3 method for class 'matrix'
normalizeLoess(X, ...)

## S3 method for class 'matrix'
normalizeLowess(X, ...)

## S3 method for class 'matrix'
normalizeSpline(X, ...)

## S3 method for class 'matrix'
normalizeRobustSpline(X, ...)
| Argument | Description |
| --- | --- |
| X | An Nx2 matrix where the two columns represent the two channels to be normalized. |
| weights | If NULL, no weights are used; otherwise a numeric vector of length N with data-point weights in [0,1]. |
| typeOfWeights | A character string specifying the type of weights given in argument weights; only "datapoint" weights are supported. |
| method | A character string specifying which method to use for fitting the intensity-dependent function: "loess", "lowess", "spline", or "robustSpline". |
| bandwidth | A double specifying the bandwidth of the estimator used. |
| satSignal | Signals equal to or above this threshold will not be used in the fitting. |
| ... | Not used. |
A smooth function c(A) is fitted through the data in (A,M), where M = log2(y2/y1) and A = (1/2)*log2(y2*y1). Data is normalized by M <- M - c(A).
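To make the transformation concrete, here is a small sketch (simulated data) that computes (A,M) from an Nx2 matrix and removes the intensity-dependent trend with lowess; it mirrors the idea of the method but is not the package's internal implementation:

X <- matrix(2^rnorm(2000, mean=10, sd=1), ncol=2)  # simulated two channels
M <- log2(X[,2]/X[,1])                             # log-ratios
A <- 0.5*log2(X[,2]*X[,1])                         # log-intensities
fit <- lowess(A, M)                                # smooth function c(A)
cA <- approx(fit$x, fit$y, xout=A)$y               # evaluate c(A) at each A
Mn <- M - cA                                       # normalized log-ratios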
Loess is by far the slowest method of the four, then lowess, and then robust spline, which iteratively calls the spline method.
An Nx2 matrix of the normalized two channels. The fitted model is returned as attribute modelFit.
Non-positive values are set to not-a-number (NaN).
Data points that are saturated in one or more channels are not used to estimate the normalization function, but they are normalized. The estimation of the normalization function will only be made based on complete non-saturated observations, i.e. observations that contain no NA values nor saturated values as defined by satSignal.
Each data point, that is, each row in X, which is a vector of length 2, can be assigned a weight in [0,1] specifying how much it should affect the fitting of the normalization function. Weights are given by argument weights, which should be a numeric vector of length N. Regardless of weights, all data points are normalized based on the fitted normalization function.
Note that the lowess and the spline method only support zero-one {0,1} weights. For such methods, all weights that are less than a half are set to zero.
For loess, the arguments family="symmetric", degree=1, span=3/4, and control=loess.control(trace.hat="approximate", iterations=5, surface="direct") are used.
Henrik Bengtsson
[1] M. Åstrand, Contrast Normalization of Oligonucleotide Arrays, Journal of Computational Biology, 2003, 10, 95-102.
[2] Henrik Bengtsson and Ola Hössjer, Methodological study of affine transformations of gene expression data with proposed robust non-parametric multi-dimensional normalization method, BMC Bioinformatics, 2006, 7:100.
pathname <- system.file("data-ex", "PMT-RGData.dat", package="aroma.light")
rg <- read.table(pathname, header=TRUE, sep="\t")
nbrOfScans <- max(rg$slide)

rg <- as.list(rg)
for (field in c("R", "G"))
  rg[[field]] <- matrix(as.double(rg[[field]]), ncol=nbrOfScans)
rg$slide <- rg$spot <- NULL
rg <- as.matrix(as.data.frame(rg))
colnames(rg) <- rep(c("R", "G"), each=nbrOfScans)

layout(matrix(c(1,2,0,3,4,0,5,6,7), ncol=3, byrow=TRUE))

rgC <- rg
for (channel in c("R", "G")) {
  sidx <- which(colnames(rg) == channel)
  channelColor <- switch(channel, R="red", G="green")

  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  # The raw data
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  plotMvsAPairs(rg[,sidx])
  title(main=paste("Observed", channel))
  box(col=channelColor)

  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  # The calibrated data
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  rgC[,sidx] <- calibrateMultiscan(rg[,sidx], average=NULL)

  plotMvsAPairs(rgC[,sidx])
  title(main=paste("Calibrated", channel))
  box(col=channelColor)
} # for (channel ...)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# The average calibrated data
#
# Note how the red signals are weaker than the green. The reason
# for this can be that the scale factor in the green channel is
# greater than in the red channel, but it can also be that there
# is a remaining relative difference in bias between the green
# and the red channel, a bias that precedes the scanning.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
rgCA <- rg
for (channel in c("R", "G")) {
  sidx <- which(colnames(rg) == channel)
  rgCA[,sidx] <- calibrateMultiscan(rg[,sidx])
}

rgCAavg <- matrix(NA_real_, nrow=nrow(rgCA), ncol=2)
colnames(rgCAavg) <- c("R", "G")
for (channel in c("R", "G")) {
  sidx <- which(colnames(rg) == channel)
  rgCAavg[,channel] <- apply(rgCA[,sidx], MARGIN=1, FUN=median, na.rm=TRUE)
}

# Add some "fake" outliers
outliers <- 1:600
rgCAavg[outliers,"G"] <- 50000

plotMvsA(rgCAavg)
title(main="Average calibrated (AC)")

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Normalize data
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Weight-down outliers when normalizing
weights <- rep(1, nrow(rgCAavg))
weights[outliers] <- 0.001

# Affine normalization of channels
rgCANa <- normalizeAffine(rgCAavg, weights=weights)
# It is always ok to rescale the affine normalized data if it is
# done on (R,G); not on (A,M)! However, this is only needed for
# esthetic purposes.
rgCANa <- rgCANa * 2^1.4
plotMvsA(rgCANa)
title(main="Normalized AC")

# Curve-fit (lowess) normalization
rgCANlw <- normalizeLowess(rgCAavg, weights=weights)
plotMvsA(rgCANlw, col="orange", add=TRUE)

# Curve-fit (loess) normalization
rgCANl <- normalizeLoess(rgCAavg, weights=weights)
plotMvsA(rgCANl, col="red", add=TRUE)

# Curve-fit (robust spline) normalization
rgCANrs <- normalizeRobustSpline(rgCAavg, weights=weights)
plotMvsA(rgCANrs, col="blue", add=TRUE)

legend(x=0, y=16, legend=c("affine", "lowess", "loess", "r. spline"), pch=19,
       col=c("black", "orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")

plotMvsMPairs(cbind(rgCANa, rgCANlw), col="orange", xlab=expression(M[affine]))
title(main="Normalized AC")
plotMvsMPairs(cbind(rgCANa, rgCANl), col="red", add=TRUE)
plotMvsMPairs(cbind(rgCANa, rgCANrs), col="blue", add=TRUE)
abline(a=0, b=1, lty=2)
legend(x=-6, y=6, legend=c("lowess", "loess", "r. spline"), pch=19,
       col=c("orange", "red", "blue"), ncol=2, x.intersp=0.3, bty="n")
Shifts channel vectors so that all channels get the same average.
## S3 method for class 'list'
normalizeDifferencesToAverage(x, baseline=1, FUN=median, ...)
| Argument | Description |
| --- | --- |
| x | A list of length K of numeric vectors (channels). |
| baseline | An integer in [1,K] specifying which channel should be the baseline. The baseline channel is left unchanged. |
| FUN | A function for estimating the average of one channel. |
| ... | Additional arguments passed to the FUN function. |
Returns a normalized list of length K.
Henrik Bengtsson
# Simulate three shifted tracks of different lengths with same profiles
ns <- c(A=2, B=1, C=0.25)*1000
xx <- lapply(ns, FUN=function(n) { seq(from=1, to=max(ns), length.out=n) })
zz <- mapply(seq_along(ns), ns, FUN=function(z,n) rep(z,n))
yy <- list(
  A = rnorm(ns["A"], mean=0, sd=0.5),
  B = rnorm(ns["B"], mean=5, sd=0.4),
  C = rnorm(ns["C"], mean=-5, sd=1.1)
)
yy <- lapply(yy, FUN=function(y) {
  n <- length(y)
  y[1:(n/2)] <- y[1:(n/2)] + 2
  y[1:(n/4)] <- y[1:(n/4)] - 4
  y
})

# Shift all tracks toward the first track
yyN <- normalizeDifferencesToAverage(yy, baseline=1)

# The baseline channel is not changed
stopifnot(identical(yy[[1]], yyN[[1]]))

# Get the estimated parameters
fit <- attr(yyN, "fit")

# Plot the tracks
layout(matrix(1:2, ncol=1))
x <- unlist(xx)
col <- unlist(zz)
y <- unlist(yy)
yN <- unlist(yyN)
plot(x, y, col=col, ylim=c(-10,10))
plot(x, yN, col=col, ylim=c(-10,10))
Normalizes signals for PCR fragment-length effects. Some or all signals are used to estimate the normalization function. All signals are normalized.
## Default S3 method:
normalizeFragmentLength(y, fragmentLengths, targetFcns=NULL, subsetToFit=NULL,
  onMissing=c("ignore", "median"), .isLogged=TRUE, ..., .returnFit=FALSE)
| Argument | Description |
| --- | --- |
| y | A numeric vector of length K of signals to be normalized. |
| fragmentLengths | An integer KxE matrix of fragment lengths, one column per enzyme. |
| targetFcns | An optional list of E functions, one per enzyme (see Details). |
| subsetToFit | The subset of data points used to fit the normalization function. If NULL, all data points are considered. |
| onMissing | Specifies how data points for which there is no fragment length are normalized. If "ignore", they are not normalized. If "median", they are normalized by the median correction factor of the other data points. |
| .isLogged | A logical specifying whether the signals are on the log scale. |
| ... | Additional arguments passed to the internal fitting function. |
| .returnFit | A logical specifying whether the model fit should be returned as an attribute. |
Returns a numeric vector of the normalized signals.
It is assumed that the fragment-length effects from multiple enzymes add up (with equal weights) on the intensity scale. The fragment-length effect is fitted for each enzyme separately, based on units that are exclusive to that enzyme. If there are no or very few such units for an enzyme, the assumptions of the model are not met and the fit will fail with an error. From the above single-enzyme fits, the average effect across enzymes is then calculated for each unit that is on multiple enzymes.
It is possible to specify custom target function effects for each enzyme via argument targetFcns. This argument has to be a list containing one function per enzyme, ordered the same way as the enzymes appear in the columns of argument fragmentLengths. For instance, if one wishes to normalize the signals such that their mean signal as a function of fragment length is constantly equal to 2200 (on the intensity scale), then use targetFcns=function(fl, ...) log2(2200), which completely ignores the fragment-length argument 'fl' and always returns a constant. If two enzymes are used, then use targetFcns=rep(list(function(fl, ...) log2(2200)), 2).

Note, if targetFcns is NULL, this corresponds to targetFcns=rep(list(function(fl, ...) 0), ncol(fragmentLengths)).
Alternatively, if one wants to only apply minimal corrections to the signals, then one can normalize toward target functions that correspond to the fragment-length effect of the average array.
Henrik Bengtsson
[1] H. Bengtsson, R. Irizarry, B. Carvalho, and T. Speed, Estimation and assessment of raw copy numbers at the single locus level, Bioinformatics, 2008.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Example 1: Single-enzyme fragment-length normalization of 9 arrays
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Number of samples
I <- 9

# Number of loci
J <- 1000

# Fragment lengths
fl <- seq(from=100, to=1000, length.out=J)

# Simulate data points with unknown fragment lengths
hasUnknownFL <- seq(from=1, to=J, by=50)
fl[hasUnknownFL] <- NA

# Simulate data
y <- matrix(0, nrow=J, ncol=I)
maxY <- 12
for (kk in 1:I) {
  k <- runif(n=1, min=3, max=5)
  mu <- function(fl) {
    mu <- rep(maxY, length(fl))
    ok <- !is.na(fl)
    mu[ok] <- mu[ok] - fl[ok]^{1/k}
    mu
  }
  eps <- rnorm(J, mean=0, sd=1)
  y[,kk] <- mu(fl) + eps
}

# Normalize data (to a zero baseline)
yN <- apply(y, MARGIN=2, FUN=function(y) {
  normalizeFragmentLength(y, fragmentLengths=fl, onMissing="median")
})

# The correction factors
rho <- y-yN
print(summary(rho))
# The correction for units with unknown fragment lengths
# equals the median correction factor of all other units
print(summary(rho[hasUnknownFL,]))

# Plot raw data
layout(matrix(1:9, ncol=3))
xlim <- c(0,max(fl, na.rm=TRUE))
ylim <- c(0,max(y, na.rm=TRUE))
xlab <- "Fragment length"
ylab <- expression(log2(theta))
for (kk in 1:I) {
  plot(fl, y[,kk], xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab)
  ok <- (is.finite(fl) & is.finite(y[,kk]))
  lines(lowess(fl[ok], y[ok,kk]), col="red", lwd=2)
}

# Plot normalized data
layout(matrix(1:9, ncol=3))
ylim <- c(-1,1)*max(y, na.rm=TRUE)/2
for (kk in 1:I) {
  plot(fl, yN[,kk], xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab)
  ok <- (is.finite(fl) & is.finite(y[,kk]))
  lines(lowess(fl[ok], yN[ok,kk]), col="blue", lwd=2)
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Example 2: Two-enzyme fragment-length normalization of 5 arrays
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
set.seed(0xbeef)

# Number of samples
I <- 5

# Number of loci
J <- 3000

# Fragment lengths (two enzymes)
fl <- matrix(0, nrow=J, ncol=2)
fl[,1] <- seq(from=100, to=1000, length.out=J)
fl[,2] <- seq(from=1000, to=100, length.out=J)

# Let 1/2 of the units be on both enzymes
fl[seq(from=1, to=J, by=4),1] <- NA
fl[seq(from=2, to=J, by=4),2] <- NA

# Let some have unknown fragment lengths
hasUnknownFL <- seq(from=1, to=J, by=15)
fl[hasUnknownFL,] <- NA

# Sty/Nsp mixing proportions:
rho <- rep(1, I)
rho[1] <- 1/3  # Less Sty in 1st sample
rho[3] <- 3/2  # More Sty in 3rd sample

# Simulate data
z <- array(0, dim=c(J,2,I))
maxLog2Theta <- 12
for (ii in 1:I) {
  # Common effect for both enzymes
  mu <- function(fl) {
    k <- runif(n=1, min=3, max=5)
    mu <- rep(maxLog2Theta, length(fl))
    ok <- is.finite(fl)
    mu[ok] <- mu[ok] - fl[ok]^{1/k}
    mu
  }

  # Calculate the effect for each data point
  for (ee in 1:2) {
    z[,ee,ii] <- mu(fl[,ee])
  }

  # Update the Sty/Nsp mixing proportions
  ee <- 2
  z[,ee,ii] <- rho[ii]*z[,ee,ii]

  # Add random errors
  for (ee in 1:2) {
    eps <- rnorm(J, mean=0, sd=1/sqrt(2))
    z[,ee,ii] <- z[,ee,ii] + eps
  }
}

hasFl <- is.finite(fl)
unitSets <- list(
  nsp  = which( hasFl[,1] & !hasFl[,2]),
  sty  = which(!hasFl[,1] &  hasFl[,2]),
  both = which( hasFl[,1] &  hasFl[,2]),
  none = which(!hasFl[,1] & !hasFl[,2])
)

# The observed data is a mix of two enzymes
theta <- matrix(NA_real_, nrow=J, ncol=I)

# Single-enzyme units
for (ee in 1:2) {
  uu <- unitSets[[ee]]
  theta[uu,] <- 2^z[uu,ee,]
}

# Both-enzyme units (sum on intensity scale)
uu <- unitSets$both
theta[uu,] <- (2^z[uu,1,]+2^z[uu,2,])/2

# Missing units (sample from the others)
uu <- unitSets$none
theta[uu,] <- apply(theta, MARGIN=2, sample, size=length(uu))

# Calculate target array
thetaT <- rowMeans(theta, na.rm=TRUE)
targetFcns <- list()
for (ee in 1:2) {
  uu <- unitSets[[ee]]
  fit <- lowess(fl[uu,ee], log2(thetaT[uu]))
  class(fit) <- "lowess"
  targetFcns[[ee]] <- function(fl, ...) {
    predict(fit, newdata=fl)
  }
}

# Fit model only to a subset of the data
subsetToFit <- setdiff(1:J, seq(from=1, to=J, by=10))

# Normalize data (to a target baseline)
thetaN <- matrix(NA_real_, nrow=J, ncol=I)
fits <- vector("list", I)
for (ii in 1:I) {
  lthetaNi <- normalizeFragmentLength(log2(theta[,ii]), targetFcns=targetFcns,
                fragmentLengths=fl, onMissing="median",
                subsetToFit=subsetToFit, .returnFit=TRUE)
  fits[[ii]] <- attr(lthetaNi, "modelFit")
  thetaN[,ii] <- 2^lthetaNi
}

# Plot raw data
xlim <- c(0, max(fl, na.rm=TRUE))
ylim <- c(0, max(log2(theta), na.rm=TRUE))
Mlim <- c(-1,1)*4
xlab <- "Fragment length"
ylab <- expression(log2(theta))
Mlab <- expression(M==log[2](theta/theta[R]))

layout(matrix(1:(3*I), ncol=I, byrow=TRUE))
for (ii in 1:I) {
  plot(NA, xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab, main="raw")

  # Single-enzyme units
  for (ee in 1:2) {
    # The raw data
    uu <- unitSets[[ee]]
    points(fl[uu,ee], log2(theta[uu,ii]), col=ee+1)
  }

  # Both-enzyme units (use fragment-length for enzyme #1)
  uu <- unitSets$both
  points(fl[uu,1], log2(theta[uu,ii]), col=3+1)

  for (ee in 1:2) {
    # The true effects
    uu <- unitSets[[ee]]
    lines(lowess(fl[uu,ee], log2(theta[uu,ii])), col="black", lwd=4, lty=3)

    # The estimated effects
    fit <- fits[[ii]][[ee]]$fit
    lines(fit, col="orange", lwd=3)

    muT <- targetFcns[[ee]](fl[uu,ee])
    lines(fl[uu,ee], muT, col="cyan", lwd=1)
  }
}

# Calculate log-ratios
thetaR <- rowMeans(thetaN, na.rm=TRUE)
M <- log2(thetaN/thetaR)

# Plot normalized data
for (ii in 1:I) {
  plot(NA, xlim=xlim, ylim=Mlim, xlab=xlab, ylab=Mlab, main="normalized")
  # Single-enzyme units
  for (ee in 1:2) {
    # The normalized data
    uu <- unitSets[[ee]]
    points(fl[uu,ee], M[uu,ii], col=ee+1)
  }
  # Both-enzyme units (use fragment-length for enzyme #1)
  uu <- unitSets$both
  points(fl[uu,1], M[uu,ii], col=3+1)
}

ylim <- c(0,1.5)
for (ii in 1:I) {
  data <- list()
  for (ee in 1:2) {
    # The normalized data
    uu <- unitSets[[ee]]
    data[[ee]] <- M[uu,ii]
  }
  uu <- unitSets$both
  if (length(uu) > 0)
    data[[3]] <- M[uu,ii]
  uu <- unitSets$none
  if (length(uu) > 0)
    data[[4]] <- M[uu,ii]
  cols <- seq_along(data)+1
  plotDensity(data, col=cols, xlim=Mlim, xlab=Mlab, main="normalized")
  abline(v=0, lty=2)
}
Normalizes the empirical distribution of one or more samples to a target distribution.
The average sample distribution is calculated either robustly or not by utilizing either weightedMedian() or weighted.mean(). A weighted method is used if any of the weights are different from one.
## S3 method for class 'numeric'
normalizeQuantileRank(x, xTarget, ties=FALSE, ...)

## S3 method for class 'list'
normalizeQuantileRank(X, xTarget=NULL, ...)

## Default S3 method:
normalizeQuantile(x, ...)
| Argument | Description |
| --- | --- |
| x, X | A numeric vector of length N to be normalized (x), or a list of such vectors (X). |
| xTarget | The target empirical distribution as a sorted numeric vector. For the list method, if NULL, the target distribution is calculated as the average empirical distribution of the samples. |
| ties | Should ties in x be treated with care or not? |
| ... | Not used. |
Returns an object of the same shape as the input argument.
Missing values are excluded when estimating the "common" (baseline) distribution. Values that are NA remain NA after normalization. No new NAs are introduced.
Currently only channel weights are supported due to the way quantile normalization is done. If signal weights are given, channel weights are calculated from these by taking the mean of the signal weights in each channel.
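For instance (a sketch of the stated rule, not the package's internal code), per-signal weights would be reduced to channel weights as:

W <- matrix(runif(3000), ncol=3)   # hypothetical NxK signal weights
channelWeights <- colMeans(W)      # mean signal weight per channel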
Adapted from Gordon Smyth (http://www.statsci.org/) in 2002 & 2006. Original code by Ben Bolstad at the Statistics Department, University of California.
To calculate a target distribution from a set of samples, see averageQuantile(). For an alternative empirical-density normalization method, see normalizeQuantileSpline().
# Simulate eight samples of different lengths
N <- 10000
X <- list()
for (kk in 1:8) {
  rfcn <- list(rnorm, rgamma)[[sample(2, size=1)]]
  size <- runif(1, min=0.3, max=1)
  a <- rgamma(1, shape=20, rate=10)
  b <- rgamma(1, shape=10, rate=10)
  values <- rfcn(size*N, a, b)

  # "Censor" values
  values[values < 0 | values > 8] <- NA

  X[[kk]] <- values
}

# Add 20% missing values
X <- lapply(X, FUN=function(x) {
  x[sample(length(x), size=0.20*length(x))] <- NA
  x
})

# Normalize quantiles
Xn <- normalizeQuantile(X)

# Plot the data
layout(matrix(1:2, ncol=1))
xlim <- range(X, na.rm=TRUE)
plotDensity(X, lwd=2, xlim=xlim, main="The original distributions")
plotDensity(Xn, lwd=2, xlim=xlim, main="The normalized distributions")
Normalizes the empirical distribution of a set of samples to a common target distribution.
The average sample distribution is calculated either robustly or not by utilizing either weightedMedian() or weighted.mean(). A weighted method is used if any of the weights are different from one.
## S3 method for class 'matrix'
normalizeQuantileRank(X, ties=FALSE, robust=FALSE, weights=NULL,
  typeOfWeights=c("channel", "signal"), ...)
| Argument | Description |
| --- | --- |
| X | A numerical NxK matrix with the K columns representing the channels and the N rows representing the data points. |
| robust | If TRUE, the (weighted) median is used to calculate the average sample distribution; otherwise the (weighted) mean. |
| ties | Should ties in X be treated with care or not? |
| weights | If NULL, no weights are used; otherwise a numeric vector of K channel weights or of N signal weights. |
| typeOfWeights | A character string specifying the type of weights given: "channel" or "signal". |
| ... | Not used. |
Returns an object of the same shape as the input argument.
Missing values are excluded when estimating the "common" (baseline) distribution. Values that are NA remain NA after normalization. No new NAs are introduced.
Currently only channel weights are supported due to the way quantile normalization is done. If signal weights are given, channel weights are calculated from these by taking the mean of the signal weights in each channel.
Adapted from Gordon Smyth (http://www.statsci.org/) in 2002 & 2006. Original code by Ben Bolstad at the Statistics Department, University of California. Support for calculating the average sample distribution using the (weighted) mean or median was added by Henrik Bengtsson.
See also median(), weightedMedian(), mean() and weighted.mean(), as well as normalizeQuantileSpline().
# Simulate three samples with on average 20% missing values
N <- 10000
X <- cbind(rnorm(N, mean=3, sd=1),
           rnorm(N, mean=4, sd=2),
           rgamma(N, shape=2, rate=1))
X[sample(3*N, size=0.20*3*N)] <- NA

# Normalize quantiles
Xn <- normalizeQuantile(X)

# Plot the data
layout(matrix(1:2, ncol=1))
xlim <- range(X, Xn, na.rm=TRUE)
plotDensity(X, lwd=2, xlim=xlim, main="The three original distributions")
plotDensity(Xn, lwd=2, xlim=xlim, main="The three normalized distributions")
Normalizes the empirical distribution of one or more samples to a target distribution. After normalization, all samples have the same average empirical density distribution.
## S3 method for class 'numeric'
normalizeQuantileSpline(x, w=NULL, xTarget, sortTarget=TRUE, robust=TRUE, ...)

## S3 method for class 'matrix'
normalizeQuantileSpline(X, w=NULL, xTarget=NULL, sortTarget=TRUE, robust=TRUE, ...)

## S3 method for class 'list'
normalizeQuantileSpline(X, w=NULL, xTarget=NULL, sortTarget=TRUE, robust=TRUE, ...)
| Argument | Description |
| --- | --- |
| x, X | A single numeric vector of length N (x), or an NxK matrix or a list of K numeric vectors (X), to be normalized. |
| w | An optional numeric vector of length N of weights. |
| xTarget | The target empirical distribution as a sorted numeric vector. For the matrix and list methods, if NULL, the average empirical distribution is used as the target. |
| sortTarget | If TRUE, argument xTarget is sorted before being used. |
| robust | If TRUE, the normalization function is fitted robustly by robustSmoothSpline(); otherwise by smooth.spline(). |
| ... | Arguments passed to (robust)smooth.spline(). |
Returns an object of the same type and dimensions as the input.
Both argument X and xTarget may contain non-finite values. These values do not affect the estimation of the normalization function. Missing values and other non-finite values in X remain in the output as is. No new missing values are introduced.
Henrik Bengtsson
[1] H. Bengtsson, R. Irizarry, B. Carvalho, and T. Speed, Estimation and assessment of raw copy numbers at the single locus level, Bioinformatics, 2008.
The target distribution can be calculated as the average using averageQuantile(). Internally, either robustSmoothSpline() (robust=TRUE) or smooth.spline() (robust=FALSE) is used.
An alternative normalization method that also normalizes the empirical densities of samples is normalizeQuantileRank(). Contrary to this method, that method requires that all samples are based on the exact same set of data points, and it is also more likely to over-correct in the tails of the distributions.
# Simulate three samples with on average 20% missing values
N <- 10000
X <- cbind(rnorm(N, mean=3, sd=1),
           rnorm(N, mean=4, sd=2),
           rgamma(N, shape=2, rate=1))
X[sample(3*N, size=0.20*3*N)] <- NA

# Plot the data
layout(matrix(c(1,0,2:5), ncol=2, byrow=TRUE))
xlim <- range(X, na.rm=TRUE)
plotDensity(X, lwd=2, xlim=xlim, main="The three original distributions")

Xn <- normalizeQuantile(X)
plotDensity(Xn, lwd=2, xlim=xlim, main="The three normalized distributions")
plotXYCurve(X, Xn, xlim=xlim, main="The three normalized distributions")

Xn2 <- normalizeQuantileSpline(X, xTarget=Xn[,1], spar=0.99)
plotDensity(Xn2, lwd=2, xlim=xlim, main="The three normalized distributions")
plotXYCurve(X, Xn2, xlim=xlim, main="The three normalized distributions")
TumorBoost [1] is a normalization method that normalizes the allele B fractions of a tumor sample given the allele B fractions and genotypes of a matched normal. The method is a single-sample (single-pair) method. It does not require total copy-number estimates. The normalization is done such that the total copy number is unchanged afterwards.
## S3 method for class 'numeric'
normalizeTumorBoost(betaT, betaN, muN=callNaiveGenotypes(betaN),
  preserveScale=FALSE, flavor=c("v4", "v3", "v2", "v1"), ...)
| Argument | Description |
| --- | --- |
| betaT, betaN | Two numeric vectors of length J with the tumor and the matched normal allele B fractions. |
| muN | An optional numeric vector of length J with normal genotype calls in {0, 1/2, 1}; by default called naively from betaN. |
| preserveScale | If TRUE, heterozygous SNPs are corrected for signal compression; see below. |
| flavor | A character string specifying which flavor of the method to use; see below. |
| ... | Not used. |
Allele B fractions are defined as the ratio between the allele B signal and the sum of both (all) allele signals at the same locus. Allele B fractions are typically within [0,1], but may have a slightly wider support due to, for instance, negative noise. This is typically also the case for the returned normalized allele B fractions.
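In symbols (with simulated allele signals, for illustration only):

thetaA <- rexp(1000, rate=1/1000)   # allele A signals (simulated)
thetaB <- rexp(1000, rate=1/1000)   # allele B signals (simulated)
beta <- thetaB / (thetaA + thetaB)  # allele B fraction at each locus
print(range(beta))                  # here within [0,1]; noise may widen this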
Returns a numeric vector of length J containing the normalized allele B fractions for the tumor. Attribute modelFit is a list containing model fit parameters.
This method provides a few different "flavors" for normalizing the data. The following values of argument flavor are accepted:
v4: (default) The TumorBoost method, i.e. Eqns. (8)-(9) in [1].
v3: Eqn (9) in [1] is applied to both heterozygous and homozygous SNPs, which effectively is v4 where the normalized allele B fractions for homozygous SNPs becomes 0 and 1.
v2: ...
v1: TumorBoost where the correction factor is forced to one (η = 1). As explained in [1], this is a suboptimal normalization method. See also the discussion in the paragraph following Eqn (12) in [1].
As of aroma.light v1.33.3 (March 30, 2014), argument preserveScale no longer has a default value and has to be specified explicitly. This is done in order to change the default to FALSE in a future version, while minimizing the risk for surprises.
Allele B fractions are more or less compressed toward a half, e.g. the signals for homozygous SNPs are slightly away from zero and one. The TumorBoost method decreases the correlation in allele B fractions between the tumor and the normal conditioned on the genotype. What it does not control for is the mean level of the allele B fraction conditioned on the genotype.
By design, most flavors of the method will correct the homozygous SNPs such that their mean levels get close to the expected zero and one levels. However, the heterozygous SNPs will typically keep the same mean levels as before. One possibility is to adjust the signals such that the mean levels of the heterozygous SNPs relative to those of the homozygous SNPs are the same after as before the normalization.
If argument preserveScale=TRUE, then SNPs that are heterozygous (in the matched normal) are corrected for signal compression using an estimate of signal compression based on the amount of correction performed by TumorBoost on SNPs that are homozygous (in the matched normal). The option of preserving the scale is not discussed in the TumorBoost paper [1], which presents the preserveScale=FALSE version.
Henrik Bengtsson, Pierre Neuvial
[1] H. Bengtsson, P. Neuvial and T.P. Speed, TumorBoost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays, BMC Bioinformatics, 2010, 11:245. [PMID 20462408]
library(R.utils)

# Load data
pathname <- system.file("data-ex/TumorBoost,fracB,exampleData.Rbin", package="aroma.light")
data <- loadObject(pathname)
attachLocally(data)
pos <- position/1e6
muN <- genotypeN

layout(matrix(1:4, ncol=1))
par(mar=c(2.5,4,0.5,1)+0.1)
ylim <- c(-0.05, 1.05)
col <- rep("#999999", length(muN))
col[muN == 1/2] <- "#000000"

# Allele B fractions for the normal sample
plot(pos, betaN, col=col, ylim=ylim)

# Allele B fractions for the tumor sample
plot(pos, betaT, col=col, ylim=ylim)

# TumorBoost w/ naive genotype calls
betaTN <- normalizeTumorBoost(betaT=betaT, betaN=betaN, preserveScale=FALSE)
plot(pos, betaTN, col=col, ylim=ylim)

# TumorBoost w/ external multi-sample genotype calls
betaTNx <- normalizeTumorBoost(betaT=betaT, betaN=betaN, muN=muN, preserveScale=FALSE)
plot(pos, betaTNx, col=col, ylim=ylim)
Calculates tumor-normal paired allele-specific copy numbers stratified on genotypes. The method is a single-sample (single-pair) method. It requires paired tumor-normal parent-specific copy number signals.
## S3 method for class 'numeric'
pairedAlleleSpecificCopyNumbers(thetaT, betaT, thetaN, betaN,
  muN=callNaiveGenotypes(betaN), ...)
| Argument | Description |
| --- | --- |
| thetaT, betaT | Total and allele-B fraction signals for the tumor. |
| thetaN, betaN | Total and allele-B fraction signals for the matched normal. |
| muN | An optional numeric vector of normal genotype calls in {0, 1/2, 1}; by default called naively from betaN. |
| ... | Not used. |
Returns a data.frame with elements CT, betaT and muN.
Pierre Neuvial, Henrik Bengtsson
This definition of calculating tumor-normal paired ASCNs is related to how the normalizeTumorBoost() method calculates normalized tumor BAFs.
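This page has no dedicated example; a minimal usage sketch on simulated tumor-normal signals (all values assumed) could be:

J <- 1000
thetaN <- rexp(J, rate=1/2000)                     # normal total signals
muN <- sample(c(0, 1/2, 1), size=J, replace=TRUE)  # normal genotypes
betaN <- muN + rnorm(J, sd=0.05)                   # normal allele B fractions
thetaT <- 1.5 * thetaN                             # tumor total signals
betaT <- betaN + rnorm(J, sd=0.05)                 # tumor allele B fractions

fit <- pairedAlleleSpecificCopyNumbers(thetaT, betaT, thetaN, betaN, muN=muN)
str(fit)  # data.frame with elements 'CT', 'betaT' and 'muN'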
Plots density distributions for a set of vectors.
## S3 method for class 'data.frame'
plotDensity(X, ..., xlab=NULL)

## S3 method for class 'matrix'
plotDensity(X, ..., xlab=NULL)

## S3 method for class 'numeric'
plotDensity(X, ..., xlab=NULL)

## S3 method for class 'list'
plotDensity(X, W=NULL, xlim=NULL, ylim=NULL, xlab=NULL,
  ylab="density (integrates to one)", col=1:length(X), lty=NULL, lwd=NULL,
  ..., add=FALSE)
| Argument | Description |
| --- | --- |
| X | A single numeric vector, a matrix or data.frame with K columns, or a list of K numeric vectors. |
| W | (optional) Weights of similar data types and dimensions as X. |
| xlim, ylim | The x and y plotting limits. |
| xlab, ylab | The x and y axis labels. |
| col | The color(s) of the curves. |
| lty | The line types of the curves. |
| lwd | The line widths of the curves. |
| ... | Additional arguments passed to the underlying plot functions. |
| add | If TRUE, the curves are added to the current plot; otherwise a new plot is created. |
Henrik Bengtsson
Internally, density() is used to estimate the empirical density.
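A minimal usage sketch (simulated data):

X <- list(a=rnorm(10000, mean=0, sd=1),
          b=rnorm(10000, mean=1, sd=2),
          c=rgamma(10000, shape=2, rate=1))
plotDensity(X, lwd=2, main="Three empirical densities")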
Plot log-ratios vs log-intensities.
## S3 method for class 'matrix'
plotMvsA(X, Alab="A", Mlab="M", Alim=c(0, 16),
  Mlim=c(-1, 1) * diff(Alim) * aspectRatio, aspectRatio=1, pch=".",
  ..., add=FALSE)
| Argument | Description |
| --- | --- |
| X | An Nx2 matrix with the two channels in the columns and the N data points in the rows. |
| Alab, Mlab | Labels on the x and y axes. |
| Alim, Mlim | Plot ranges on the A and M axes. |
| aspectRatio | Aspect ratio between Mlim and Alim. |
| pch | Plot symbol used. |
| ... | Additional arguments accepted by points(). |
| add | If TRUE, data points are plotted in the current plot; otherwise a new plot is created. |
The red channel is assumed to be in column one and the green channel in column two. Log-ratios are calculated taking channel one over channel two.
Returns nothing.
Henrik Bengtsson
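A minimal usage sketch (simulated red/green signals):

R <- rexp(10000, rate=1/1000)    # red channel in column one
G <- R * 2^rnorm(10000, sd=0.2)  # green channel in column two
plotMvsA(cbind(R, G))            # M = log2(R/G) versus A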
Plot log-ratios/log-intensities for all unique pairs of data vectors.
## S3 method for class 'matrix'
plotMvsAPairs(X, Alab="A", Mlab="M", Alim=c(0, 16), Mlim=c(-1, 1) * diff(Alim),
  pch=".", ..., add=FALSE)
| Argument | Description |
| --- | --- |
| X | An NxK matrix with the K channels in the columns and the N data points in the rows. |
| Alab, Mlab | Labels on the x and y axes. |
| Alim, Mlim | Plot ranges on the A and M axes. |
| pch | Plot symbol used. |
| ... | Additional arguments accepted by points(). |
| add | If TRUE, data points are plotted in the current plot; otherwise a new plot is created. |
Log-ratios and log-intensities are calculated for each neighboring pair of channels (columns) and plotted. Thus, in total K-1 data sets will be plotted.
The colors used for the plotted pairs are 1, 2, and so on. To change the colors, use a different color palette.
Returns nothing.
Henrik Bengtsson
Plot log-ratios vs log-ratios for all pairs of columns.
## S3 method for class 'matrix'
plotMvsMPairs(X, xlab="M", ylab="M", xlim=c(-1, 1) * 6, ylim=xlim, pch=".",
  ..., add=FALSE)
| Argument | Description |
| --- | --- |
| X | An Nx2K matrix with the 2K channels in the columns and the N data points in the rows. |
| xlab, ylab | Labels on the x and y axes. |
| xlim, ylim | Plot ranges on the x and y axes. |
| pch | Plot symbol used. |
| ... | Additional arguments accepted by points(). |
| add | If TRUE, data points are plotted in the current plot; otherwise a new plot is created. |
Log-ratios are calculated over paired columns, e.g. column 1 and 2, column 3 and 4, and so on.
Returns nothing.
Henrik Bengtsson
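A minimal sketch with 2K = 4 columns, i.e. two two-channel data sets:

# Columns 1-2 form the first data set and columns 3-4 the second
X <- 2^(8 + matrix(rnorm(4000), ncol=4))
plotMvsMPairs(X)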
Plot the relationship between two variables as a smooth curve.
## S3 method for class 'numeric'
plotXYCurve(x, y, col=1L, lwd=2, dlwd=1, dcol=NA, xlim=NULL,
  ylim=xlim, xlab=NULL, ylab=NULL, curveFit=smooth.spline, ...,
  add=FALSE)

## S3 method for class 'matrix'
plotXYCurve(X, Y, col=seq_len(nrow(X)), lwd=2, dlwd=1, dcol=NA,
  xlim=NULL, ylim=xlim, xlab=NULL, ylab=NULL,
  curveFit=smooth.spline, ..., add=FALSE)
x, y, X, Y: Two numeric vectors of length N, or two NxK matrices, with the observations of the two variables.

col: The color of each curve. Either a scalar specifying the same value for all curves, or a vector of curve-specific values.

lwd: The line width of each curve. Either a scalar specifying the same value for all curves, or a vector of curve-specific values.

dlwd: The line width of each density curve.

dcol: The fill color of the interior of each density curve.

xlim, ylim: The x and y plotting limits.

xlab, ylab: The x and y labels.

curveFit: The function used to fit each curve, e.g. smooth.spline.

...: Additional arguments passed to the curve-fit function curveFit.

add: If TRUE, the curve is added to the current plot, otherwise a new plot is created.
Returns nothing.
Data points (x,y) with non-finite values are excluded.
Henrik Bengtsson
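A minimal sketch on simulated data, summarizing a noisy scatter as one smooth curve:

x <- rexp(1000)
y <- x + rnorm(length(x), sd=0.1)
plotXYCurve(x, y, col="blue")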
Fits a smoothing spline robustly using the L1 norm. Currently, the algorithm is an iterative reweighted smoothing-spline algorithm that calls smooth.spline(x, y, w, ...) at each iteration, with the weights w equal to the inverse of the absolute values of the residuals from the previous iteration step.
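The reweighting scheme can be illustrated with a stand-alone sketch on simulated data; this mirrors the idea described above rather than the package's exact implementation:

# Iterative reweighting: refit a smoothing spline with weights
# equal to the inverse absolute residuals of the previous fit
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd=0.2)
w <- rep(1, length(x))               # start with equal weights
for (iter in 1:5) {
  fit <- smooth.spline(x, y, w=w)
  r <- y - predict(fit, x)$y         # residuals of the current fit
  w <- 1 / (abs(r) + 1e-15)          # 'reps' guards against division by zero
}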
## Default S3 method:
robustSmoothSpline(x, y=NULL, w=NULL, ..., minIter=3,
  maxIter=max(minIter, 50), method=c("L1", "symmetric"),
  sdCriteria=2e-04, reps=1e-15, tol=1e-06 * IQR(x),
  plotCurves=FALSE)
x: A vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.

y: Responses. If y is missing, the responses are assumed to be specified by x.

w: A vector of weights of the same length as x. Defaults to equal weights for all values.

...: Other arguments passed to smooth.spline().

minIter: The minimum number of iterations used to fit the smoothing spline robustly. Default value is 3.

maxIter: The maximum number of iterations used to fit the smoothing spline robustly. Default value is max(minIter, 50).

method: The method used to compute the robustness weights at each iteration. Default value is "L1".

sdCriteria: Convergence criterion: the difference between the standard deviations of the residuals in two consecutive iteration steps. Default value is 2e-4.

reps: Small positive number added to the residuals to avoid division by zero when calculating the new weights for the next iteration.

tol: Passed to smooth.spline().

plotCurves: If TRUE, the fitted spline from each iteration is plotted, otherwise not.
Returns an object of class smooth.spline.
Henrik Bengtsson
The implementation of this function was adapted from smooth.spline() of the stats package. Because of this, this function is also licensed under GPL v2.
data(cars)
attach(cars)
plot(speed, dist, main = "data(cars) & robust smoothing splines")

# Fit a smoothing spline using L_2 norm
cars.spl <- smooth.spline(speed, dist)
lines(cars.spl, col = "blue")

# Fit a smoothing spline using L_1 norm
cars.rspl <- robustSmoothSpline(speed, dist)
lines(cars.rspl, col = "red")

# Fit a smoothing spline using L_2 norm with 10 degrees of freedom
lines(smooth.spline(speed, dist, df=10), lty=2, col = "blue")

# Fit a smoothing spline using L_1 norm with 10 degrees of freedom
lines(robustSmoothSpline(speed, dist, df=10), lty=2, col = "red")

legend(5, 120, c(
  paste("smooth.spline [C.V.] => df =", round(cars.spl$df, 1)),
  paste("robustSmoothSpline [C.V.] => df =", round(cars.rspl$df, 1)),
  "standard with s( * , df = 10)",
  "robust with s( * , df = 10)"
), col = c("blue", "red", "blue", "red"), lty = c(1, 1, 2, 2),
  bg = "bisque")
Calculates the correlation for random pairs of observations.
## S3 method for class 'matrix'
sampleCorrelations(X, MARGIN=1, pairs=NULL,
  npairs=max(5000, nrow(X)), ...)
X: An NxK matrix where N >= 2 and K >= 2.

MARGIN: The dimension (1 or 2) in which the observations are. If MARGIN is 1 (2), the observations are assumed to be in the rows (columns).

pairs: If an Lx2 matrix, the correlations are calculated for these L index pairs of observations. If NULL, random pairs of observations are sampled.

npairs: The number of correlations to calculate.

...: Not used.
Returns a double vector of length npairs.
Henrik Bengtsson
[1] A. Ploner, L. Miller, P. Hall, J. Bergh & Y. Pawitan. Correlation test to assess low-level processing of high-density oligonucleotide microarray data. BMC Bioinformatics, 2005, vol 6.
See also sample().
# Simulate 20000 genes with 10 observations each
X <- matrix(rnorm(n=20000), ncol=10)

# Calculate the correlation for 5000 random gene pairs
cor <- sampleCorrelations(X, npairs=5000)
print(summary(cor))
Sample tuples of elements from a set. By default, the elements within a sampled tuple are unique, i.e. no two elements are the same; pass replace=TRUE to allow repeated elements.
## Default S3 method:
sampleTuples(x, size, length, ...)
x: A set of elements to sample from.

size: The number of tuples to sample.

length: The length of each tuple.

...: Additional arguments passed to sample().
Returns an NxK matrix where N == size and K == length.
Henrik Bengtsson
See also sample().
pairs <- sampleTuples(1:10, size=5, length=2)
print(pairs)

triples <- sampleTuples(1:10, size=5, length=3)
print(triples)

# Allow tuples with repeated elements
quadruples <- sampleTuples(1:3, size=5, length=4, replace=TRUE)
print(quadruples)
Calculates the (weighted) principal components of a matrix, that is, finds a new coordinate system (not unique) for representing the given multivariate data such that i) all dimensions are orthogonal to each other, and ii) all dimensions have maximal variances.
## S3 method for class 'matrix'
wpca(x, w=NULL, center=TRUE, scale=FALSE,
  method=c("dgesdd", "dgesvd"), swapDirections=FALSE, ...)
x: An NxK matrix with N observations of K variables.

w: An N vector of weights, one per observation (row). If NULL, all observations get equal weight.

center: If TRUE, the (weighted) column means are subtracted from each column before the decomposition.

scale: If TRUE, each (centered) column is rescaled to unit (weighted) variance first.

method: The LAPACK routine used for the singular value decomposition: "dgesdd" (divide-and-conquer; the default) or "dgesvd".

swapDirections: If TRUE, the signs of the principal directions are swapped.

...: Not used.
Returns a list with elements:

pc: An NxK matrix where the columns are the principal components.

d: A vector of length K with the corresponding singular values.

vt: A KxK matrix with the principal directions as rows (the transposed rotation matrix).

xMean: The center coordinate.

It holds that x == t(t(fit$pc %*% fit$vt) + fit$xMean).
A singular value decomposition (SVD) of the (centered, and possibly scaled) data matrix X is carried out: X = U D V', where U and V are orthogonal and D is a diagonal matrix with the singular values. The principal components returned by this method are U D, and the principal directions are the rows of V'.
Internally, La.svd() (or svd()) of the base package is used.
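The reconstruction identity stated above can be verified directly on simulated data:

# Verify that pc %*% vt plus the center reconstructs the input matrix
x <- matrix(rnorm(500), ncol=5)
fit <- wpca(x)
stopifnot(isTRUE(all.equal(x, t(t(fit$pc %*% fit$vt) + fit$xMean))))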
For a popular and well-written introduction to SVD, see for instance [2].
Henrik Bengtsson
[1] J. Demmel and J. Dongarra, DOE2000 Progress Report, 2004.
https://people.eecs.berkeley.edu/~demmel/DOE2000/Report0100.html
[2] Todd Will, Introduction to the Singular Value Decomposition,
UW-La Crosse, 2004. http://websites.uwlax.edu/twill/svd/
For an iterative re-weighted PCA method, see iwpca(). For singular value decomposition, see svd().
For other implementations of principal component analysis, see (if installed) prcomp() in package stats and pca() in package pcurve.
for (zzz in 0) {

# This example requires plot3d() in R.basic [http://www.braju.com/R/]
if (!require(pkgName <- "R.basic", character.only=TRUE)) break

# -------------------------------------------------------------
# A first example
# -------------------------------------------------------------
# Simulate data from the model y <- a + bx + eps(bx)
x <- rexp(1000)
a <- c(2,15,3)
b <- c(2,3,15)
bx <- outer(b,x)
eps <- apply(bx, MARGIN=2, FUN=function(x) rnorm(length(x), mean=0, sd=0.1*x))
y <- a + bx + eps
y <- t(y)

# Add some outliers by permuting the dimensions for 1/3 of the observations
idx <- sample(1:nrow(y), size=1/3*nrow(y))
y[idx,] <- y[idx,c(2,3,1)]

# Down-weight the outliers W times to demonstrate how weights are used
W <- 10

# Plot the data with fitted lines at four different view points
N <- 4
theta <- seq(0,180,length.out=N)
phi <- rep(30, length.out=N)

# Use a different color for each set of weights
col <- topo.colors(W)

opar <- par(mar=c(1,1,1,1)+0.1)
layout(matrix(1:N, nrow=2, byrow=TRUE))
for (kk in seq_along(theta)) {
  # Plot the data
  plot3d(y, theta=theta[kk], phi=phi[kk])

  # First, same weights for all observations
  w <- rep(1, length=nrow(y))

  for (ww in 1:W) {
    # Fit a line using IWPCA through the data
    fit <- wpca(y, w=w, swapDirections=TRUE)

    # Get the first principal component
    ymid <- fit$xMean
    d0 <- apply(y, MARGIN=2, FUN=min) - ymid
    d1 <- apply(y, MARGIN=2, FUN=max) - ymid
    b <- fit$vt[1,]
    y0 <- -b * max(abs(d0))
    y1 <-  b * max(abs(d1))
    yline <- matrix(c(y0,y1), nrow=length(b), ncol=2)
    yline <- yline + ymid

    points3d(t(ymid), col=col)
    lines3d(t(yline), col=col)

    # Down-weight outliers only, because here we know which they are
    w[idx] <- w[idx]/2
  }

  # Highlight the last fit
  lines3d(t(yline), col="red", lwd=3)
}
par(opar)

} # for (zzz in 0)
rm(zzz)

if (dev.cur() > 1) dev.off()

# -------------------------------------------------------------
# A second example
# -------------------------------------------------------------
# Data
x <- c(1,2,3,4,5)
y <- c(2,4,3,3,6)

opar <- par(bty="L")
opalette <- palette(c("blue", "red", "black"))
xlim <- ylim <- c(0,6)

# Plot the data and the center of mass
plot(x,y, pch=16, cex=1.5, xlim=xlim, ylim=ylim)
points(mean(x), mean(y), cex=2, lwd=2, col="blue")

# Linear regression y ~ x
fit <- lm(y ~ x)
abline(fit, lty=1, col=1)

# Linear regression y ~ x through the origin (without intercept)
fit <- lm(y ~ x - 1)
abline(fit, lty=2, col=1)

# Linear regression x ~ y
fit <- lm(x ~ y)
c <- coefficients(fit)
b <- 1/c[2]
a <- -b*c[1]
abline(a=a, b=b, lty=1, col=2)

# Linear regression x ~ y through the origin (without intercept)
fit <- lm(x ~ y - 1)
b <- 1/coefficients(fit)
abline(a=0, b=b, lty=2, col=2)

# Orthogonal linear "regression"
fit <- wpca(cbind(x,y))
b <- fit$vt[1,2]/fit$vt[1,1]
a <- fit$xMean[2]-b*fit$xMean[1]
abline(a=a, b=b, lwd=2, col=3)

# Orthogonal linear "regression" without intercept
fit <- wpca(cbind(x,y), center=FALSE)
b <- fit$vt[1,2]/fit$vt[1,1]
a <- fit$xMean[2]-b*fit$xMean[1]
abline(a=a, b=b, lty=2, lwd=2, col=3)

legend(xlim[1], ylim[2],
  legend=c("lm(y~x)", "lm(y~x-1)", "lm(x~y)", "lm(x~y-1)",
           "pca", "pca w/o intercept"),
  lty=rep(1:2,3), lwd=rep(c(1,1,2),each=2), col=rep(1:3,each=2))

palette(opalette)
par(opar)