Title: | Supervised Normalization of Microarrays |
---|---|
Description: | SNM is a modeling strategy especially designed for normalizing high-throughput genomic data. The underlying premise of our approach is that your data is a function of what we refer to as study-specific variables. These variables are either biological variables that represent the target of the statistical analysis, or adjustment variables that represent factors arising from the experimental or biological setting the data is drawn from. The SNM approach aims to simultaneously model all study-specific variables in order to more accurately characterize the biological or clinical variables of interest. |
Authors: | Brig Mecham and John D. Storey <[email protected]> |
Maintainer: | John D. Storey <[email protected]> |
License: | LGPL |
Version: | 1.55.0 |
Built: | 2024-12-14 03:51:18 UTC |
Source: | https://github.com/bioc/snm |
Simulates data used in snm function examples.
sim.doubleChannel(seed)
sim.doubleChannel(seed)
seed |
Numeric value used to seed random number generator. |
Simulated data set influenced by a probe-specific biological variable, a probe-specific adjustment variables, and intensity-dependent array and dye effects. Data were simulated for a total of 25,000 probes and 20 arrays. The biological variable is a dichotomous variable specifying two groups (Group 1 and Group 2), with 10 channels per group. The continuous probe-specific adjustment variable is sampled from a Normal(1,0.1) and mimics an age effect. The baseline probe intensities were sampled from a chi(1,2) distribution. Any baseline intensities greater than 15 were set to a random value from the interval [15,16]. The random variation terms were sampled from a Normal(0,0.25) and the array and dye functions were defined by randomly sampling coefficients for a two-dimensional B-spline basis functions from a Normal(0,1).
Randomly selected subsets of 30% and 20% of the probes were defined as influenced by the biological groups and age variables, respectfully. The magnitude of the biological effects were sampled from a Normal(1,0.3) distribution, and the probe-specific age effects from a Normal(1,0.1). An instance of this simulated data can be generated using the code in the examples section below.
raw.data |
a 25,000 by 50 matrix of simulated data generated according to the description above. |
true.nulls |
a vector of indices corresponding to the rows in raw.data of the probes unaffected by the biological variable of interest |
bio.var |
a model matrix of the biological variable of interest. |
adj.var |
a model matrix of the adjustment variables |
int.var |
a data frame of the intensity-dependent adjustment variables |
Brig Mecham
snm
, sim.singleChannel
, sim.preProcessed
, sim.refDesign
## Not run: doubleChannel <- sim.doubleChannel(12346) snm.obj <- snm(doubleChannel$raw.data, doubleChannel$bio.var, doubleChannel$adj.var, doubleChannel$int.var) ks.test(snm.obj$pval[doubleChannel$true.nulls],"punif") ## End(Not run)
## Not run: doubleChannel <- sim.doubleChannel(12346) snm.obj <- snm(doubleChannel$raw.data, doubleChannel$bio.var, doubleChannel$adj.var, doubleChannel$int.var) ks.test(snm.obj$pval[doubleChannel$true.nulls],"punif") ## End(Not run)
Simulated data set influenced by a single probe-specific biological and two probe-specific adjustment variables. This parameters for this data are identical to single channel simulated data available as sim.singleChannel(seed) with the difference that this example does not include the intensity-dependent effects. Consult the corresponding help file for details on this simulation.
sim.preProcessed(seed)
sim.preProcessed(seed)
seed |
Numeric value used to seed random number generator. |
raw.data |
a 25,000 by 50 matrix of simulated data generated according to the description above. |
true.nulls |
a vector of indices corresponding to the rows in raw.data of the probes unaffected by the biological variable of interest |
bio.var |
a model matrix of the biological variable of interest. |
adj.var |
a model matrix of the adjustment variables |
int.var |
set to NULL |
Brig Mecham
snm
, sim.doubleChannel
, sim.singleChannel
, sim.refDesign
preProcessed <- sim.preProcessed(12347) snm.obj <- snm(preProcessed$raw.data, preProcessed$bio.var, preProcessed$adj.var, rm.adj=TRUE) ks.test(snm.obj$pval[preProcessed$true.nulls],"punif")
preProcessed <- sim.preProcessed(12347) snm.obj <- snm(preProcessed$raw.data, preProcessed$bio.var, preProcessed$adj.var, rm.adj=TRUE) ks.test(snm.obj$pval[preProcessed$true.nulls],"punif")
Simulates a two channel experiment with a reference design. Used as an example for snm function call.
sim.refDesign(seed)
sim.refDesign(seed)
seed |
Numeric value used to seed random number generator. |
Simulated data set influenced by a probe-specific biological variable, and intensity-dependent array and dye effects. Data were simulated assuming a uniform reference design for a total of 25,000 probes and 20 arrays, each consisting of two channels. The reference channel consists of a single reference RNA population. The experimental channel measures a dichotomous biological variable specifying two groups (Group 1 and Group 2), with 10 samples per group. The baseline probe intensities were sampled from a chi(1,2) distribution. Any baseline intensities greater than 15 were set to a random value from the interval [15,16]. The random variation terms were sampled from a Normal(0,0.25) and the array and dye functions were defined by randomly sampling coefficients for a two-dimensional B-spline basis functions from a Normal(0,1).
A randomly selected subset of 30% of the probes was defined as influenced by the biological group variable. The magnitude of the biological effects were sampled from a Normal(1,0.3) distribution. An instance of this simulated data can be generated using the code in the examples section below.
raw.data |
a 25,000 by 50 matrix of simulated data generated according to the description above. |
true.nulls |
a vector of indices corresponding to the rows in raw.data of the probes unaffected by the biological variable of interest |
bio.var |
a model matrix of the biological variable of interest. |
adj.var |
a model matrix of the probe-specific adjustment variables. In this case set to NULL. |
int.var |
a data frame of the intensity-dependent adjustment variables |
Brig Mecham and John D. Storey <[email protected]>
snm
, sim.singleChannel
, sim.doubleChannel
, sim.preProcessed
# reference design on channel level data # reference channel is included in bio.var refChannel <- sim.refDesign(12347) snm.obj <- snm(refChannel$raw.data,refChannel$bio.var, refChannel$adj.var, refChannel$int.var) ks.test(snm.obj$pval[refChannel$true.nulls],"punif") # this is a different model formulation # where reference channel is not included # in bio.var bio.var2 = matrix(c(rep(-1,10), rep(1, 10), rep(0,20)), ncol=1, dimnames=list(NULL, "treatment")) adj.var2 = matrix(c(rep(1,40), rep(1,20), rep(0,20)), ncol=2, byrow=FALSE, dimnames=list(NULL, c("intercept", "target1ref0"))) # compare bio.var2 to refChannel$bio.var and adj.var2 to refChannel$adj.var snm.obj <- snm(refChannel$raw.data, bio.var2, adj.var2, refChannel$int.var) ks.test(snm.obj$pval[refChannel$true.nulls],"punif") # reference design on log ratio data # that is, log(target/ref) data refChannel <- sim.refDesign(12347) refChannel$raw.data = refChannel$raw.data[,1:20]-refChannel$raw.data[,21:40] # removing reference channel variables refChannel$bio.var = refChannel$bio.var[1:20,-3] refChannel$adj.var = matrix(refChannel$adj.var[1:20,], ncol=1) refChannel$int.var = data.frame(refChannel$int.var[1:20,1]) snm.obj <- snm(refChannel$raw.data,refChannel$bio.var, refChannel$adj.var, refChannel$int.var) ks.test(snm.obj$pval[refChannel$true.nulls],"punif")
# reference design on channel level data # reference channel is included in bio.var refChannel <- sim.refDesign(12347) snm.obj <- snm(refChannel$raw.data,refChannel$bio.var, refChannel$adj.var, refChannel$int.var) ks.test(snm.obj$pval[refChannel$true.nulls],"punif") # this is a different model formulation # where reference channel is not included # in bio.var bio.var2 = matrix(c(rep(-1,10), rep(1, 10), rep(0,20)), ncol=1, dimnames=list(NULL, "treatment")) adj.var2 = matrix(c(rep(1,40), rep(1,20), rep(0,20)), ncol=2, byrow=FALSE, dimnames=list(NULL, c("intercept", "target1ref0"))) # compare bio.var2 to refChannel$bio.var and adj.var2 to refChannel$adj.var snm.obj <- snm(refChannel$raw.data, bio.var2, adj.var2, refChannel$int.var) ks.test(snm.obj$pval[refChannel$true.nulls],"punif") # reference design on log ratio data # that is, log(target/ref) data refChannel <- sim.refDesign(12347) refChannel$raw.data = refChannel$raw.data[,1:20]-refChannel$raw.data[,21:40] # removing reference channel variables refChannel$bio.var = refChannel$bio.var[1:20,-3] refChannel$adj.var = matrix(refChannel$adj.var[1:20,], ncol=1) refChannel$int.var = data.frame(refChannel$int.var[1:20,1]) snm.obj <- snm(refChannel$raw.data,refChannel$bio.var, refChannel$adj.var, refChannel$int.var) ks.test(snm.obj$pval[refChannel$true.nulls],"punif")
Simulates single channel data used as an example for snm function call.
sim.singleChannel(seed)
sim.singleChannel(seed)
seed |
Numeric value used to seed random number generator. |
Simulated data set influenced by a single probe-specific biological, two probe-specific adjustment variables, and intensity-dependent array effects. Data were simulated for a total of 25,000 probes and 50 arrays. The biological variable is a dichotomous variable specifying two groups (Group 1 and Group 2), with 25 arrays sampled from each group. The dichtomous probe-specific adjustment variables has 5 different levels and mimics a batch effect. The 5 batches each contain 10 samples, and are balanced with respect to the biological grouping factor. The continuous probe-specific adjustment variable is sampled from a Normal(1,0.1) and mimics an age effect. The baseline probe intensities were sampled from a chi(1,2) distribution. Any baseline intensities greater than 15 were set to a random value from the interval [15,16]. The random variation terms were sampled from a Normal(0,0.25) and the array functions were defined by randomly sampling coefficients for a two-dimensional B-spline basis functions from a Normal(0,1).
Randomly selected subsets of 30%, 10%, and 20% of the probes were defined as influenced by the biological groups, batch, and age variables, respectfully. The magnitude of the biological effects were sampled from a Normal(1,0.3) distribution, the probe-specific batch effects from a Normal(0,0.3) and the probe-specific age effects from a Normal(1,0.1). An instance of this simulated data can be generated using the code in the examples section below.
raw.data |
a 25,000 by 50 matrix of simulated data generated according to the description above. |
true.nulls |
a vector of indices corresponding to the rows in raw.data of the probes unaffected by the biological variable of interest |
bio.var |
a model matrix of the biological variable of interest. |
adj.var |
a model matrix of the adjustment variables |
int.var |
a data frame of the intensity-dependent adjustment variables |
Brig Mecham
snm
, sim.doubleChannel
, sim.preProcessed
, sim.refDesign
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var) ks.test(snm.obj$pval[singleChannel$true.nulls],"punif") ## End(Not run)
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var) ks.test(snm.obj$pval[singleChannel$true.nulls],"punif") ## End(Not run)
Implement Supervised Normalization of Microarrays on a gene expression matrix. Requires a set of biological covariates of interest and at least one probe-specific or intensity-dependent adjustment variable.
snm(raw.dat, bio.var=NULL, adj.var=NULL, int.var=NULL, weights=NULL, spline.dim = 4, num.iter = 10, lmer.max.iter=1000, nbins=20, rm.adj=FALSE, verbose=TRUE, diagnose=TRUE)
snm(raw.dat, bio.var=NULL, adj.var=NULL, int.var=NULL, weights=NULL, spline.dim = 4, num.iter = 10, lmer.max.iter=1000, nbins=20, rm.adj=FALSE, verbose=TRUE, diagnose=TRUE)
raw.dat |
An |
bio.var |
A model matrix (see |
adj.var |
A model matrix (see |
int.var |
A data frame with |
weights |
A vector of length |
spline.dim |
Dimension of basis spline used for array effects. |
num.iter |
Number of snm model fit iterations to run. |
lmer.max.iter |
Number of |
nbins |
Number of bins used by binning strategy. Array effects are calculated from a |
rm.adj |
If set to FALSE, then only the intensity dependent effects have been removed from the normalized data, implying the effects from the adjustment variables are still present. If TRUE, then the adjustment variables effects and the intensity dependent effects are both removed from the returned normalized data. |
verbose |
A flag telling the software whether or not to display a report after each iteration. TRUE produces the output. |
diagnose |
A flag telling the software whether or not to produce diagnostic output in the form of consecutive plots. TRUE produces the plot. |
This function implements the supervised normalization of microarrays algorithm described in Mecham, Nelson, and Storey (2010).
norm.dat |
The matrix of normalized data. The default setting is rm.adj=FALSE, which means that only the intensity-dependent effects have been subtracted from the data. If the user wants the adjustment variable effects removed as well, then set rm.adj=TRUE when calling the |
pvalues |
A vector of p-values testing the association of the biological variables with each probe. These p-values are obtained from an ANOVA comparing models where the full model contains both the probe-specific biological and adjustment variables versus a reduced model that just contains the probe-specific adjustment variables. The data used for this comparison has the intensity-dependent variables removed. These returned p-values are calculated after the final iteration of the algorithm. |
pi0 |
The estimated proportion of true null probes |
iter.pi0s |
A vector of length equal to num.iter containing the estimated |
nulls |
A vector indexing the probes utilized in estimating the intensity-dependent effects on the final iteration. |
M |
A matrix containing the estimated probe intensities for each array utilized in estimating the intensity-dependent effects on the final iteration. For memory parsimony, only a subset of values spanning the range is returned, currently nbins*100 values. |
array.fx |
A matrix of the final estimated intensity-dependent array effects. For memory parsimony, only a subset of values spanning the range is returned, currently nbins*100 values. |
bio.var |
The processed version of the same input variable. |
adj.var |
The processed version of the same input variable. |
int.var |
The processed version of the same input variable. |
df0 |
Degrees of freedom of the adjustment variables. |
df1 |
Degrees of freedom of the full model matrix, which includes the biological variables and the adjustment variables. |
raw.dat |
The input data. |
rm.var |
Same as the input (useful for later analyses). |
call |
Function call. |
It is necessary for adj.var and adj.var+bio.var to be valid model matrices (e.g., the models cannot be over-determined).
We suggest that the probe level data be analyzed on the log-transformed scale, particularly if the user wishes to remove intensity-dependent effects. It is recommended that the normalized data (and resulting inference) be inspected for latent structure using Surrogate Variable Analysis (Leek and Storey 2007, PLoS Genetics).
Brig Mecham and John D. Storey <[email protected]>
Mecham BH, Nelson PS, Storey JD (2010) Supervised normalization of microarrays. Bioinformatics, 26: 1308-1315.
model.matrix
, plot.snm
, fitted.snm
, summary.snm
, sim.singleChannel
, sim.doubleChannel
, sim.preProcessed
, sim.refDesign
singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var) ks.test(snm.obj$pval[singleChannel$true.nulls],"punif") plot(snm.obj) summary(snm.obj) snm.fit = fitted(snm.obj)
singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var) ks.test(snm.obj$pval[singleChannel$true.nulls],"punif") plot(snm.obj) summary(snm.obj) snm.fit = fitted(snm.obj)
Computes fitted values under models used in snm
normalization.
snm.fitted(object, ...) ## S3 method for class 'snm' fitted(object, ...)
snm.fitted(object, ...) ## S3 method for class 'snm' fitted(object, ...)
object |
Output from the |
... |
Not used. |
Returns the fitted values under the "null model" (adjustment variables only) and the "full model" (adjustment variables + biological variables).
fit0 |
Linear model fits when regressing each probe's normalized data on the null model, |
fit1 |
Linear model fits when regressing each probe's normalized data on the full model, |
These fits are useful for investigating the quality of the study-specific model used in the normalization. For example, the residuals can be obtained from the full model fit and examined for latent structure.
John D. Storey <[email protected]>
Mecham BH, Nelson PS, Storey JD (2010) Supervised normalization of microarrays. Bioinformatics, 26: 1308-1315.
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var[,-6], singleChannel$int.var, num.iter=10) snm.fit = fitted(snm.obj) res1 = snm.obj$norm.dat - snm.fit$fit1 snm.svd = fast.svd(res1) cor(snm.svd$v[,1], singleChannel$adj.var[,6]) plot(snm.svd$v[,1], singleChannel$adj.var[,6]) ## End(Not run)
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var[,-6], singleChannel$int.var, num.iter=10) snm.fit = fitted(snm.obj) res1 = snm.obj$norm.dat - snm.fit$fit1 snm.svd = fast.svd(res1) cor(snm.svd$v[,1], singleChannel$adj.var[,6]) plot(snm.svd$v[,1], singleChannel$adj.var[,6]) ## End(Not run)
Creates a diagnostic plot of the snm
fit.
snm.plot(x, col.by=NULL, ...) ## S3 method for class 'snm' plot(x, ...)
snm.plot(x, col.by=NULL, ...) ## S3 method for class 'snm' plot(x, ...)
x |
Output from the |
col.by |
A factor vector of length equal to the number of arrays providing a grouping by which to color the intensity-dependent effects. Instead of a factor vector, the input may also be a model matrix composed only of 0's and 1's with the number of rows equal to the number of arrays and number of columns less than or equal to the number of arrays. |
... |
Arguments passed to the plot functions. Not recommended. |
A four panel plot composed of the following:
Convergence of estimates over model fitting iterations. The
estimates for each iteration are compared to the
estimate calculated during the final model fit.
A scree plot of the principal components analysis of the full model residual matrix.
A plot of the estimated intensity-dependent effects.
A histogram of the p-values testing each probe for an asssociation with the biological variables (bio.var
). All probes to the right of the vertical red line are the least significant probes (i.e., those used in estimating intensity-dependent effects). The dashed horizontal line is the
estimate from the final model fit.
Nothing of interest.
John D. Storey <[email protected]>
Mecham BH, Nelson PS, Storey JD (2010) Supervised normalization of microarrays. Bioinformatics, 26: 1308-1315.
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var, num.iter=10) plot(snm.obj, col.by=snm.obj$bio.var) #color by biological group plot(snm.obj, col.by=snm.obj$adj.var[,-6]) #color by batch ## End(Not run)
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var, num.iter=10) plot(snm.obj, col.by=snm.obj$bio.var) #color by biological group plot(snm.obj, col.by=snm.obj$adj.var[,-6]) #color by batch ## End(Not run)
Provides a summary of the snm
fit.
snm.summary(object, cuts=c(0.0001, 0.001, 0.01, 0.025, 0.05, 0.10, 1), ...) ## S3 method for class 'snm' summary(object, ...)
snm.summary(object, cuts=c(0.0001, 0.001, 0.01, 0.025, 0.05, 0.10, 1), ...) ## S3 method for class 'snm' summary(object, ...)
object |
An object of class "snm", i.e., output from the |
cuts |
Cut points at which to calculate the number of significant probes. |
... |
Optional arguments to send to the |
A summary of the snm
fit.
Nothing of interest.
John D. Storey <[email protected]>
Mecham BH, Nelson PS, Storey JD (2010) Supervised normalization of microarrays. Bioinformatics, 26: 1308-1315.
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var, num.iter=10) summary(snm.obj) ## End(Not run)
## Not run: singleChannel <- sim.singleChannel(12345) snm.obj <- snm(singleChannel$raw.data, singleChannel$bio.var, singleChannel$adj.var, singleChannel$int.var, num.iter=10) summary(snm.obj) ## End(Not run)