Title: | Surrogate Variable Analysis |
---|---|
Description: | The sva package contains functions for removing batch effects and other unwanted variation in high-throughput experiment. Specifically, the sva package contains functions for the identifying and building surrogate variables for high-dimensional data sets. Surrogate variables are covariates constructed directly from high-dimensional data (like gene expression/RNA sequencing/methylation/brain imaging data) that can be used in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise. The sva package can be used to remove artifacts in three ways: (1) identifying and estimating surrogate variables for unknown sources of variation in high-throughput experiments (Leek and Storey 2007 PLoS Genetics,2008 PNAS), (2) directly removing known batch effects using ComBat (Johnson et al. 2007 Biostatistics) and (3) removing batch effects with known control probes (Leek 2014 biorXiv). Removing batch effects and using surrogate variables in differential expression analysis have been shown to reduce dependence, stabilize error rate estimates, and improve reproducibility, see (Leek and Storey 2007 PLoS Genetics, 2008 PNAS or Leek et al. 2011 Nat. Reviews Genetics). |
Authors: | Jeffrey T. Leek <[email protected]>, W. Evan Johnson <[email protected]>, Hilary S. Parker <[email protected]>, Elana J. Fertig <[email protected]>, Andrew E. Jaffe <[email protected]>, Yuqing Zhang <[email protected]>, John D. Storey <[email protected]>, Leonardo Collado Torres <[email protected]> |
Maintainer: | Jeffrey T. Leek <[email protected]>, John D. Storey <[email protected]>, W. Evan Johnson <[email protected]> |
License: | Artistic-2.0 |
Version: | 3.55.0 |
Built: | 2024-12-18 04:11:30 UTC |
Source: | https://github.com/bioc/sva |
ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using methodology described in Johnson et al. 2007. It uses either parametric or non-parametric empirical Bayes frameworks for adjusting data for batch effects. Users are returned an expression matrix that has been corrected for batch effects. The input data are assumed to be cleaned and normalized before batch effect removal.
ComBat( dat, batch, mod = NULL, par.prior = TRUE, prior.plots = FALSE, mean.only = FALSE, ref.batch = NULL, BPPARAM = bpparam("SerialParam") )
ComBat( dat, batch, mod = NULL, par.prior = TRUE, prior.plots = FALSE, mean.only = FALSE, ref.batch = NULL, BPPARAM = bpparam("SerialParam") )
dat |
Genomic measure matrix (dimensions probe x sample) - for example, expression matrix |
batch |
Batch covariate (only one batch allowed) |
mod |
Model matrix for outcome of interest and other covariates besides batch |
par.prior |
(Optional) TRUE indicates parametric adjustments will be used, FALSE indicates non-parametric adjustments will be used |
prior.plots |
(Optional) TRUE give prior plots with black as a kernel estimate of the empirical batch effect density and red as the parametric |
mean.only |
(Optional) FALSE If TRUE ComBat only corrects the mean of the batch effect (no scale adjustment) |
ref.batch |
(Optional) NULL If given, will use the selected batch as a reference for batch adjustment. |
BPPARAM |
(Optional) BiocParallelParam for parallel operation |
data A probe x sample genomic measure matrix, adjusted for batch effects.
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) batch = pheno$batch mod = model.matrix(~as.factor(cancer), data=pheno) # parametric adjustment combat_edata1 = ComBat(dat=edata, batch=batch, mod=NULL, par.prior=TRUE, prior.plots=FALSE) # non-parametric adjustment, mean-only version combat_edata2 = ComBat(dat=edata, batch=batch, mod=NULL, par.prior=FALSE, mean.only=TRUE) # reference-batch version, with covariates combat_edata3 = ComBat(dat=edata, batch=batch, mod=mod, par.prior=TRUE, ref.batch=3)
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) batch = pheno$batch mod = model.matrix(~as.factor(cancer), data=pheno) # parametric adjustment combat_edata1 = ComBat(dat=edata, batch=batch, mod=NULL, par.prior=TRUE, prior.plots=FALSE) # non-parametric adjustment, mean-only version combat_edata2 = ComBat(dat=edata, batch=batch, mod=NULL, par.prior=FALSE, mean.only=TRUE) # reference-batch version, with covariates combat_edata3 = ComBat(dat=edata, batch=batch, mod=mod, par.prior=TRUE, ref.batch=3)
ComBat_seq is an improved model from ComBat using negative binomial regression, which specifically targets RNA-Seq count data.
ComBat_seq( counts, batch, group = NULL, covar_mod = NULL, full_mod = TRUE, shrink = FALSE, shrink.disp = FALSE, gene.subset.n = NULL )
ComBat_seq( counts, batch, group = NULL, covar_mod = NULL, full_mod = TRUE, shrink = FALSE, shrink.disp = FALSE, gene.subset.n = NULL )
counts |
Raw count matrix from genomic studies (dimensions gene x sample) |
batch |
Vector / factor for batch |
group |
Vector / factor for biological condition of interest |
covar_mod |
Model matrix for multiple covariates to include in linear model (signals from these variables are kept in data after adjustment) |
full_mod |
Boolean, if TRUE include condition of interest in model |
shrink |
Boolean, whether to apply shrinkage on parameter estimation |
shrink.disp |
Boolean, whether to apply shrinkage on dispersion |
gene.subset.n |
Number of genes to use in empirical Bayes estimation, only useful when shrink = TRUE |
data A gene x sample count matrix, adjusted for batch effects.
count_matrix <- matrix(rnbinom(400, size=10, prob=0.1), nrow=50, ncol=8) batch <- c(rep(1, 4), rep(2, 4)) group <- rep(c(0,1), 4) # include condition (group variable) adjusted_counts <- ComBat_seq(count_matrix, batch=batch, group=group, full_mod=TRUE) # do not include condition adjusted_counts <- ComBat_seq(count_matrix, batch=batch, group=NULL, full_mod=FALSE)
count_matrix <- matrix(rnbinom(400, size=10, prob=0.1), nrow=50, ncol=8) batch <- c(rep(1, 4), rep(2, 4)) group <- rep(c(0,1), 4) # include condition (group variable) adjusted_counts <- ComBat_seq(count_matrix, batch=batch, group=group, full_mod=TRUE) # do not include condition adjusted_counts <- ComBat_seq(count_matrix, batch=batch, group=NULL, full_mod=FALSE)
This function uses the iteratively reweighted surrogate variable analysis approach to estimate the probability that each gene is an empirical control.
empirical.controls( dat, mod, mod0 = NULL, n.sv, B = 5, type = c("norm", "counts") )
empirical.controls( dat, mod, mod0 = NULL, n.sv, B = 5, type = c("norm", "counts") )
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
mod0 |
The null model being compared when fitting the data |
n.sv |
The number of surogate variables to estimate |
B |
The number of iterations of the irwsva algorithm to perform |
type |
If type is norm then standard irwsva is applied, if type is counts, then the moderated log transform is applied first |
pcontrol A vector of probabilites that each gene is a control.
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") pcontrol <- empirical.controls(edata,mod,mod0=NULL,n.sv=n.sv,type="norm")
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") pcontrol <- empirical.controls(edata,mod,mod0=NULL,n.sv=n.sv,type="norm")
This function does simple linear algebra to calculate f-statistics for each row of a data matrix comparing the nested models defined by the design matrices for the alternative (mod) and and null (mod0) cases. The columns of mod0 must be a subset of the columns of mod.
f.pvalue(dat, mod, mod0)
f.pvalue(dat, mod, mod0)
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
mod0 |
The null model being compared when fitting the data |
p A vector of F-statistic p-values one for each row of dat.
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) pValues = f.pvalue(edata,mod,mod0) qValues = p.adjust(pValues,method="BH")
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) pValues = f.pvalue(edata,mod,mod0) qValues = p.adjust(pValues,method="BH")
This function does simple linear algebra to calculate f-statistics for each row of a data matrix comparing the nested models defined by the design matrices for the alternative (mod) and and null (mod0) cases. The columns of mod0 must be a subset of the columns of mod.
fstats(dat, mod, mod0)
fstats(dat, mod, mod0)
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
mod0 |
The null model being compared when fitting the data |
fstats A vector of F-statistics one for each row of dat.
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) fs <- fstats(edata, mod, mod0)
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) fs <- fstats(edata, mod, mod0)
This function performs frozen surrogate variable analysis as described in Parker, Corrada Bravo and Leek 2013.
The approach uses a training database to create surrogate variables which are then used
to remove batch effects both from the training database and a new data set for prediction purposes.
For inferential analysis see sva
, svaseq
, with low level functionality available
through irwsva.build
and ssva
.
fsva(dbdat, mod, sv, newdat = NULL, method = c("fast", "exact"))
fsva(dbdat, mod, sv, newdat = NULL, method = c("fast", "exact"))
dbdat |
A m genes by n arrays matrix of expression data from the database/training data |
mod |
The model matrix for the terms included in the analysis for the training data |
sv |
The surrogate variable object created by running sva on dbdat using mod. |
newdat |
(optional) A set of test samples to be adjusted using the training database |
method |
If method ="fast" then the SVD is calculated using an online approach, this may introduce slight bias. If method="exact" the exact SVD is calculated, but will be slower |
db An adjusted version of the training database where the effect of batch/expression heterogeneity has been removed
new An adjusted version of the new samples, adjusted one at a time using the fsva methodology.
newsv Surrogate variables for the new samples
library(bladderbatch) library(pamr) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) set.seed(1234) trainIndicator = sample(1:57,size=30,replace=FALSE) testIndicator = (1:57)[-trainIndicator] trainData = edata[,trainIndicator] testData = edata[,testIndicator] trainPheno = pheno[trainIndicator,] testPheno = pheno[testIndicator,] mydata = list(x=trainData,y=trainPheno$cancer) mytrain = pamr.train(mydata) table(pamr.predict(mytrain,testData,threshold=2),testPheno$cancer) trainMod = model.matrix(~cancer,data=trainPheno) trainMod0 = model.matrix(~1,data=trainPheno) trainSv = sva(trainData,trainMod,trainMod0) fsvaobj = fsva(trainData,trainMod,trainSv,testData) mydataSv = list(x=fsvaobj$db,y=trainPheno$cancer) mytrainSv = pamr.train(mydataSv) table(pamr.predict(mytrainSv,fsvaobj$new,threshold=1),testPheno$cancer)
library(bladderbatch) library(pamr) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) set.seed(1234) trainIndicator = sample(1:57,size=30,replace=FALSE) testIndicator = (1:57)[-trainIndicator] trainData = edata[,trainIndicator] testData = edata[,testIndicator] trainPheno = pheno[trainIndicator,] testPheno = pheno[testIndicator,] mydata = list(x=trainData,y=trainPheno$cancer) mytrain = pamr.train(mydata) table(pamr.predict(mytrain,testData,threshold=2),testPheno$cancer) trainMod = model.matrix(~cancer,data=trainPheno) trainMod0 = model.matrix(~1,data=trainPheno) trainSv = sva(trainData,trainMod,trainMod0) fsvaobj = fsva(trainData,trainMod,trainSv,testData) mydataSv = list(x=fsvaobj$db,y=trainPheno$cancer) mytrainSv = pamr.train(mydataSv) table(pamr.predict(mytrainSv,fsvaobj$new,threshold=1),testPheno$cancer)
This function is the implementation of the iteratively re-weighted least squares
approach for estimating surrogate variables. As a buy product, this function
produces estimates of the probability of being an empirical control. See the function
empirical.controls
for a direct estimate of the empirical controls.
irwsva.build(dat, mod, mod0 = NULL, n.sv, B = 5)
irwsva.build(dat, mod, mod0 = NULL, n.sv, B = 5)
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
mod0 |
The null model being compared when fitting the data |
n.sv |
The number of surogate variables to estimate |
B |
The number of iterations of the irwsva algorithm to perform |
sv The estimated surrogate variables, one in each column
pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity
pprob.b A vector of the posterior probabilities each gene is affected by mod
n.sv The number of significant surrogate variables
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") res <- irwsva.build(edata, mod, mod0 = NULL,n.sv,B=5)
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") res <- irwsva.build(edata, mod, mod0 = NULL,n.sv,B=5)
This function estimates the number of surrogate variables that should be included in a differential expression model. The default approach is based on a permutation procedure originally prooposed by Buja and Eyuboglu 1992. The function also provides an interface to the asymptotic approach proposed by Leek 2011 Biometrics.
num.sv(dat, mod, method = c("be", "leek"), vfilter = NULL, B = 20, seed = NULL)
num.sv(dat, mod, method = c("be", "leek"), vfilter = NULL, B = 20, seed = NULL)
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
method |
One of "be" or "leek" as described in the details section |
vfilter |
You may choose to filter to the vfilter most variable rows before performing the analysis |
B |
The number of permutaitons to use if method = "be" |
seed |
Set a seed when using the permutation approach |
n.sv The number of surrogate variables to use in the sva software
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek")
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek")
This function is the implementation of the two step approach for estimating surrogate
variables proposed by Leek and Storey 2007 PLoS Genetics. This function is primarily
included for backwards compatibility. Newer versions of the sva algorithm are available
through sva
, svaseq
, with low level functionality available
through irwsva.build
and ssva
.
psva(dat, batch, ...)
psva(dat, batch, ...)
dat |
The transformed data matrix with the variables in rows and samples in columns |
batch |
A factor variable giving the known batch levels |
... |
Other arguments to the |
psva.D Data with batch effect removed but biological heterogeneity preserved
Elana J. Fertig
library(bladderbatch) library(limma) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) batch = pheno$batch batch.fac = as.factor(batch) psva_data <- psva(edata,batch.fac)
library(bladderbatch) library(limma) data(bladderdata) dat <- bladderEset[1:50,] pheno = pData(dat) edata = exprs(dat) batch = pheno$batch batch.fac = as.factor(batch) psva_data <- psva(edata,batch.fac)
This function computes quality surrogate variables (qSVs) from the library-size- and read-length-normalized degradation matrix for subsequent RNA quality correction
qsva( degradationMatrix, mod = matrix(1, ncol = 1, nrow = ncol(degradationMatrix)) )
qsva( degradationMatrix, mod = matrix(1, ncol = 1, nrow = ncol(degradationMatrix)) )
degradationMatrix |
the normalized degradation matrix, region by sample |
mod |
(Optional) statistical model used in DE analysis |
the qSV adjustment variables
## Find files bwPath <- system.file('extdata', 'bwtool', package = 'sva') ## Read the data degCovAdj = read.degradation.matrix( covFiles = list.files(bwPath,full.names=TRUE), sampleNames = list.files(bwPath), readLength = 76, totalMapped = rep(100e6,5),type="bwtool") ## Input data head(degCovAdj) ## Results qsva(degCovAdj)
## Find files bwPath <- system.file('extdata', 'bwtool', package = 'sva') ## Read the data degCovAdj = read.degradation.matrix( covFiles = list.files(bwPath,full.names=TRUE), sampleNames = list.files(bwPath), readLength = 76, totalMapped = rep(100e6,5),type="bwtool") ## Input data head(degCovAdj) ## Results qsva(degCovAdj)
This function reads in degradation regions to form a library-size- and read-length-normalized degradation matrix for subsequent RNA quality correction
read.degradation.matrix( covFiles, sampleNames, totalMapped, readLength = 100, normFactor = 8e+07, type = c("bwtool", "region_matrix_single", "region_matrix_all"), BPPARAM = bpparam() )
read.degradation.matrix( covFiles, sampleNames, totalMapped, readLength = 100, normFactor = 8e+07, type = c("bwtool", "region_matrix_single", "region_matrix_all"), BPPARAM = bpparam() )
covFiles |
coverage file(s) for degradation regions |
sampleNames |
sample names; creates column names of degradation matrix |
totalMapped |
how many reads per sample (library size normalization) |
readLength |
read length in base pairs (read length normalization) |
normFactor |
common library size to normalize to; 80M reads as default |
type |
whether input are individual 'bwtool' output, 'region_matrix' run on individual samples, or 'region_matrix' run on all samples together |
BPPARAM |
(Optional) BiocParallelParam for parallel operation |
the normalized degradation matrix, region by sample
# bwtool bwPath = system.file('extdata', 'bwtool', package = 'sva') degCovAdj = read.degradation.matrix( covFiles = list.files(bwPath,full.names=TRUE), sampleNames = list.files(bwPath), readLength = 76, totalMapped = rep(100e6,5),type="bwtool") head(degCovAdj) # region_matrix: each sample r1Path = system.file('extdata', 'region_matrix_one', package = 'sva') degCovAdj1 = read.degradation.matrix( covFiles = list.files(r1Path,full.names=TRUE), sampleNames = list.files(r1Path), readLength = 76, totalMapped = rep(100e6,5),type="region_matrix_single") head(degCovAdj1) r2Path = system.file('extdata', 'region_matrix_all', package = 'sva') degCovAdj2 = read.degradation.matrix( covFiles = list.files(r2Path,full.names=TRUE), sampleNames = list.files(r1Path), readLength = 76, totalMapped = rep(100e6,5),type="region_matrix_all") head(degCovAdj2)
# bwtool bwPath = system.file('extdata', 'bwtool', package = 'sva') degCovAdj = read.degradation.matrix( covFiles = list.files(bwPath,full.names=TRUE), sampleNames = list.files(bwPath), readLength = 76, totalMapped = rep(100e6,5),type="bwtool") head(degCovAdj) # region_matrix: each sample r1Path = system.file('extdata', 'region_matrix_one', package = 'sva') degCovAdj1 = read.degradation.matrix( covFiles = list.files(r1Path,full.names=TRUE), sampleNames = list.files(r1Path), readLength = 76, totalMapped = rep(100e6,5),type="region_matrix_single") head(degCovAdj1) r2Path = system.file('extdata', 'region_matrix_all', package = 'sva') degCovAdj2 = read.degradation.matrix( covFiles = list.files(r2Path,full.names=TRUE), sampleNames = list.files(r1Path), readLength = 76, totalMapped = rep(100e6,5),type="region_matrix_all") head(degCovAdj2)
This function implements a supervised surrogate variable analysis approach
where genes/probes known to be affected by artifacts but not by the biological
variables of interest are assumed to be known in advance. This supervised sva
approach can be called through the sva
and svaseq
functions
by specifying controls.
ssva(dat, controls, n.sv)
ssva(dat, controls, n.sv)
dat |
The transformed data matrix with the variables in rows and samples in columns |
controls |
A vector of probabilities (between 0 and 1, inclusive) that each gene is a control. A value of 1 means the gene is certainly a control and a value of 0 means the gene is certainly not a control. |
n.sv |
The number of surogate variables to estimate |
sv The estimated surrogate variables, one in each column
pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity (exactly equal to controls for ssva)
pprob.b A vector of the posterior probabilities each gene is affected by mod (always null for ssva)
n.sv The number of significant surrogate variables
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") set.seed(1234) controls <- runif(nrow(edata)) ssva_res <- ssva(edata,controls,n.sv)
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") set.seed(1234) controls <- runif(nrow(edata)) ssva_res <- ssva(edata,controls,n.sv)
sva has functionality to estimate and remove artifacts from high dimensional data
the sva
function can be used to estimate artifacts from microarray data
the svaseq
function can be used to estimate artifacts from count-based
RNA-sequencing (and other sequencing) data. The ComBat
function can be
used to remove known batch effecs from microarray data. The fsva
function
can be used to remove batch effects for prediction problems.
This function is the implementation of the iteratively re-weighted least squares
approach for estimating surrogate variables. As a by product, this function
produces estimates of the probability of being an empirical control. See the function
empirical.controls
for a direct estimate of the empirical controls.
sva( dat, mod, mod0 = NULL, n.sv = NULL, controls = NULL, method = c("irw", "two-step", "supervised"), vfilter = NULL, B = 5, numSVmethod = "be" )
sva( dat, mod, mod0 = NULL, n.sv = NULL, controls = NULL, method = c("irw", "two-step", "supervised"), vfilter = NULL, B = 5, numSVmethod = "be" )
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
mod0 |
The null model being compared when fitting the data |
n.sv |
The number of surogate variables to estimate |
controls |
A vector of probabilities (between 0 and 1, inclusive) that each gene is a control. A value of 1 means the gene is certainly a control and a value of 0 means the gene is certainly not a control. |
method |
For empirical estimation of control probes use "irw". If control probes are known use "supervised" |
vfilter |
You may choose to filter to the vfilter most variable rows before performing the analysis. vfilter must be NULL if method is "supervised" |
B |
The number of iterations of the irwsva algorithm to perform |
numSVmethod |
If n.sv is NULL, sva will attempt to estimate the number of needed surrogate variables. This should not be adapted by the user unless they are an expert. |
A vignette is available by typing browseVignettes("sva")
in the R prompt.
sv The estimated surrogate variables, one in each column
pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity
pprob.b A vector of the posterior probabilities each gene is affected by mod
n.sv The number of significant surrogate variables
Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Andrew E. Jaffe, John D. Storey, Yuqing Zhang
For the package: Leek JT, Johnson WE, Parker HS, Jaffe AE, and Storey JD. (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics DOI:10.1093/bioinformatics/bts034
For sva: Leek JT and Storey JD. (2008) A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences , 105: 18718-18723.
For sva: Leek JT and Storey JD. (2007) Capturing heterogeneity in gene expression studies by ‘Surrogate Variable Analysis’. PLoS Genetics, 3: e161.
For Combat: Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8 (1), 118-127
For svaseq: Leek JT (2014) svaseq: removing batch and other artifacts from count-based sequencing data. bioRxiv doi: TBD
For fsva: Parker HS, Bravo HC, Leek JT (2013) Removing batch effects for prediction problems with frozen surrogate variable analysis arXiv:1301.3947
For psva: Parker HS, Leek JT, Favorov AV, Considine M, Xia X, Chavan S, Chung CH, Fertig EJ (2014) Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction Bioinformatics doi: 10.1093/bioinformatics/btu375
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) n.sv = num.sv(edata,mod,method="leek") svobj = sva(edata,mod,mod0,n.sv=n.sv)
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) n.sv = num.sv(edata,mod,method="leek") svobj = sva(edata,mod,mod0,n.sv=n.sv)
This function corrects a gene expression matrix prior to network inference by returning the residuals after regressing out the top principal components. The number of principal components to remove can be determined using a permutation-based approach using the "num.sv" function with method = "be"
sva_network(dat, n.pc)
sva_network(dat, n.pc)
dat |
The uncorrected normalized gene expression data matrix with samples in rows and genes in columns |
n.pc |
The number of principal components to remove |
dat.adjusted Cleaned gene expression data matrix with the top prinicpal components removed
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] edata = exprs(dat) mod = matrix(1, nrow = dim(dat)[2], ncol = 1) n.pc = num.sv(edata, mod, method="be") dat.adjusted = sva_network(t(edata), n.pc)
library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] edata = exprs(dat) mod = matrix(1, nrow = dim(dat)[2], ncol = 1) n.pc = num.sv(edata, mod, method="be") dat.adjusted = sva_network(t(edata), n.pc)
This function is designed to check for degenerate cases in the sva fit and fix the sva object where possible.
sva.check(svaobj, dat, mod, mod0)
sva.check(svaobj, dat, mod, mod0)
svaobj |
The transformed data matrix with the variables in rows and samples in columns |
dat |
The data set that was used to build the surrogate variables |
mod |
The model matrix being used to fit the data |
mod0 |
The null model matrix being used to fit the data |
empirical.controls
for a direct estimate of the empirical controls.
sv The estimated surrogate variables, one in each column
pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity
pprob.b A vector of the posterior probabilities each gene is affected by mod
n.sv The number of significant surrogate variables
library(bladderbatch) data(bladderdata) #dat <- bladderEset dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) n.sv = num.sv(edata,mod,method="leek") svobj = sva(edata,mod,mod0,n.sv=n.sv) svacheckobj = sva.check(svobj,edata,mod,mod0)
library(bladderbatch) data(bladderdata) #dat <- bladderEset dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) n.sv = num.sv(edata,mod,method="leek") svobj = sva(edata,mod,mod0,n.sv=n.sv) svacheckobj = sva.check(svobj,edata,mod,mod0)
This function is the implementation of the iteratively re-weighted least squares
approach for estimating surrogate variables. As a by product, this function
produces estimates of the probability of being an empirical control. This function first
applies a moderated log transform as described in Leek 2014 before calculating the surrogate
variables. See the function empirical.controls
for a direct estimate of the empirical controls.
svaseq( dat, mod, mod0 = NULL, n.sv = NULL, controls = NULL, method = c("irw", "two-step", "supervised"), vfilter = NULL, B = 5, numSVmethod = "be", constant = 1 )
svaseq( dat, mod, mod0 = NULL, n.sv = NULL, controls = NULL, method = c("irw", "two-step", "supervised"), vfilter = NULL, B = 5, numSVmethod = "be", constant = 1 )
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
mod0 |
The null model being compared when fitting the data |
n.sv |
The number of surogate variables to estimate |
controls |
A vector of probabilities (between 0 and 1, inclusive) that each gene is a control. A value of 1 means the gene is certainly a control and a value of 0 means the gene is certainly not a control. |
method |
For empirical estimation of control probes use "irw". If control probes are known use "supervised" |
vfilter |
You may choose to filter to the vfilter most variable rows before performing the analysis. vfilter must be NULL if method is "supervised" |
B |
The number of iterations of the irwsva algorithm to perform |
numSVmethod |
If n.sv is NULL, sva will attempt to estimate the number of needed surrogate variables. This should not be adapted by the user unless they are an expert. |
constant |
The function takes log(dat + constant) before performing sva. By default constant = 1, all values of dat + constant should be positive. |
sv The estimated surrogate variables, one in each column
pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity
pprob.b A vector of the posterior probabilities each gene is affected by mod
n.sv The number of significant surrogate variables
library(zebrafishRNASeq) data(zfGenes) filter = apply(zfGenes, 1, function(x) length(x[x>5])>=2) filtered = zfGenes[filter,] genes = rownames(filtered)[grep("^ENS", rownames(filtered))] controls = grepl("^ERCC", rownames(filtered)) group = as.factor(rep(c("Ctl", "Trt"), each=3)) dat0 = as.matrix(filtered) mod1 = model.matrix(~group) mod0 = cbind(mod1[,1]) svseq = svaseq(dat0,mod1,mod0,n.sv=1)$sv plot(svseq,pch=19,col="blue")
library(zebrafishRNASeq) data(zfGenes) filter = apply(zfGenes, 1, function(x) length(x[x>5])>=2) filtered = zfGenes[filter,] genes = rownames(filtered)[grep("^ENS", rownames(filtered))] controls = grepl("^ERCC", rownames(filtered)) group = as.factor(rep(c("Ctl", "Trt"), each=3)) dat0 = as.matrix(filtered) mod1 = model.matrix(~group) mod0 = cbind(mod1[,1]) svseq = svaseq(dat0,mod1,mod0,n.sv=1)$sv plot(svseq,pch=19,col="blue")
This function is the implementation of the two step approach for estimating surrogate
variables proposed by Leek and Storey 2007 PLoS Genetics. This function is primarily
included for backwards compatibility. Newer versions of the sva algorithm are available
through sva
, svaseq
, with low level functionality available
through irwsva.build
and ssva
.
twostepsva.build(dat, mod, n.sv)
twostepsva.build(dat, mod, n.sv)
dat |
The transformed data matrix with the variables in rows and samples in columns |
mod |
The model matrix being used to fit the data |
n.sv |
The number of surogate variables to estimate |
sv The estimated surrogate variables, one in each column
pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity
pprob.b A vector of the posterior probabilities each gene is affected by mod (this is always null for the two-step approach)
n.sv The number of significant surrogate variables
library(bladderbatch) library(limma) data(bladderdata) dat <- bladderEset pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") svatwostep <- twostepsva.build(edata,mod,n.sv)
library(bladderbatch) library(limma) data(bladderdata) dat <- bladderEset pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) n.sv = num.sv(edata,mod,method="leek") svatwostep <- twostepsva.build(edata,mod,n.sv)