Title: | Dispersion shrinkage for sequencing data |
---|---|
Description: | DSS is an R library performing differntial analysis for count-based sequencing data. It detectes differentially expressed genes (DEGs) from RNA-seq, and differentially methylated loci or regions (DML/DMRs) from bisulfite sequencing (BS-seq). The core of DSS is a new dispersion shrinkage method for estimating the dispersion parameter from Gamma-Poisson or Beta-Binomial distributions. |
Authors: | Hao Wu<[email protected]>, Hao Feng<[email protected]> |
Maintainer: | Hao Wu <[email protected]>, Hao Feng <[email protected]> |
License: | GPL |
Version: | 2.55.0 |
Built: | 2024-11-25 06:12:57 UTC |
Source: | https://github.com/bioc/DSS |
DSS is an R library performing the differential expression analysis for RNA-seq count data. Compared with other similar packages (DESeq, edgeR), DSS implements a new dispersion shrinkage method to estimate the gene-specific biological variance. Extensive simulation results showed that DSS performs favorabily compared to DESeq and edgeR when the variation of biological variances is large.
DSS only works for two group comparison at this time. We plan to extend the functionalities and make it work for more general experimental designs in the near future.
Hao Wu <[email protected]>
This function takes the results from DML testing procedure ('DMLtest' function) and calls DMLs. Regions will CpG sites being statistically significant are deemed as DMLs.
callDML(DMLresult, delta=0.1, p.threshold=1e-5)
callDML(DMLresult, delta=0.1, p.threshold=1e-5)
DMLresult |
A data frame representing the results for DML detection. This should be the result returned from 'DMLtest' function. |
delta |
A threshold for defining DML. In DML testing procedure, hypothesis test that the two groups means are equalis is conducted at each CpG site. Here if 'delta' is specified, the function will compute the posterior probability that the difference of the means are greater than delta,and then call DML based on that. This only works when the test results are from 'DMLtest', which is for two-group comparison. For general design, this has to be set to 0. |
p.threshold |
When delta is not specified, this is the threshold of p-values for defining DML, e.g. Loci with p-values less than this threshold will be deemed DMLs. When delta is specified, CpG sites with posterior probability greater than 1-p.threshold are deemed DML. |
A data frame for DMLs. Each row is for a DML. DMLs are sorted by statistical significance. The columns are
chr |
Chromosome number. |
pos |
Genomic coordinates. |
mu1 , mu2
|
Mean methylations of two groups. |
diff |
Difference of mean methylations of two groups. |
diff.se |
Standard error of the methylation difference. |
stat |
Wald statistics. |
phi1 , phi2
|
Estimated dispersions in two groups. |
pval |
P-values. This is obtained from normal distribution. |
fdr |
False discovery rate. |
postprob.overThreshold |
The posterior probability of the difference in methylation greater than delta. This columns is only available when delta>0. |
Hao Wu <[email protected]>
DMLtest, callDMR
## Not run: require(bsseq) ## first read in methylation data. path <- file.path(system.file(package="DSS"), "extdata") dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE) dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE) dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE) dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE) ## make BSseq objects BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2), c("C1","C2", "N1", "N2") ) ## DML test dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1","N2")) ## call DML dmls <- callDML(dmlTest) head(dmls) ## call DML with a threshold dmls2 <- callDML(dmlTest, delta=0.1) head(dmls2) ## For whole-genome BS-seq data, perform DML test with smoothing require(bsseqData) data(BS.cancer.ex) ## takea smallportionof data and test BSobj <- BS.cancer.ex[10000:15000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) dmls <- callDML(dmlTest) head(dmls) ## from multifactor design data(RRBS) DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") dml = callDML(DMLtest.cell, p.threshold=0.05) ## this produce a warning dml = callDML(DMLtest.cell, delta=0, p.threshold=0.05) ## no warning head(dml) ## End(Not run)
## Not run: require(bsseq) ## first read in methylation data. path <- file.path(system.file(package="DSS"), "extdata") dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE) dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE) dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE) dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE) ## make BSseq objects BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2), c("C1","C2", "N1", "N2") ) ## DML test dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1","N2")) ## call DML dmls <- callDML(dmlTest) head(dmls) ## call DML with a threshold dmls2 <- callDML(dmlTest, delta=0.1) head(dmls2) ## For whole-genome BS-seq data, perform DML test with smoothing require(bsseqData) data(BS.cancer.ex) ## takea smallportionof data and test BSobj <- BS.cancer.ex[10000:15000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) dmls <- callDML(dmlTest) head(dmls) ## from multifactor design data(RRBS) DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") dml = callDML(DMLtest.cell, p.threshold=0.05) ## this produce a warning dml = callDML(DMLtest.cell, delta=0, p.threshold=0.05) ## no warning head(dml) ## End(Not run)
This function takes the results from DML testing procedure ('callDML' function) and calls DMRs. Regions will CpG sites being statistically significant are detected as DMRs. Nearby DMRs are merged into longer ones. Some restrictions including the minimum length, minimum number of CpG sites, etc. are applied.
callDMR(DMLresult, delta=0, p.threshold=1e-5, minlen=50, minCG=3, dis.merge=100, pct.sig=0.5)
callDMR(DMLresult, delta=0, p.threshold=1e-5, minlen=50, minCG=3, dis.merge=100, pct.sig=0.5)
DMLresult |
A data frame representing the results for DML detection. This should be the result returned from 'DMLtest' or 'DMLtest.multiFactor' function. |
delta |
A threshold for defining DMR. In DML detection procedure, a hypothesis test that the two groups means are equal is conducted at each CpG site. Here if 'delta' is specified, the function will compute the posterior probability that the difference of the means are greater than delta, and then construct DMR based on that. This only works when the test results are from 'DMLtest', which is for two-group comparison. |
p.threshold |
A threshold of p-values for calling DMR. Loci with p-values less than this threshold will be picked and joint to form the DMRs. See 'details' for more information. |
minlen |
Minimum length (in basepairs) required for DMR. Default is 50 bps. |
minCG |
Minimum number of CpG sites required for DMR. Default is 3. |
dis.merge |
When two DMRs are very close to each other and the distance (in bps) is less than this number, they will be merged into one. Default is 50 bps. |
pct.sig |
In all DMRs, the percentage of CG sites with significant p-values (less than p.threshold) must be greater than this threshold. Default is 0.5. This is mainly used for correcting the effects of merging of nearby DMRs. |
The choices of 'delta' and 'p.threshold' are somewhat arbitrary. The default value for p-value threshold for calling DMR is 1e-5. The statistical test on loci level is less powerful when smoothing is NOT applied, so users can consider to use a less stringent criteria, such as 0.001, in order to get satisfactory number of DMRs. This function is reasonably fast since the computationally intesnsive part is in 'DMLtest'. Users can try different p.threshold values to obtain satisfactory results.
'delta' is only supported when the experiment is for two-group comparison. This is because in multifactor design, the estimated coefficients in the regression are based on a GLM framework (loosely speaking), thus they don't have clear meaning of methylation level differences. So when the input DMLresult is from DMLtest.multiFactor, 'delta' cannot be specified.
When specifying a 'delta' value, the posterior probability (pp) of each CpG site being DML is computed. Then the p.threshold is applied on 1-pp, e.g., sites with 1-pp<p.threshold is deemed significant. In this case, the criteria for DMR calling is more stringent and users might consider to use a more liberal p.threshold in order to get more regions.
A data frame for DMRs. Each row is for a DMR. Rows are sorted by "areaStat", which is the sum of test statistics of all CpG sites in the region. The columns are:
chr |
Chromosome number. |
start , end
|
Genomic coordinates. |
length |
Length of the DMR, in bps. |
nCG |
Number of CpG sites contained in the DMR. |
meanMethy1 , meanMethy2
|
Average methylation levels in two conditions. |
diff.Methy |
The difference in the methylation levels between two conditions. diff.Methy=meanMethy1-meanMethy2. |
areaStat |
The sum of the test statistics of all CpG sites within the DMR. |
Hao Wu <[email protected]>
DMLtest, callDML
## Not run: require(bsseq) require(bsseqData) data(BS.cancer.ex) ## take a small portion of data and test BSobj <- BS.cancer.ex[140000:150000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) ## call DMR based on test results dmrs <- callDMR(dmlTest) head(dmrs) ## or one can specify a threshold for difference in methylation level dmrs2 <- callDMR(dmlTest, delta=0.1) head(dmrs2) ## visualize one DMR showOneDMR(dmrs[1,], BSobj) ## from multifactor design - using a loose threshold to demonstrate data(RRBS) DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") dmr = callDMR(DMLtest.cell, p.threshold=0.05) dmr ## End(Not run)
## Not run: require(bsseq) require(bsseqData) data(BS.cancer.ex) ## take a small portion of data and test BSobj <- BS.cancer.ex[140000:150000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) ## call DMR based on test results dmrs <- callDMR(dmlTest) head(dmrs) ## or one can specify a threshold for difference in methylation level dmrs2 <- callDMR(dmlTest, delta=0.1) head(dmrs2) ## visualize one DMR showOneDMR(dmrs[1,], BSobj) ## from multifactor design - using a loose threshold to demonstrate data(RRBS) DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") dmr = callDMR(DMLtest.cell, p.threshold=0.05) dmr ## End(Not run)
The RRBS dataset is from 16 samples with two factors (case and cell), each has two levels (so it's a 2x2 design).
data(RRBS)
data(RRBS)
data(RRBS) RRBS design
data(RRBS) RRBS design
Dispersion parameter for a gene represents its coefficient of variation of expressions. It characterizes the biological variations.
## S4 method for signature 'SeqCountSet' dispersion(object) ## S4 replacement method for signature 'SeqCountSet,numeric' dispersion(object) <- value
## S4 method for signature 'SeqCountSet' dispersion(object) ## S4 replacement method for signature 'SeqCountSet,numeric' dispersion(object) <- value
object |
A SeqCountData object. |
value |
A numeric vector with the same length as number of genes. |
If the counts from biological replicates are modeled as negative binomial distribution, the variance (v) and mean (m) should hold following relationship: v=m+m^2*phi, where phi is the dispersion. Another interpretation is that phi represents the biological variations among replicates when underlying expressions are modeled as a Gamma distribution.
Hao Wu <[email protected]>
normalizationFactor
data(seqData) ## obtain seqData=estNormFactors(seqData, "quantile") seqData=estDispersion(seqData) dispersion(seqData) ## assign dispersion(seqData)=rep(0.1, nrow(exprs(seqData)))
data(seqData) ## obtain seqData=estNormFactors(seqData, "quantile") seqData=estDispersion(seqData) dispersion(seqData) ## assign dispersion(seqData)=rep(0.1, nrow(exprs(seqData)))
This function takes a BSseq object, a data frame for experimental design and a model formula and then fit a linear model.
DMLfit.multiFactor(BSobj, design, formula, smoothing=FALSE, smoothing.span=500)
DMLfit.multiFactor(BSobj, design, formula, smoothing=FALSE, smoothing.span=500)
BSobj |
An object of BSseq class for the BS-seq data. |
design |
A data frame for experimental design. Number of rows must match the number of columns of the counts in BSobj. |
formula |
A formula for the linear model. |
smoothing |
A flag to indicate whether to apply smoothing. When true, the counts will be smoothed by a simple moving average method. |
smoothing.span |
The size of smoothing window, in basepairs. Default is 500. |
The lineear model fitting is done through ordinary least square on the arscine transformed methylation percentages. The estimated standard errors are computed with consideration of the data (count) distribution and transformation. This function is extremely efficient. The computation takes around 20 minutes for 4 million CpG sites.
A list with following components
gr |
An object of 'GRanges' for locations of the CpG sites. |
design |
The input data frame for experimental design. |
formula |
The input formula for the model. |
X |
The design matrix used in regression. It is created based on design and formula. |
fit |
The model fitting results. This is a list itself, with three components: 'beta' - the estimated coefficients; 'var.beta' - estimated variance/covariance matrices for beta. 'phi' - estimated beta-binomial dispersion parameters. Note that var.beta for a CpG site should be a ncol(X) x ncol(X) matrix, but is flattend to a vector so that the matrices for all CpG sites can be saved as a matrix. |
Hao Wu<[email protected]>
DMLtest.multiFactor, DMLtest
## Not run: data(RRBS) ## model fitting DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) ## with smoothing: DMLfit.sm = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell, smoothing=TRUE) ## hypothesis testing DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3) ## look at distributions of test statistics and p-values par(mfrow=c(1,2)) hist(DMLtest.cell$stat, 100, main="test statistics") hist(DMLtest.cell$pvals, 100, main="P values") ## End(Not run)
## Not run: data(RRBS) ## model fitting DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) ## with smoothing: DMLfit.sm = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell, smoothing=TRUE) ## hypothesis testing DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3) ## look at distributions of test statistics and p-values par(mfrow=c(1,2)) hist(DMLtest.cell$stat, 100, main="test statistics") hist(DMLtest.cell$pvals, 100, main="P values") ## End(Not run)
This function takes a BSseq object and two group labels, then perform statistical tests for differntial methylation at each CpG site.
DMLtest(BSobj, group1, group2, equal.disp = FALSE, smoothing = FALSE, smoothing.span = 500, ncores)
DMLtest(BSobj, group1, group2, equal.disp = FALSE, smoothing = FALSE, smoothing.span = 500, ncores)
BSobj |
An object of BSseq class for the BS-seq data. |
group1 , group2
|
Vectors of sample names or indexes for the two groups to be tested. See more description in details. |
equal.disp |
A flag to indicate whether the dispersion in two groups are deemed equal. Default is FALSE, and the dispersion shrinkages are performed on two conditions independently. |
smoothing |
A flag to indicate whether to apply smoothing in estimating mean methylation levels. |
smoothing.span |
The size of smoothing window, in basepairs. Default is 500. |
ncores |
Number of CPU cores used in parallel computing. See sections 'Parallelization' for details. |
This is the core function for DML/DMR detection. Tests are performed at each CpG site under the null hypothesis that two groups means are equal. There is an option for applying smoothing or not in estimating mean methylation levels. We recommend to use smoothing=TRUE for whole-genome BS-seq data, and smoothing=FALSE for sparser data such like from RRBS or hydroxyl-methylation data (TAB-seq). If there is not biological replicate, smoothing=TRUE is required. See "Single replicate" section for details.
The BS-seq count data are modeled as Beta-Binomial distribution, where the biological variations are captured by the dispersion parameter. The dispersion parameters are estimated through a shrinakge estimator based on a Bayesian hierarchical model. Then a Wald test is performed at each CpG site.
Due to the differences in coverages, some CpG sites are not covered in both groups, and the test cannot be performed. Those loci will be ignored in test and results will be "NA".
A data frame with each row corresponding to a CpG site. Rows are sorted by chromosome number and genomic coordinates. The columns include:
chr |
Chromosome number. |
pos |
Genomic coordinates. |
mu1 , mu2
|
Mean methylations of two groups. |
diff |
Difference of mean methylations of two groups. diff=mu1-mu2. |
diff.se |
Standard error of the methylation difference. |
stat |
Wald statistics. |
pval |
P-values. This is obtained from normal distribution. |
fdr |
False discovery rate. |
When there is no biological replicate in one or both treatment groups, users can either (1) specify equal.disp=TRUE, which assumes both groups have the same dispersion, then the data from two groups are combined and used as replicates to estimate dispersion; or (2) specify smoothing=TRUE, which uses the smoothed means (methylation levels) to estimate dispersions via a shrinkage estimator. This smoothing procedure uses data from neighboring CpG sites as "pseudo-replicate" for estimating biological variance.
The shrinkage estimation for dispersion is the most computational component in DML testing. We use the 'mcapply' function in 'parallel' package to implement parallelization. Note that older version of DSS (<2.4x) used the 'bplapply' function in 'BiocParallel' package. However, that function somehow has significantly reduced performance in the new release (>1.25), so we switched to mcapply. A drawback is that the progress bar cannot be displayed under the paralelle computing setting.
Users might experience problems on Windows, since the mcapply function relies on forking but Windows does not support forking. Thus, we suggest to use ncores=1 on Windows. For more details, please read the 'parallel' documentation.
When smoothing=FALSE, the mean methylation levels are estimated based on the ratios of methylated and total read counts, and the spatial correlations among nearby CpG sites are ignored. When smoothing=TRUE, smoothing based on moving average or the BSmooth method is used to estimate the mean methylaion level at each site. Moving average is recommended because it is much faster than BSmooth, and the results are reasonable similar in terms of mean estimation, dispersion estimation, and DMR calling results.
Hao Wu <[email protected]>
makeBSseqData, callDML, callDMR
## Not run: require(bsseq) ## first read in methylation data. path <- file.path(system.file(package="DSS"), "extdata") dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE) dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE) dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE) dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE) ## make BSseq objects BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2), c("C1","C2", "N1", "N2") ) ## DML test without smoothing dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2")) head(dmlTest) ## For whole-genome BS-seq data, perform DML test with smoothing require(bsseqData) data(BS.cancer.ex) ## take a small portion of data and test BSobj <- BS.cancer.ex[10000:15000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) head(dmlTest) ## Examples for Parallelization ## use single core - this has not parallelization system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=1)) ## use 4 cores - it's about twice as fast system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=4)) ## End(Not run)
## Not run: require(bsseq) ## first read in methylation data. path <- file.path(system.file(package="DSS"), "extdata") dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE) dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE) dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE) dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE) ## make BSseq objects BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2), c("C1","C2", "N1", "N2") ) ## DML test without smoothing dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2")) head(dmlTest) ## For whole-genome BS-seq data, perform DML test with smoothing require(bsseqData) data(BS.cancer.ex) ## take a small portion of data and test BSobj <- BS.cancer.ex[10000:15000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) head(dmlTest) ## Examples for Parallelization ## use single core - this has not parallelization system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=1)) ## use 4 cores - it's about twice as fast system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=4)) ## End(Not run)
This function takes the linar model fitting results and performs Wald test at each CpG site, then return test statistics, p-values and FDR.
DMLtest.multiFactor(DMLfit, coef = 2, term, Contrast)
DMLtest.multiFactor(DMLfit, coef = 2, term, Contrast)
DMLfit |
Result object returned from 'DMLfit.multiFactor' function. |
coef |
It can be an integer to indicate which coefficient in the linear model is be tested for being zero. Be careful of intercept. If the model contains intercept, coef=2 indicate testing the first factor in the formula. If the model has no intercept, testing first factor should use coef=1. It can also be a character for the terms to be tested. In that case it must match one of the column names in the design matrix. One can look at colnames(DMLfit$X) to obtain the column names. |
term |
The term(s) to be tested, as one or a vector of characters. Can be multiple terms. See 'Hypothesis test' section for details. |
Contrast |
A contrast matrix for hypothesis testing. The number of rows must equal to the number of columns in the design matrix ncol(DMLfit$X). See 'Hypothesis test' section for details. |
A data frame with following columns: chr, pos, stat, pvals, fdr. Each row is for a CpG site. Note that the CpG sites are sorted by chromosome and position.
User can specify one of the following parameter for testing: 'coef', 'term', or 'Contrast'.
When specifying 'coef', it tests *one* parameter in the model, which corresponds to one column in the design matrix. In this case, A Wald test is performed using the estimated coefficient and standard error from 'DMLfit.multiFactor'. P-values are derived from test statistics based on normal distribution.
When specifying 'term', it tests the whole term in the model. If the term is continuous or a categorical variable with only two levels (one degree of freedom), it is equivalent to specifying 'coef' because it tests only one parameter in the model. However, when the term is a categorical variable with more than two levels, it will test multiple parameters at the same time so it's a compound hypothesis test, and a F-test will be performed.
Specifying 'Contrast' matrix provides the most flexible test procedure. It can test any linear combination of the parameter, and F-test will be performed. Let L be the contrast matrix. The hypothesis test performed is H0: L^T * beta = 0. Thus the number of rows in L must equal to the number of items of beta (which is the number of columns in the design matrix).
Using 'term' or 'Contrast' will be slower especially when there are a lot of CpG sites, because the computation cannot be vectorized (each CpG site has a different variance/covariance matrix for the estimated coefficients).
FDR is computed using cannonical Benjamini-Hochberg procedure.
Hao Wu<[email protected]>
DMLfit.multiFactor, DMLtest
## Not run: data(RRBS) ## model fitting DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) ## hypothesis testing - following two lines do the same thing DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") ## this doesn't work DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cell") ## look at distributions of test statistics and p-values par(mfrow=c(1,2)) hist(DMLtest.cell$stat, 100, main="test statistics") hist(DMLtest.cell$pvals, 100, main="P values") ## Using term or Contrast DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell) ## following 4 tests should produce the same results, ## since 'case' only has two levels. ## However the p-values from F-tests (using term or Contrast) are ## slightly different, due to normal approximation in Wald test. test1 = DMLtest.multiFactor(DMLfit, coef=2) test2 = DMLtest.multiFactor(DMLfit, coef="caseSLE") test3 = DMLtest.multiFactor(DMLfit, term="case") Contrast = matrix(c(0,1,0), ncol=1) test4 = DMLtest.multiFactor(DMLfit, Contrast=Contrast) cor(cbind(test1$pval, test2$pval, test3$pval, test4$pval)) ## note the different usage of term and coef. ## 'term' has to be in the formula, whereas 'coef' has to be in colnames ## of the design matrix. DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") DMLtest.cell = DMLtest.multiFactor(DMLfit, term="cell") DMLtest.int = DMLtest.multiFactor(DMLfit, coef="caseSLE:cellrN") DMLtest.int = DMLtest.multiFactor(DMLfit, term="case:cell") ## End(Not run)
## Not run: data(RRBS) ## model fitting DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) ## hypothesis testing - following two lines do the same thing DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") ## this doesn't work DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cell") ## look at distributions of test statistics and p-values par(mfrow=c(1,2)) hist(DMLtest.cell$stat, 100, main="test statistics") hist(DMLtest.cell$pvals, 100, main="P values") ## Using term or Contrast DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell) ## following 4 tests should produce the same results, ## since 'case' only has two levels. ## However the p-values from F-tests (using term or Contrast) are ## slightly different, due to normal approximation in Wald test. test1 = DMLtest.multiFactor(DMLfit, coef=2) test2 = DMLtest.multiFactor(DMLfit, coef="caseSLE") test3 = DMLtest.multiFactor(DMLfit, term="case") Contrast = matrix(c(0,1,0), ncol=1) test4 = DMLtest.multiFactor(DMLfit, Contrast=Contrast) cor(cbind(test1$pval, test2$pval, test3$pval, test4$pval)) ## note the different usage of term and coef. ## 'term' has to be in the formula, whereas 'coef' has to be in colnames ## of the design matrix. DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell) DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN") DMLtest.cell = DMLtest.multiFactor(DMLfit, term="cell") DMLtest.int = DMLtest.multiFactor(DMLfit, coef="caseSLE:cellrN") DMLtest.int = DMLtest.multiFactor(DMLfit, term="case:cell") ## End(Not run)
This is the top level wrapper function for RNA-seq differential expression analysis in a two-group comparison. Users only need to provide the count matrix and a vector for design, and obtain DE test results.
DSS.DE(counts, design)
DSS.DE(counts, design)
counts |
A matrix of integers with rows corresponding to genes and columns for samples. |
design |
A vector representing the treatment groups. It must be a vector of 0 and 1. The length of the vector must match the number of columns of input count matrix. |
A data frame with each row corresponding to a gene. Rows are sorted according to wald test statistics. The columns are:
gene Index |
index for input gene orders, integers from 1 to the number of genes. |
muA |
sample mean (after normalization) for sample A. |
muB |
sample mean (after normalization) for sample B. |
lfc |
log fold change of expressions between two groups. |
difExpr |
differences in expressions between two groups. |
stats |
Wald test statistics. |
pval |
p-values. |
others |
input gene annotations supplied as AnnotatedDataFrame when constructed the SeqCountData object. |
Hao Wu <[email protected]>
counts = matrix(rpois(600, 10), ncol=6) design = c(0,0,0,1,1,1) result = DSS.DE(counts, design) head(result)
counts = matrix(rpois(600, 10), ncol=6) design = c(0,0,0,1,1,1) result = DSS.DE(counts, design) head(result)
This function first estimate tag-specific dipsersions using a method of moment estimator. Then the dipsersions are shrunk based a penalized likelihood approach. The function works for general experimental designs.
## S4 method for signature 'SeqCountSet' estDispersion(seqData, trend=FALSE)
## S4 method for signature 'SeqCountSet' estDispersion(seqData, trend=FALSE)
seqData |
An object of SeqCountSet class. |
trend |
A binary indicator for modeling the dispersion~expression trend. |
The function takes and object of seqCountData class and return the same oject with “dispersion” field filled.
With “trend=TRUE” the dependence of dispersion on mean expressions will be modeled. In that case the shrinkage will be performed conditional on mean expressions.
The function works for multiple factor designs. But option “trend=TRUE” only applicable for single factor experiment.
Hao Wu <[email protected]>
data(seqData) seqData=estNormFactors(seqData) seqData=estDispersion(seqData) head(dispersion(seqData)) ## For multiple factor design data(seqData) Y=exprs(seqData) design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4)) X=as.data.frame(model.matrix(~gender+strain, data=design)) seqData=newSeqCountSet(Y, X) seqData=estDispersion(seqData) head(dispersion(seqData)) ## the hypothesis testing for multifactor experiments can be performed ## using edgeR function, with DSS estimated dispersions ## Not run: library(edgeR) fit.edgeR <- glmFit(Y, X, lib.size=normalizationFactor(seqData), dispersion=dispersion(seqData)) lrt.edgeR <- glmLRT(fit.edgeR, coef=2) head(lrt.edgeR$table) ## End(Not run)
data(seqData) seqData=estNormFactors(seqData) seqData=estDispersion(seqData) head(dispersion(seqData)) ## For multiple factor design data(seqData) Y=exprs(seqData) design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4)) X=as.data.frame(model.matrix(~gender+strain, data=design)) seqData=newSeqCountSet(Y, X) seqData=estDispersion(seqData) head(dispersion(seqData)) ## the hypothesis testing for multifactor experiments can be performed ## using edgeR function, with DSS estimated dispersions ## Not run: library(edgeR) fit.edgeR <- glmFit(Y, X, lib.size=normalizationFactor(seqData), dispersion=dispersion(seqData)) lrt.edgeR <- glmLRT(fit.edgeR, coef=2) head(lrt.edgeR$table) ## End(Not run)
This function estimates normalization factors for the input 'seqCountSet' object and return the same object with normalizationFactor field filled or replaced.
## S4 method for signature 'SeqCountSet' estNormFactors(seqData, method=c("lr", "quantile", "total", "median"))
## S4 method for signature 'SeqCountSet' estNormFactors(seqData, method=c("lr", "quantile", "total", "median"))
seqData |
An object of "SeqCountSet" class. |
method |
Methods to be used in computing normalization factors. Currently available options only include methods to compute normalization factor to adjust for sequencing depths. Available options use (1) "lr": using median of logratio of counts. Similar to the TMM method. (2) "quantile" (default): 75th quantile, (3) "total": total counts, or (4) "median": median counts to constuct the normalization factors. From all methods the normalization factor will be a vector with same length as number of columns for input counts. |
The same "SeqCountSet" object with normalizationFactor field filled or replaced.
Hao Wu <[email protected]>
data(seqData) ## compare different methods seqData=estNormFactors(seqData, "lr") k1=normalizationFactor(seqData) seqData=estNormFactors(seqData, "quantile") k2=normalizationFactor(seqData) seqData=estNormFactors(seqData, "total") k3=normalizationFactor(seqData) cor(cbind(k1,k2,k3)) ## assign size factor normalizationFactor(seqData)=k1 ## or normalization factor can be a matrix dd=exprs(seqData) f=matrix(runif(length(dd), 1,10), nrow=nrow(dd), ncol=ncol(dd)) normalizationFactor(seqData)=f head(normalizationFactor(seqData))
data(seqData) ## compare different methods seqData=estNormFactors(seqData, "lr") k1=normalizationFactor(seqData) seqData=estNormFactors(seqData, "quantile") k2=normalizationFactor(seqData) seqData=estNormFactors(seqData, "total") k3=normalizationFactor(seqData) cor(cbind(k1,k2,k3)) ## assign size factor normalizationFactor(seqData)=k1 ## or normalization factor can be a matrix dd=exprs(seqData) f=matrix(runif(length(dd), 1,10), nrow=nrow(dd), ncol=ncol(dd)) normalizationFactor(seqData)=f head(normalizationFactor(seqData))
This is an utility function to merge BS-seq data from replicated experiment and create an object of BSseq class.
After sequence alignment and proper processing, the BS-seq data can be summarized by following information at each C position (mostly CpG sites, with some CH): chromosome number, genomic coordinate, total number of reads covering the position, and number of reads showing methylation at this position. For replicated samples, the data need to be merged based on the chromosome number and genomic coordinates. This function provide such functionality. It takes replicated data as a list of data frames, merged them, and create a BSseq object.
makeBSseqData(dat, sampleNames)
makeBSseqData(dat, sampleNames)
dat |
A list of multiple data frames from biological replicates. Each element represents data from one replicate. The data frame MUST contain following columns in correct order: (1) Chromosome number; (2) Genomic coordinates; (3) Read coverage of the position from BS-seq data; (4) Number of reads showing methylation of the position. The colnames MUST BE "chr", "pos", "N", "X". |
sampleNames |
A vector of characters for the sample names. The length of the vector should match the length of the input list. |
An object of 'BSseq' class.
Hao Wu <[email protected]>
callDML
require(bsseq) ## first read in methylation data. path <- file.path(system.file(package="DSS"), "extdata") dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE) dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE) dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE) dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE) ## make BSseq objects BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2), c("C1","C2", "N1", "N2") ) BSobj sampleNames(BSobj)
require(bsseq) ## first read in methylation data. path <- file.path(system.file(package="DSS"), "extdata") dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE) dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE) dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE) dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE) ## make BSseq objects BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2), c("C1","C2", "N1", "N2") ) BSobj sampleNames(BSobj)
The normalization factors are used to adjust for technical or biological biases in the sequencing experiments. The factors can either be (1) a vector with length equals to the number of columns of the count data; or (2) a matrix with the same dimension of the count data.
## S4 method for signature 'SeqCountSet' normalizationFactor(object) ## S4 replacement method for signature 'SeqCountSet,numeric' normalizationFactor(object) <- value ## S4 replacement method for signature 'SeqCountSet,matrix' normalizationFactor(object) <- value
## S4 method for signature 'SeqCountSet' normalizationFactor(object) ## S4 replacement method for signature 'SeqCountSet,numeric' normalizationFactor(object) <- value ## S4 replacement method for signature 'SeqCountSet,matrix' normalizationFactor(object) <- value
object |
A SeqCountData object. |
value |
A numeric vector or matrix. If it is a vector it must have length equals to the number of columns of the count data. For matrix it must have the same dimension of the count data. |
The vector normalization factors are used mostly to correct for sequencing depth from different datasets. The matrix factor applies a different normalizing constant for each gene at each sample to adjust for a broader range of artifacts such as GC content.
Hao Wu <[email protected]>
dispersion
data(seqData) ## obtain nomalization factor seqData=estNormFactors(seqData, "quantile") normalizationFactor(seqData) ## assign as vector normalizationFactor(seqData)=rep(1, ncol(exprs(seqData))) ## getan error here ## or assign as a matrix f=matrix(1, nrow=nrow(exprs(seqData)), ncol=ncol(exprs(seqData))) normalizationFactor(seqData)=f
data(seqData) ## obtain nomalization factor seqData=estNormFactors(seqData, "quantile") normalizationFactor(seqData) ## assign as vector normalizationFactor(seqData)=rep(1, ncol(exprs(seqData))) ## getan error here ## or assign as a matrix f=matrix(1, nrow=nrow(exprs(seqData)), ncol=ncol(exprs(seqData))) normalizationFactor(seqData)=f
The dataset contains RRBS data for 5000 CpG sites from 16 samples. The experimental design is provided in the 'design' data frame.
data("RRBS")
data("RRBS")
data(RRBS) RRBS design
data(RRBS) RRBS design
"SeqCountSet"
- container for count data from
sequencing experimentThis class is the main container for storing *RNA-seq* data. It is directly inherited fro 'ExpressionSet' class, with two more fields 'normalizationFactor' for normalization factors and 'dispersion' for gene-wise dispersions.
The class for BS-seq data is *BSseq*, which is imported from bsseq package.
normalizationFactor
:Normalization factor for counts.
dispersion
:Gene-wise dispersions.
experimentData
:See 'ExpressionSet'.
assayData
:See 'ExpressionSet'.
phenoData
:See 'ExpressionSet'.
featureData
:See 'ExpressionSet'.
annotation
:See 'ExpressionSet'.
protocolData
:See 'ExpressionSet'.
Class "ExpressionSet"
, directly.
Class "eSet"
, by class "ExpressionSet", distance 2.
Class "VersionedBiobase"
, by class "ExpressionSet", distance 3.
Class "Versioned"
, by class "ExpressionSet", distance 4.
newSeqCountSet(counts,
designs,normalizationFactor,featureData)
:
Creates a 'SeqCountSet' object.
counts
A matrix of integers with rows corresponding to genes and columns for samples.
designs
A vector or data frame representing experimental design. The length of the vector or number of rows of the data frame must match the number of columns of input counts. This field can be accessed using 'pData' function.
normalizationFactor
A vector or matrix of normalization factors for the counts.
featureData
Additional information for genes as an 'AnnotatedDataFrame' object. This field can be access by using 'featureData' function.
Access and set gene-wise dispersions.
Access and set normalization factors.
This is similar to 'CountDataSet' in DESeq or 'DGEList' in edgeR.
Hao Wu <[email protected]>
dispersion, normalizationFactor
## simulate data from RNA-seq counts=matrix(rpois(600, 10), ncol=6) designs=c(0,0,0,1,1,1) seqData=newSeqCountSet(counts, designs) seqData pData(seqData) head(exprs(seqData)) ## multiple factor designs design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4)) X=model.matrix(~gender+strain, data=design) counts=matrix(rpois(800, 10), ncol=8) seqData=newSeqCountSet(counts, as.data.frame(X)) seqData pData(seqData)
## simulate data from RNA-seq counts=matrix(rpois(600, 10), ncol=6) designs=c(0,0,0,1,1,1) seqData=newSeqCountSet(counts, designs) seqData pData(seqData) head(exprs(seqData)) ## multiple factor designs design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4)) X=model.matrix(~gender+strain, data=design) counts=matrix(rpois(800, 10), ncol=8) seqData=newSeqCountSet(counts, as.data.frame(X)) seqData pData(seqData)
The object is created based on simulation for 1000 genes and two treatment groups with 4 replicates in each group.
data(seqData)
data(seqData)
data(seqData) seqData
data(seqData) seqData
Given one DMR and an BSseq object, this function generate a multiple panel figure, each for a sample, to visualze the counts. There is a bar at each CpG, the gray bar shows the total coverage, and the black bar shows the methylated count.
showOneDMR(OneDMR, BSobj, ext = 500, ylim)
showOneDMR(OneDMR, BSobj, ext = 500, ylim)
OneDMR |
A data frame with one row representing one DMR. It must have chr, start, and end columns. This is typically a row from the result generated from callDMR. |
BSobj |
An object of class BSseq. |
ext |
The amount (in bps) the plotting region should be extended in both directions. |
ylim |
Y-axis limit. |
This function only generates a figure and has no return values.
Hao Wu <[email protected]>
callDMR
## Not run: require(bsseq) require(bsseqData) data(BS.cancer.ex) ## takea small portion of data and test BSobj <- BS.cancer.ex[140000:150000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) ## call DMR based on test results dmrs <- callDMR(dmlTest) ## visualize one DMR showOneDMR(dmrs[1,], BSobj) ## End(Not run)
## Not run: require(bsseq) require(bsseqData) data(BS.cancer.ex) ## takea small portion of data and test BSobj <- BS.cancer.ex[140000:150000,] dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), smoothing=TRUE, smoothing.span=500) ## call DMR based on test results dmrs <- callDMR(dmlTest) ## visualize one DMR showOneDMR(dmrs[1,], BSobj) ## End(Not run)
The counts from two groups are modeled as negative binomial random variables with means and dispersions estimated. Wald statistics will be constructed. P-values will be obtained based on Gaussian assumption.
## S4 method for signature 'SeqCountSet' waldTest(seqData, sampleA, sampleB, equal.var, fdr.method=c("BH", "locfdr"))
## S4 method for signature 'SeqCountSet' waldTest(seqData, sampleA, sampleB, equal.var, fdr.method=c("BH", "locfdr"))
seqData |
An object of SeqCountSet class. |
sampleA |
The sample labels for the first sample to be compared in two-group comparison. |
sampleB |
The sample labels for the second sample to be compared in two-group comparison. |
equal.var |
A boolean to indicate whether to use the same or different means in two groups for computing variances in Wald test. Default is FALSE. |
fdr.method |
Method to compute FDR. Availabile options are "BH" for Benjamini-Hochberg FDR method, or local FDR from "locfdr" package. |
The input seqCountData object Must have normalizationFactor and dispersion fields filled, e.g., estNormFactors and estDispersion need to be called prior to this. With group means and shrunk dispersions ready, the variances for difference in group means will be constructed based on Negative Binomial distribution. P-values will be obtained under the assumption that the Wald test statistics are normally distributed. Genes with 0 counts in both groups will be assigned 0 for test statistics and 1 for p-values.
A data frame with each row corresponding to a gene. Rows are sorted according to wald test statistics. The columns are:
gene Index |
index for input gene orders, integers from 1 to the number of genes. |
muA |
sample mean (after normalization) for sample A. |
muB |
sample mean (after normalization) for sample B. |
lfc |
log fold change of expressions between two groups. |
difExpr |
differences in expressions between two groups. |
stats |
Wald test statistics. |
pval |
p-values. |
fdr |
FDR. |
local.fdr |
Local FDR if the FDR method is "locfdr". |
others |
input gene annotations supplied as AnnotatedDataFrame when constructed the SeqCountData object. |
Hao Wu <[email protected]>
data(seqData) seqData=estNormFactors(seqData) seqData=estDispersion(seqData) result=waldTest(seqData, 0, 1) head(result)
data(seqData) seqData=estNormFactors(seqData) seqData=estDispersion(seqData) result=waldTest(seqData, 0, 1) head(result)