Package 'DSS' reference manual

Title:	Dispersion shrinkage for sequencing data
Description:	DSS is an R library performing differntial analysis for count-based sequencing data. It detectes differentially expressed genes (DEGs) from RNA-seq, and differentially methylated loci or regions (DML/DMRs) from bisulfite sequencing (BS-seq). The core of DSS is a new dispersion shrinkage method for estimating the dispersion parameter from Gamma-Poisson or Beta-Binomial distributions.
Authors:	Hao Wu<[email protected]>, Hao Feng<[email protected]>
Maintainer:	Hao Wu <[email protected]>, Hao Feng <[email protected]>
License:	GPL
Version:	2.55.0
Built:	2025-02-23 06:05:22 UTC
Source:	https://github.com/bioc/DSS

Dispersion shrinkage for sequencing data

Description

DSS is an R library performing the differential expression analysis for RNA-seq count data. Compared with other similar packages (DESeq, edgeR), DSS implements a new dispersion shrinkage method to estimate the gene-specific biological variance. Extensive simulation results showed that DSS performs favorabily compared to DESeq and edgeR when the variation of biological variances is large.

DSS only works for two group comparison at this time. We plan to extend the functionalities and make it work for more general experimental designs in the near future.

Author(s)

Hao Wu <[email protected]>

Function to detect differntially methylated loci (DML) from bisulfite sequencing (BS-seq) data.

Description

This function takes the results from DML testing procedure ('DMLtest' function) and calls DMLs. Regions will CpG sites being statistically significant are deemed as DMLs.

Usage

callDML(DMLresult, delta=0.1, p.threshold=1e-5)
callDML(DMLresult, delta=0.1, p.threshold=1e-5)

Arguments

`DMLresult`	A data frame representing the results for DML detection. This should be the result returned from 'DMLtest' function.
`delta`	A threshold for defining DML. In DML testing procedure, hypothesis test that the two groups means are equalis is conducted at each CpG site. Here if 'delta' is specified, the function will compute the posterior probability that the difference of the means are greater than delta,and then call DML based on that. This only works when the test results are from 'DMLtest', which is for two-group comparison. For general design, this has to be set to 0.
`p.threshold`	When delta is not specified, this is the threshold of p-values for defining DML, e.g. Loci with p-values less than this threshold will be deemed DMLs. When delta is specified, CpG sites with posterior probability greater than 1-p.threshold are deemed DML.

Value

A data frame for DMLs. Each row is for a DML. DMLs are sorted by statistical significance. The columns are

`chr`	Chromosome number.
`pos`	Genomic coordinates.
`mu1`, `mu2`	Mean methylations of two groups.
`diff`	Difference of mean methylations of two groups.
`diff.se`	Standard error of the methylation difference.
`stat`	Wald statistics.
`phi1`, `phi2`	Estimated dispersions in two groups.
`pval`	P-values. This is obtained from normal distribution.
`fdr`	False discovery rate.
`postprob.overThreshold`	The posterior probability of the difference in methylation greater than delta. This columns is only available when delta>0.

Author(s)

Hao Wu <[email protected]>

Examples

## Not run: 
require(bsseq)

## first read in methylation data.
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)

## make BSseq objects
BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2),
  c("C1","C2", "N1", "N2") )

##  DML test
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1","N2"))

## call DML
dmls <- callDML(dmlTest)
head(dmls)

## call DML with a threshold
dmls2 <- callDML(dmlTest, delta=0.1)
head(dmls2)

## For whole-genome BS-seq data, perform DML test with smoothing
require(bsseqData)
data(BS.cancer.ex)
## takea smallportionof data and test
BSobj <- BS.cancer.ex[10000:15000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"),
   smoothing=TRUE, smoothing.span=500)
dmls <- callDML(dmlTest)
head(dmls)

## from multifactor design
data(RRBS)
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")
dml = callDML(DMLtest.cell, p.threshold=0.05) ## this produce a warning
dml = callDML(DMLtest.cell, delta=0, p.threshold=0.05) ## no warning

head(dml)


## End(Not run)
## Not run: 
require(bsseq)

## first read in methylation data.
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)

## make BSseq objects
BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2),
  c("C1","C2", "N1", "N2") )

##  DML test
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1","N2"))

## call DML
dmls <- callDML(dmlTest)
head(dmls)

## call DML with a threshold
dmls2 <- callDML(dmlTest, delta=0.1)
head(dmls2)

## For whole-genome BS-seq data, perform DML test with smoothing
require(bsseqData)
data(BS.cancer.ex)
## takea smallportionof data and test
BSobj <- BS.cancer.ex[10000:15000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"),
   smoothing=TRUE, smoothing.span=500)
dmls <- callDML(dmlTest)
head(dmls)

## from multifactor design
data(RRBS)
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")
dml = callDML(DMLtest.cell, p.threshold=0.05) ## this produce a warning
dml = callDML(DMLtest.cell, delta=0, p.threshold=0.05) ## no warning

head(dml)


## End(Not run)

Function to detect differntially methylated regions (DMR) from bisulfite sequencing (BS-seq) data.

Description

This function takes the results from DML testing procedure ('callDML' function) and calls DMRs. Regions will CpG sites being statistically significant are detected as DMRs. Nearby DMRs are merged into longer ones. Some restrictions including the minimum length, minimum number of CpG sites, etc. are applied.

Usage

callDMR(DMLresult, delta=0, p.threshold=1e-5,
        minlen=50, minCG=3, dis.merge=100, pct.sig=0.5)
callDMR(DMLresult, delta=0, p.threshold=1e-5,
        minlen=50, minCG=3, dis.merge=100, pct.sig=0.5)

Arguments

`DMLresult`	A data frame representing the results for DML detection. This should be the result returned from 'DMLtest' or 'DMLtest.multiFactor' function.
`delta`	A threshold for defining DMR. In DML detection procedure, a hypothesis test that the two groups means are equal is conducted at each CpG site. Here if 'delta' is specified, the function will compute the posterior probability that the difference of the means are greater than delta, and then construct DMR based on that. This only works when the test results are from 'DMLtest', which is for two-group comparison.
`p.threshold`	A threshold of p-values for calling DMR. Loci with p-values less than this threshold will be picked and joint to form the DMRs. See 'details' for more information.
`minlen`	Minimum length (in basepairs) required for DMR. Default is 50 bps.
`minCG`	Minimum number of CpG sites required for DMR. Default is 3.
`dis.merge`	When two DMRs are very close to each other and the distance (in bps) is less than this number, they will be merged into one. Default is 50 bps.
`pct.sig`	In all DMRs, the percentage of CG sites with significant p-values (less than p.threshold) must be greater than this threshold. Default is 0.5. This is mainly used for correcting the effects of merging of nearby DMRs.

Details

The choices of 'delta' and 'p.threshold' are somewhat arbitrary. The default value for p-value threshold for calling DMR is 1e-5. The statistical test on loci level is less powerful when smoothing is NOT applied, so users can consider to use a less stringent criteria, such as 0.001, in order to get satisfactory number of DMRs. This function is reasonably fast since the computationally intesnsive part is in 'DMLtest'. Users can try different p.threshold values to obtain satisfactory results.

'delta' is only supported when the experiment is for two-group comparison. This is because in multifactor design, the estimated coefficients in the regression are based on a GLM framework (loosely speaking), thus they don't have clear meaning of methylation level differences. So when the input DMLresult is from DMLtest.multiFactor, 'delta' cannot be specified.

When specifying a 'delta' value, the posterior probability (pp) of each CpG site being DML is computed. Then the p.threshold is applied on 1-pp, e.g., sites with 1-pp<p.threshold is deemed significant. In this case, the criteria for DMR calling is more stringent and users might consider to use a more liberal p.threshold in order to get more regions.

Value

A data frame for DMRs. Each row is for a DMR. Rows are sorted by "areaStat", which is the sum of test statistics of all CpG sites in the region. The columns are:

`chr`	Chromosome number.
`start`, `end`	Genomic coordinates.
`length`	Length of the DMR, in bps.
`nCG`	Number of CpG sites contained in the DMR.
`meanMethy1`, `meanMethy2`	Average methylation levels in two conditions.
`diff.Methy`	The difference in the methylation levels between two conditions. diff.Methy=meanMethy1-meanMethy2.
`areaStat`	The sum of the test statistics of all CpG sites within the DMR.

Author(s)

Hao Wu <[email protected]>

Examples

## Not run: 
require(bsseq)
require(bsseqData)
data(BS.cancer.ex)

## take a small portion of data and test
BSobj <- BS.cancer.ex[140000:150000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"),
   smoothing=TRUE, smoothing.span=500)

## call DMR based on test results
dmrs <- callDMR(dmlTest)
head(dmrs)

## or one can specify a threshold for difference in methylation level
dmrs2 <- callDMR(dmlTest, delta=0.1)
head(dmrs2)

## visualize one DMR
showOneDMR(dmrs[1,], BSobj)

## from multifactor design - using a loose threshold to demonstrate

data(RRBS)
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")
dmr = callDMR(DMLtest.cell, p.threshold=0.05) 
dmr

## End(Not run)
## Not run: 
require(bsseq)
require(bsseqData)
data(BS.cancer.ex)

## take a small portion of data and test
BSobj <- BS.cancer.ex[140000:150000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"),
   smoothing=TRUE, smoothing.span=500)

## call DMR based on test results
dmrs <- callDMR(dmlTest)
head(dmrs)

## or one can specify a threshold for difference in methylation level
dmrs2 <- callDMR(dmlTest, delta=0.1)
head(dmrs2)

## visualize one DMR
showOneDMR(dmrs[1,], BSobj)

## from multifactor design - using a loose threshold to demonstrate

data(RRBS)
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")
dmr = callDMR(DMLtest.cell, p.threshold=0.05) 
dmr

## End(Not run)

Experimental design for the example RRBS dataset

Description

The RRBS dataset is from 16 samples with two factors (case and cell), each has two levels (so it's a 2x2 design).

Usage

data(RRBS)data(RRBS)

Examples

data(RRBS)
RRBS
design
data(RRBS)
RRBS
design

Acessor functions for the 'dispersion' slot in a SeqCountData object.

Description

Dispersion parameter for a gene represents its coefficient of variation of expressions. It characterizes the biological variations.

Usage

## S4 method for signature 'SeqCountSet'
dispersion(object)
## S4 replacement method for signature 'SeqCountSet,numeric'
dispersion(object) <- value
## S4 method for signature 'SeqCountSet'
dispersion(object)
## S4 replacement method for signature 'SeqCountSet,numeric'
dispersion(object) <- value

Arguments

`object`	A SeqCountData object.
`value`	A numeric vector with the same length as number of genes.

Details

If the counts from biological replicates are modeled as negative binomial distribution, the variance (v) and mean (m) should hold following relationship: v=m+m^2*phi, where phi is the dispersion. Another interpretation is that phi represents the biological variations among replicates when underlying expressions are modeled as a Gamma distribution.

Author(s)

Hao Wu <[email protected]>

Examples

data(seqData)
## obtain
seqData=estNormFactors(seqData, "quantile")
seqData=estDispersion(seqData)
dispersion(seqData)

## assign
dispersion(seqData)=rep(0.1, nrow(exprs(seqData)))

data(seqData)
## obtain
seqData=estNormFactors(seqData, "quantile")
seqData=estDispersion(seqData)
dispersion(seqData)

## assign
dispersion(seqData)=rep(0.1, nrow(exprs(seqData)))

Fit a linear model for BS-seq data from general experimental design

Description

This function takes a BSseq object, a data frame for experimental design and a model formula and then fit a linear model.

Usage

DMLfit.multiFactor(BSobj, design, formula, smoothing=FALSE, smoothing.span=500)
DMLfit.multiFactor(BSobj, design, formula, smoothing=FALSE, smoothing.span=500)

Arguments

`BSobj`	An object of BSseq class for the BS-seq data.
`design`	A data frame for experimental design. Number of rows must match the number of columns of the counts in BSobj.
`formula`	A formula for the linear model.
`smoothing`	A flag to indicate whether to apply smoothing. When true, the counts will be smoothed by a simple moving average method.
`smoothing.span`	The size of smoothing window, in basepairs. Default is 500.

Details

The lineear model fitting is done through ordinary least square on the arscine transformed methylation percentages. The estimated standard errors are computed with consideration of the data (count) distribution and transformation. This function is extremely efficient. The computation takes around 20 minutes for 4 million CpG sites.

Value

A list with following components

`gr`	An object of 'GRanges' for locations of the CpG sites.
`design`	The input data frame for experimental design.
`formula`	The input formula for the model.
`X`	The design matrix used in regression. It is created based on design and formula.
`fit`	The model fitting results. This is a list itself, with three components: 'beta' - the estimated coefficients; 'var.beta' - estimated variance/covariance matrices for beta. 'phi' - estimated beta-binomial dispersion parameters. Note that var.beta for a CpG site should be a ncol(X) x ncol(X) matrix, but is flattend to a vector so that the matrices for all CpG sites can be saved as a matrix.

Author(s)

Hao Wu<[email protected]>

Examples

## Not run: 
data(RRBS)
## model fitting
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)

## with smoothing:
DMLfit.sm = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell, smoothing=TRUE)

## hypothesis testing
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3)

## look at distributions of test statistics and p-values
par(mfrow=c(1,2))
hist(DMLtest.cell$stat, 100, main="test statistics")
hist(DMLtest.cell$pvals, 100, main="P values")

## End(Not run)
## Not run: 
data(RRBS)
## model fitting
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)

## with smoothing:
DMLfit.sm = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell, smoothing=TRUE)

## hypothesis testing
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3)

## look at distributions of test statistics and p-values
par(mfrow=c(1,2))
hist(DMLtest.cell$stat, 100, main="test statistics")
hist(DMLtest.cell$pvals, 100, main="P values")

## End(Not run)

Function to perform statistical test of differntially methylated loci (DML) for two group comparisons of bisulfite sequencing (BS-seq) data.

Description

This function takes a BSseq object and two group labels, then perform statistical tests for differntial methylation at each CpG site.

Usage

DMLtest(BSobj, group1, group2, equal.disp = FALSE, smoothing = FALSE,
        smoothing.span = 500, ncores) 
DMLtest(BSobj, group1, group2, equal.disp = FALSE, smoothing = FALSE,
        smoothing.span = 500, ncores)

Arguments

`BSobj`	An object of BSseq class for the BS-seq data.
`group1`, `group2`	Vectors of sample names or indexes for the two groups to be tested. See more description in details.
`equal.disp`	A flag to indicate whether the dispersion in two groups are deemed equal. Default is FALSE, and the dispersion shrinkages are performed on two conditions independently.
`smoothing`	A flag to indicate whether to apply smoothing in estimating mean methylation levels.
`smoothing.span`	The size of smoothing window, in basepairs. Default is 500.
`ncores`	Number of CPU cores used in parallel computing. See sections 'Parallelization' for details.

Details

This is the core function for DML/DMR detection. Tests are performed at each CpG site under the null hypothesis that two groups means are equal. There is an option for applying smoothing or not in estimating mean methylation levels. We recommend to use smoothing=TRUE for whole-genome BS-seq data, and smoothing=FALSE for sparser data such like from RRBS or hydroxyl-methylation data (TAB-seq). If there is not biological replicate, smoothing=TRUE is required. See "Single replicate" section for details.

The BS-seq count data are modeled as Beta-Binomial distribution, where the biological variations are captured by the dispersion parameter. The dispersion parameters are estimated through a shrinakge estimator based on a Bayesian hierarchical model. Then a Wald test is performed at each CpG site.

Due to the differences in coverages, some CpG sites are not covered in both groups, and the test cannot be performed. Those loci will be ignored in test and results will be "NA".

Value

A data frame with each row corresponding to a CpG site. Rows are sorted by chromosome number and genomic coordinates. The columns include:

`chr`	Chromosome number.
`pos`	Genomic coordinates.
`mu1`, `mu2`	Mean methylations of two groups.
`diff`	Difference of mean methylations of two groups. diff=mu1-mu2.
`diff.se`	Standard error of the methylation difference.
`stat`	Wald statistics.
`pval`	P-values. This is obtained from normal distribution.
`fdr`	False discovery rate.

Single replicate

When there is no biological replicate in one or both treatment groups, users can either (1) specify equal.disp=TRUE, which assumes both groups have the same dispersion, then the data from two groups are combined and used as replicates to estimate dispersion; or (2) specify smoothing=TRUE, which uses the smoothed means (methylation levels) to estimate dispersions via a shrinkage estimator. This smoothing procedure uses data from neighboring CpG sites as "pseudo-replicate" for estimating biological variance.

Parallelization

The shrinkage estimation for dispersion is the most computational component in DML testing. We use the 'mcapply' function in 'parallel' package to implement parallelization. Note that older version of DSS (<2.4x) used the 'bplapply' function in 'BiocParallel' package. However, that function somehow has significantly reduced performance in the new release (>1.25), so we switched to mcapply. A drawback is that the progress bar cannot be displayed under the paralelle computing setting.

Users might experience problems on Windows, since the mcapply function relies on forking but Windows does not support forking. Thus, we suggest to use ncores=1 on Windows. For more details, please read the 'parallel' documentation.

Estimating mean methylation levels

When smoothing=FALSE, the mean methylation levels are estimated based on the ratios of methylated and total read counts, and the spatial correlations among nearby CpG sites are ignored. When smoothing=TRUE, smoothing based on moving average or the BSmooth method is used to estimate the mean methylaion level at each site. Moving average is recommended because it is much faster than BSmooth, and the results are reasonable similar in terms of mean estimation, dispersion estimation, and DMR calling results.

Author(s)

Hao Wu <[email protected]>

Examples

## Not run: 
require(bsseq)

## first read in methylation data.
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)

## make BSseq objects
BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2),
  c("C1","C2", "N1", "N2") )

##  DML test without smoothing 
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"))
head(dmlTest)

## For whole-genome BS-seq data, perform DML test with smoothing
require(bsseqData)
data(BS.cancer.ex)
## take a small portion of data and test
BSobj <- BS.cancer.ex[10000:15000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), 
                   smoothing=TRUE, smoothing.span=500)
head(dmlTest)

## Examples for Parallelization
## use single core - this has not parallelization
system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=1))

## use 4 cores - it's about twice as fast
system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=4))


## End(Not run)
## Not run: 
require(bsseq)

## first read in methylation data.
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)

## make BSseq objects
BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2),
  c("C1","C2", "N1", "N2") )

##  DML test without smoothing 
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"))
head(dmlTest)

## For whole-genome BS-seq data, perform DML test with smoothing
require(bsseqData)
data(BS.cancer.ex)
## take a small portion of data and test
BSobj <- BS.cancer.ex[10000:15000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"), 
                   smoothing=TRUE, smoothing.span=500)
head(dmlTest)

## Examples for Parallelization
## use single core - this has not parallelization
system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=1))

## use 4 cores - it's about twice as fast
system.time(dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), ncores=4))


## End(Not run)

Perform statistical test for BS-seq data from general experimental design

Description

This function takes the linar model fitting results and performs Wald test at each CpG site, then return test statistics, p-values and FDR.

Usage

DMLtest.multiFactor(DMLfit, coef = 2, term, Contrast)
DMLtest.multiFactor(DMLfit, coef = 2, term, Contrast)

Arguments

`DMLfit`	Result object returned from 'DMLfit.multiFactor' function.
`coef`	It can be an integer to indicate which coefficient in the linear model is be tested for being zero. Be careful of intercept. If the model contains intercept, coef=2 indicate testing the first factor in the formula. If the model has no intercept, testing first factor should use coef=1. It can also be a character for the terms to be tested. In that case it must match one of the column names in the design matrix. One can look at colnames(DMLfit$X) to obtain the column names.
`term`	The term(s) to be tested, as one or a vector of characters. Can be multiple terms. See 'Hypothesis test' section for details.
`Contrast`	A contrast matrix for hypothesis testing. The number of rows must equal to the number of columns in the design matrix ncol(DMLfit$X). See 'Hypothesis test' section for details.

Value

A data frame with following columns: chr, pos, stat, pvals, fdr. Each row is for a CpG site. Note that the CpG sites are sorted by chromosome and position.

Hypothesis test

User can specify one of the following parameter for testing: 'coef', 'term', or 'Contrast'.

When specifying 'coef', it tests *one* parameter in the model, which corresponds to one column in the design matrix. In this case, A Wald test is performed using the estimated coefficient and standard error from 'DMLfit.multiFactor'. P-values are derived from test statistics based on normal distribution.

When specifying 'term', it tests the whole term in the model. If the term is continuous or a categorical variable with only two levels (one degree of freedom), it is equivalent to specifying 'coef' because it tests only one parameter in the model. However, when the term is a categorical variable with more than two levels, it will test multiple parameters at the same time so it's a compound hypothesis test, and a F-test will be performed.

Specifying 'Contrast' matrix provides the most flexible test procedure. It can test any linear combination of the parameter, and F-test will be performed. Let L be the contrast matrix. The hypothesis test performed is H0: L^T * beta = 0. Thus the number of rows in L must equal to the number of items of beta (which is the number of columns in the design matrix).

Using 'term' or 'Contrast' will be slower especially when there are a lot of CpG sites, because the computation cannot be vectorized (each CpG site has a different variance/covariance matrix for the estimated coefficients).

FDR is computed using cannonical Benjamini-Hochberg procedure.

Author(s)

Hao Wu<[email protected]>

Examples

## Not run: 
data(RRBS)
## model fitting
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)

## hypothesis testing - following two lines do the same thing
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")

## this doesn't work
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cell")

## look at distributions of test statistics and p-values
par(mfrow=c(1,2))
hist(DMLtest.cell$stat, 100, main="test statistics")
hist(DMLtest.cell$pvals, 100, main="P values")

## Using term or Contrast
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell)

## following 4 tests should produce the same results,
## since 'case' only has two levels.
## However the p-values from F-tests (using term or Contrast) are
## slightly different, due to normal approximation in Wald test.
test1 = DMLtest.multiFactor(DMLfit, coef=2)
test2 = DMLtest.multiFactor(DMLfit, coef="caseSLE")
test3 = DMLtest.multiFactor(DMLfit, term="case")
Contrast = matrix(c(0,1,0), ncol=1)
test4 = DMLtest.multiFactor(DMLfit, Contrast=Contrast)
cor(cbind(test1$pval, test2$pval, test3$pval, test4$pval))


## note the different usage of term and coef.
## 'term' has to be in the formula, whereas 'coef' has to be in colnames
## of the design matrix.
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")
DMLtest.cell = DMLtest.multiFactor(DMLfit, term="cell")
DMLtest.int = DMLtest.multiFactor(DMLfit, coef="caseSLE:cellrN")
DMLtest.int = DMLtest.multiFactor(DMLfit, term="case:cell")


## End(Not run)
## Not run: 
data(RRBS)
## model fitting
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)

## hypothesis testing - following two lines do the same thing
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef=3)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")

## this doesn't work
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cell")

## look at distributions of test statistics and p-values
par(mfrow=c(1,2))
hist(DMLtest.cell$stat, 100, main="test statistics")
hist(DMLtest.cell$pvals, 100, main="P values")

## Using term or Contrast
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell)

## following 4 tests should produce the same results,
## since 'case' only has two levels.
## However the p-values from F-tests (using term or Contrast) are
## slightly different, due to normal approximation in Wald test.
test1 = DMLtest.multiFactor(DMLfit, coef=2)
test2 = DMLtest.multiFactor(DMLfit, coef="caseSLE")
test3 = DMLtest.multiFactor(DMLfit, term="case")
Contrast = matrix(c(0,1,0), ncol=1)
test4 = DMLtest.multiFactor(DMLfit, Contrast=Contrast)
cor(cbind(test1$pval, test2$pval, test3$pval, test4$pval))


## note the different usage of term and coef.
## 'term' has to be in the formula, whereas 'coef' has to be in colnames
## of the design matrix.
DMLfit = DMLfit.multiFactor(RRBS, design, ~case+cell+case:cell)
DMLtest.cell = DMLtest.multiFactor(DMLfit, coef="cellrN")
DMLtest.cell = DMLtest.multiFactor(DMLfit, term="cell")
DMLtest.int = DMLtest.multiFactor(DMLfit, coef="caseSLE:cellrN")
DMLtest.int = DMLtest.multiFactor(DMLfit, term="case:cell")


## End(Not run)

Perform RNA-seq differential expression analysis in two-group comparison

Description

This is the top level wrapper function for RNA-seq differential expression analysis in a two-group comparison. Users only need to provide the count matrix and a vector for design, and obtain DE test results.

Usage

DSS.DE(counts, design)
DSS.DE(counts, design)

Arguments

`counts`	A matrix of integers with rows corresponding to genes and columns for samples.
`design`	A vector representing the treatment groups. It must be a vector of 0 and 1. The length of the vector must match the number of columns of input count matrix.

Value

A data frame with each row corresponding to a gene. Rows are sorted according to wald test statistics. The columns are:

`gene Index`	index for input gene orders, integers from 1 to the number of genes.
`muA`	sample mean (after normalization) for sample A.
`muB`	sample mean (after normalization) for sample B.
`lfc`	log fold change of expressions between two groups.
`difExpr`	differences in expressions between two groups.
`stats`	Wald test statistics.
`pval`	p-values.
`others`	input gene annotations supplied as AnnotatedDataFrame when constructed the SeqCountData object.

Author(s)

Hao Wu <[email protected]>

Examples

counts = matrix(rpois(600, 10), ncol=6)
design = c(0,0,0,1,1,1)
result = DSS.DE(counts, design)
head(result)
counts = matrix(rpois(600, 10), ncol=6)
design = c(0,0,0,1,1,1)
result = DSS.DE(counts, design)
head(result)

Estimate and shrink tag-specific dipsersions

Description

This function first estimate tag-specific dipsersions using a method of moment estimator. Then the dipsersions are shrunk based a penalized likelihood approach. The function works for general experimental designs.

Usage

## S4 method for signature 'SeqCountSet'
estDispersion(seqData, trend=FALSE)
## S4 method for signature 'SeqCountSet'
estDispersion(seqData, trend=FALSE)

Arguments

`seqData`	An object of SeqCountSet class.
`trend`	A binary indicator for modeling the dispersion~expression trend.

Details

The function takes and object of seqCountData class and return the same oject with “dispersion” field filled.

With “trend=TRUE” the dependence of dispersion on mean expressions will be modeled. In that case the shrinkage will be performed conditional on mean expressions.

The function works for multiple factor designs. But option “trend=TRUE” only applicable for single factor experiment.

Author(s)

Hao Wu <[email protected]>

Examples

data(seqData)
seqData=estNormFactors(seqData)
seqData=estDispersion(seqData)
head(dispersion(seqData))

## For multiple factor design
data(seqData)
Y=exprs(seqData)
design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4))
X=as.data.frame(model.matrix(~gender+strain, data=design))
seqData=newSeqCountSet(Y, X)
seqData=estDispersion(seqData)
head(dispersion(seqData))

## the hypothesis testing for multifactor experiments can be performed
## using edgeR function, with DSS estimated dispersions
## Not run: 
library(edgeR)
fit.edgeR <- glmFit(Y, X, lib.size=normalizationFactor(seqData), dispersion=dispersion(seqData))
lrt.edgeR <- glmLRT(fit.edgeR, coef=2)
head(lrt.edgeR$table)

## End(Not run)
data(seqData)
seqData=estNormFactors(seqData)
seqData=estDispersion(seqData)
head(dispersion(seqData))

## For multiple factor design
data(seqData)
Y=exprs(seqData)
design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4))
X=as.data.frame(model.matrix(~gender+strain, data=design))
seqData=newSeqCountSet(Y, X)
seqData=estDispersion(seqData)
head(dispersion(seqData))

## the hypothesis testing for multifactor experiments can be performed
## using edgeR function, with DSS estimated dispersions
## Not run: 
library(edgeR)
fit.edgeR <- glmFit(Y, X, lib.size=normalizationFactor(seqData), dispersion=dispersion(seqData))
lrt.edgeR <- glmLRT(fit.edgeR, coef=2)
head(lrt.edgeR$table)

## End(Not run)

Estimate normalization factors

Description

This function estimates normalization factors for the input 'seqCountSet' object and return the same object with normalizationFactor field filled or replaced.

Usage

## S4 method for signature 'SeqCountSet'
estNormFactors(seqData, method=c("lr", "quantile", "total", "median"))
## S4 method for signature 'SeqCountSet'
estNormFactors(seqData, method=c("lr", "quantile", "total", "median"))

Arguments

seqData

An object of "SeqCountSet" class.

method

Methods to be used in computing normalization factors. Currently available options only include methods to compute normalization factor to adjust for sequencing depths. Available options use (1) "lr": using median of logratio of counts. Similar to the TMM method. (2) "quantile" (default): 75th quantile, (3) "total": total counts, or (4) "median": median counts to constuct the normalization factors. From all methods the normalization factor will be a vector with same length as number of columns for input counts.

Value

The same "SeqCountSet" object with normalizationFactor field filled or replaced.

Author(s)

Hao Wu <[email protected]>

Examples

data(seqData)
## compare different methods
seqData=estNormFactors(seqData, "lr")
k1=normalizationFactor(seqData)
seqData=estNormFactors(seqData, "quantile")
k2=normalizationFactor(seqData)
seqData=estNormFactors(seqData, "total")
k3=normalizationFactor(seqData)
cor(cbind(k1,k2,k3))

## assign size factor
normalizationFactor(seqData)=k1

## or normalization factor can be a matrix
dd=exprs(seqData)
f=matrix(runif(length(dd), 1,10), nrow=nrow(dd), ncol=ncol(dd))
normalizationFactor(seqData)=f
head(normalizationFactor(seqData))
data(seqData)
## compare different methods
seqData=estNormFactors(seqData, "lr")
k1=normalizationFactor(seqData)
seqData=estNormFactors(seqData, "quantile")
k2=normalizationFactor(seqData)
seqData=estNormFactors(seqData, "total")
k3=normalizationFactor(seqData)
cor(cbind(k1,k2,k3))

## assign size factor
normalizationFactor(seqData)=k1

## or normalization factor can be a matrix
dd=exprs(seqData)
f=matrix(runif(length(dd), 1,10), nrow=nrow(dd), ncol=ncol(dd))
normalizationFactor(seqData)=f
head(normalizationFactor(seqData))

Create an object of BSseq class from several data frames.

Description

This is an utility function to merge BS-seq data from replicated experiment and create an object of BSseq class.

After sequence alignment and proper processing, the BS-seq data can be summarized by following information at each C position (mostly CpG sites, with some CH): chromosome number, genomic coordinate, total number of reads covering the position, and number of reads showing methylation at this position. For replicated samples, the data need to be merged based on the chromosome number and genomic coordinates. This function provide such functionality. It takes replicated data as a list of data frames, merged them, and create a BSseq object.

Usage

makeBSseqData(dat, sampleNames)
makeBSseqData(dat, sampleNames)

Arguments

`dat`	A list of multiple data frames from biological replicates. Each element represents data from one replicate. The data frame MUST contain following columns in correct order: (1) Chromosome number; (2) Genomic coordinates; (3) Read coverage of the position from BS-seq data; (4) Number of reads showing methylation of the position. The colnames MUST BE "chr", "pos", "N", "X".
`sampleNames`	A vector of characters for the sample names. The length of the vector should match the length of the input list.

Value

An object of 'BSseq' class.

Author(s)

Hao Wu <[email protected]>

Examples

require(bsseq)

## first read in methylation data.
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)

## make BSseq objects
BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2),
  c("C1","C2", "N1", "N2") )

BSobj
sampleNames(BSobj)
require(bsseq)

## first read in methylation data.
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)

## make BSseq objects
BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2),
  c("C1","C2", "N1", "N2") )

BSobj
sampleNames(BSobj)

Accessor functions for the 'normalizationFactor' slot in a SeqCountData object.

Description

The normalization factors are used to adjust for technical or biological biases in the sequencing experiments. The factors can either be (1) a vector with length equals to the number of columns of the count data; or (2) a matrix with the same dimension of the count data.

Usage

## S4 method for signature 'SeqCountSet'
normalizationFactor(object)
## S4 replacement method for signature 'SeqCountSet,numeric'
normalizationFactor(object) <- value
## S4 replacement method for signature 'SeqCountSet,matrix'
normalizationFactor(object) <- value
## S4 method for signature 'SeqCountSet'
normalizationFactor(object)
## S4 replacement method for signature 'SeqCountSet,numeric'
normalizationFactor(object) <- value
## S4 replacement method for signature 'SeqCountSet,matrix'
normalizationFactor(object) <- value

Arguments

`object`	A SeqCountData object.
`value`	A numeric vector or matrix. If it is a vector it must have length equals to the number of columns of the count data. For matrix it must have the same dimension of the count data.

Details

The vector normalization factors are used mostly to correct for sequencing depth from different datasets. The matrix factor applies a different normalizing constant for each gene at each sample to adjust for a broader range of artifacts such as GC content.

Author(s)

Hao Wu <[email protected]>

Examples

data(seqData)
## obtain nomalization factor
seqData=estNormFactors(seqData, "quantile")
normalizationFactor(seqData)

## assign as vector
normalizationFactor(seqData)=rep(1, ncol(exprs(seqData))) ## getan error here

## or assign as a matrix
f=matrix(1, nrow=nrow(exprs(seqData)), ncol=ncol(exprs(seqData)))
normalizationFactor(seqData)=f
data(seqData)
## obtain nomalization factor
seqData=estNormFactors(seqData, "quantile")
normalizationFactor(seqData)

## assign as vector
normalizationFactor(seqData)=rep(1, ncol(exprs(seqData))) ## getan error here

## or assign as a matrix
f=matrix(1, nrow=nrow(exprs(seqData)), ncol=ncol(exprs(seqData)))
normalizationFactor(seqData)=f

An example dataset for multiple factor design

Description

The dataset contains RRBS data for 5000 CpG sites from 16 samples. The experimental design is provided in the 'design' data frame.

Usage

data("RRBS")data("RRBS")

Examples

data(RRBS)
RRBS
design
data(RRBS)
RRBS
design

Class `"SeqCountSet"` - container for count data from sequencing experiment

Description

This class is the main container for storing *RNA-seq* data. It is directly inherited fro 'ExpressionSet' class, with two more fields 'normalizationFactor' for normalization factors and 'dispersion' for gene-wise dispersions.

The class for BS-seq data is *BSseq*, which is imported from bsseq package.

Slots

normalizationFactor:: Normalization factor for counts.
dispersion:: Gene-wise dispersions.
experimentData:: See 'ExpressionSet'.
assayData:: See 'ExpressionSet'.
phenoData:: See 'ExpressionSet'.
featureData:: See 'ExpressionSet'.
annotation:: See 'ExpressionSet'.
protocolData:: See 'ExpressionSet'.

Extends

Class "ExpressionSet", directly. Class "eSet", by class "ExpressionSet", distance 2. Class "VersionedBiobase", by class "ExpressionSet", distance 3. Class "Versioned", by class "ExpressionSet", distance 4.

Constructor

newSeqCountSet(counts, designs,normalizationFactor,featureData): Creates a 'SeqCountSet' object.

counts: A matrix of integers with rows corresponding to genes and columns for samples.
designs: A vector or data frame representing experimental design. The length of the vector or number of rows of the data frame must match the number of columns of input counts. This field can be accessed using 'pData' function.
normalizationFactor: A vector or matrix of normalization factors for the counts.
featureData: Additional information for genes as an 'AnnotatedDataFrame' object. This field can be access by using 'featureData' function.

Methods

dispersion, dispersion<- :: Access and set gene-wise dispersions.
normalizationFactor, normalizationFactor<- :: Access and set normalization factors.

Note

This is similar to 'CountDataSet' in DESeq or 'DGEList' in edgeR.

Author(s)

Hao Wu <[email protected]>

Examples

## simulate data from RNA-seq
counts=matrix(rpois(600, 10), ncol=6)
designs=c(0,0,0,1,1,1)
seqData=newSeqCountSet(counts, designs)
seqData
pData(seqData)
head(exprs(seqData))

## multiple factor designs
design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4))
X=model.matrix(~gender+strain, data=design)
counts=matrix(rpois(800, 10), ncol=8)
seqData=newSeqCountSet(counts, as.data.frame(X))
seqData
pData(seqData)

## simulate data from RNA-seq
counts=matrix(rpois(600, 10), ncol=6)
designs=c(0,0,0,1,1,1)
seqData=newSeqCountSet(counts, designs)
seqData
pData(seqData)
head(exprs(seqData))

## multiple factor designs
design=data.frame(gender=c(rep("M",4), rep("F",4)), strain=rep(c("WT", "Mutant"),4))
X=model.matrix(~gender+strain, data=design)
counts=matrix(rpois(800, 10), ncol=8)
seqData=newSeqCountSet(counts, as.data.frame(X))
seqData
pData(seqData)

A simulated 'SeqCountData' object.

Description

The object is created based on simulation for 1000 genes and two treatment groups with 4 replicates in each group.

Usage

data(seqData)data(seqData)

Examples

data(seqData)
seqData
data(seqData)
seqData

Visualze the count data for one DMR

Description

Given one DMR and an BSseq object, this function generate a multiple panel figure, each for a sample, to visualze the counts. There is a bar at each CpG, the gray bar shows the total coverage, and the black bar shows the methylated count.

Usage

showOneDMR(OneDMR, BSobj, ext = 500, ylim)
showOneDMR(OneDMR, BSobj, ext = 500, ylim)

Arguments

`OneDMR`	A data frame with one row representing one DMR. It must have chr, start, and end columns. This is typically a row from the result generated from callDMR.
`BSobj`	An object of class BSseq.
`ext`	The amount (in bps) the plotting region should be extended in both directions.
`ylim`	Y-axis limit.

Value

This function only generates a figure and has no return values.

Author(s)

Hao Wu <[email protected]>

Examples

## Not run: 
require(bsseq)
require(bsseqData)
data(BS.cancer.ex)

## takea small portion of data and test
BSobj <- BS.cancer.ex[140000:150000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"),
   smoothing=TRUE, smoothing.span=500)

## call DMR based on test results
dmrs <- callDMR(dmlTest)

## visualize one DMR
showOneDMR(dmrs[1,], BSobj)

## End(Not run)

## Not run: 
require(bsseq)
require(bsseqData)
data(BS.cancer.ex)

## takea small portion of data and test
BSobj <- BS.cancer.ex[140000:150000,]
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2", "C3"), group2=c("N1","N2","N3"),
   smoothing=TRUE, smoothing.span=500)

## call DMR based on test results
dmrs <- callDMR(dmlTest)

## visualize one DMR
showOneDMR(dmrs[1,], BSobj)

## End(Not run)

Perform gene-wise Wald test for two group comparisons for sequencing count data.

Description

The counts from two groups are modeled as negative binomial random variables with means and dispersions estimated. Wald statistics will be constructed. P-values will be obtained based on Gaussian assumption.

Usage

## S4 method for signature 'SeqCountSet'
waldTest(seqData, sampleA, sampleB, equal.var, fdr.method=c("BH", "locfdr"))
## S4 method for signature 'SeqCountSet'
waldTest(seqData, sampleA, sampleB, equal.var, fdr.method=c("BH", "locfdr"))

Arguments

`seqData`	An object of SeqCountSet class.
`sampleA`	The sample labels for the first sample to be compared in two-group comparison.
`sampleB`	The sample labels for the second sample to be compared in two-group comparison.
`equal.var`	A boolean to indicate whether to use the same or different means in two groups for computing variances in Wald test. Default is FALSE.
`fdr.method`	Method to compute FDR. Availabile options are "BH" for Benjamini-Hochberg FDR method, or local FDR from "locfdr" package.

Details

The input seqCountData object Must have normalizationFactor and dispersion fields filled, e.g., estNormFactors and estDispersion need to be called prior to this. With group means and shrunk dispersions ready, the variances for difference in group means will be constructed based on Negative Binomial distribution. P-values will be obtained under the assumption that the Wald test statistics are normally distributed. Genes with 0 counts in both groups will be assigned 0 for test statistics and 1 for p-values.

Value

A data frame with each row corresponding to a gene. Rows are sorted according to wald test statistics. The columns are:

`gene Index`	index for input gene orders, integers from 1 to the number of genes.
`muA`	sample mean (after normalization) for sample A.
`muB`	sample mean (after normalization) for sample B.
`lfc`	log fold change of expressions between two groups.
`difExpr`	differences in expressions between two groups.
`stats`	Wald test statistics.
`pval`	p-values.
`fdr`	FDR.
`local.fdr`	Local FDR if the FDR method is "locfdr".
`others`	input gene annotations supplied as AnnotatedDataFrame when constructed the SeqCountData object.

Author(s)

Hao Wu <[email protected]>

Examples

data(seqData)
seqData=estNormFactors(seqData)
seqData=estDispersion(seqData)
result=waldTest(seqData, 0, 1)
head(result)
data(seqData)
seqData=estNormFactors(seqData)
seqData=estDispersion(seqData)
result=waldTest(seqData, 0, 1)
head(result)

Package 'DSS'

Help Index

Dispersion shrinkage for sequencing data

Description

Author(s)

Function to detect differntially methylated loci (DML) from bisulfite sequencing (BS-seq) data.

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Function to detect differntially methylated regions (DMR) from bisulfite sequencing (BS-seq) data.

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Experimental design for the example RRBS dataset

Description

Usage

Examples

Acessor functions for the 'dispersion' slot in a SeqCountData object.

Description

Usage

Arguments

Details

Author(s)

See Also

Examples

Fit a linear model for BS-seq data from general experimental design

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Function to perform statistical test of differntially methylated loci (DML) for two group comparisons of bisulfite sequencing (BS-seq) data.

Description

Usage

Arguments

Details

Value

Single replicate

Parallelization

Estimating mean methylation levels

Author(s)

See Also

Examples

Perform statistical test for BS-seq data from general experimental design

Description

Usage

Arguments

Value

Hypothesis test

Author(s)

See Also

Examples

Perform RNA-seq differential expression analysis in two-group comparison

Description

Usage

Arguments

Value

Author(s)

Examples

Estimate and shrink tag-specific dipsersions

Description

Usage

Arguments

Details

Author(s)

Examples

Estimate normalization factors

Description

Class `"SeqCountSet"` - container for count data from sequencing experiment