Package 'bsseq'

Title:	Analyze, manage and store whole-genome methylation data
Description:	A collection of tools for analyzing and visualizing whole-genome methylation data from sequencing. This includes whole-genome bisulfite sequencing and Oxford nanopore data.
Authors:	Kasper Daniel Hansen [aut, cre] , Peter Hickey [aut] , Hervé Pagès [ctb], Aaron Lun [ctb]
Maintainer:	Kasper Daniel Hansen <[email protected]>
License:	Artistic-2.0
Version:	1.43.1
Built:	2025-02-15 03:59:17 UTC
Source:	https://github.com/bioc/bsseq

Help Index

Whole-genome bisulfite sequencing for chromosome 22 from Lister et al.
BSmooth, smoothing bisulfite sequence data
Compute t-statistics based on smoothed whole-genome bisulfite sequencing data.
The constructor function for BSseq objects.
Class BSseq
Class BSseqStat
Class BSseqTstat
Converts a data frame to a GRanges.
Finds differentially methylated regions for whole genome bisulfite sequencing data.
Find methylation loci in a genome
Compute Fisher-tests for a BSseq object
Classes FWIRanges and FWGRanges
Obtain coverage for BSseq objects.
Obtain methylation estimates for BSseq objects.
Obtain statistics from a BSseqTstat object
Binomial and poisson goodness of fit statistics for BSSeq objects
Class hasGRanges
Plotting BSmooth methylation estimates
Parsing output from the Bismark alignment suite.
Construct BSseq objects from nanopore BED files
Construct BSseq objects from nanopore BED files

Whole-genome bisulfite sequencing for chromosome 22 from Lister et al.

Description

This dataset represents chromosome 22 from the IMR90 cell line sequenced in Lister et al. Only CpG methylation are included (there were very few non-CpG loci). The two samples are two different extractions from the same cell line (ie. technical replicates), and are pooled in the analysis in the original paper.

Usage

data(BS.chr22)data(BS.chr22)

Format

An object of class BSseq.

Details

All coordinates are in hg18.

Source

Obtained from http://neomorph.salk.edu/human_methylome/data.html specifically the files mc_imr90_r1.tar.gz and mc_imr90_r2.tar.gz. A script which downloads these files and constructs the BS.chr22 object may be found in ‘inst/scripts/get_BS.chr22.R’, see the example.

References

R Lister et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature (2009) 462, 315-322.

Examples

data(BS.chr22)
BS.chr22

script <- system.file("scripts", "get_BS.chr22.R", package = "bsseq")
script
readLines(script)
data(BS.chr22)
BS.chr22

script <- system.file("scripts", "get_BS.chr22.R", package = "bsseq")
script
readLines(script)

BSmooth, smoothing bisulfite sequence data

Description

This implements the BSmooth algorithm for estimating methylation levels from bisulfite sequencing data.

Usage

BSmooth(BSseq,
        ns = 70,
        h = 1000,
        maxGap = 10^8,
        keep.se = FALSE,
        BPPARAM = bpparam(),
        chunkdim = NULL,
        level = NULL,
        verbose = getOption("verbose"))
BSmooth(BSseq,
        ns = 70,
        h = 1000,
        maxGap = 10^8,
        keep.se = FALSE,
        BPPARAM = bpparam(),
        chunkdim = NULL,
        level = NULL,
        verbose = getOption("verbose"))

Arguments

`BSseq`	An object of class `BSseq`.
`ns`	The minimum number of methylation loci in a smoothing window.
`h`	The minimum smoothing window, in bases.
`maxGap`	The maximum gap between two methylation loci, before the smoothing is broken across the gap. The default smoothes each chromosome separately.
`keep.se`	Should the estimated standard errors from the smoothing algorithm be kept. This will make the return object roughly 30 percent bigger and is currently not be used for anything in bsseq.
`BPPARAM`	An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. Currently supported are SerialParam (Unix, Mac, Windows), MulticoreParam (Unix and Mac), SnowParam (Unix, Mac, and Windows, limited to single-machine clusters), and BatchtoolsParam (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details.
`chunkdim`	Only applicable if `BACKEND == "HDF5Array"`. The dimensions of the chunks to use for writing the data to disk. By default, `getHDF5DumpChunkDim()` using the dimensions of the returned BSseq object will be used. See `?{getHDF5DumpChunkDim}` for more information.
`level`	Only applicable if `BACKEND == "HDF5Array"`. The compression level to use for writing the data to disk. By default, `getHDF5DumpCompressionLevel()` will be used. See `?getHDF5DumpCompressionLevel` for more information.
`verbose`	A `logical(1)` indicating whether progress messages should be printed (default `TRUE`).

Details

ns and h are passed to the locfit function. The bandwidth used is the maximum (in genomic distance) of the h and a width big enough to contain ns number of methylation loci.

Value

An object of class BSseq, containing coefficients used to fit smoothed methylation values and optionally standard errors for these.

Realization backends

The BSmooth() function creates a new assay to store the coefficients used to construct the smoothed methylation estimates ((coef). An additional assay is also created if keep.se == TRUE (se.coef).

The choice of realization backend controls whether these assay(s) are stored in-memory as an ordinary matrix or on-disk as a HDF5Array, for example.

The choice of realization backend is controlled by the BACKEND argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend().

BSmooth supports the following realization backends:

NULL (in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array (on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix from HDF5Array.

Please note that certain combinations of realization backend and parallelization backend are currently not supported. For example, the HDF5Array realization backend is currently only compatible when used with a single-machine parallelization backend (i.e. it is not compatible with a SnowParam that specifies an ad hoc cluster of multiple machines). BSmooth() will issue an error when given such incompatible realization and parallelization backends. Furthermore, to avoid memory usage blow-ups, BSmooth() will issue an error if an in-memory realization backend is used when smoothing a disk-backed BSseq object.

Additional arguments related to the realization backend can be passed via the ... argument. These arguments must be named and are passed to the relevant RealizationSink constructor. For example, the ... argument can be used to specify the path to the HDF5 file to be used by BSmooth(). Please see the examples at the bottom of the page.

Parallelization and progress monitoring

BSmooth() now uses the BiocParallel package to implement parallelization. This brings some notable improvements:

Smoothed results can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a SnowParam object as the value of BPPARAM.
Detailed and extensive job logging facilities.

All parallelization options are controlled via the BPPARAM argument. In general, we recommend that users combine multicore (single-machine) parallelization with an on-disk realization backend (see section, 'Realization backend'). For Unix and Mac users, this means using a MulticoreParam. For Windows users, this means using a single-machine SnowParam. Please consult the BiocParallel documentation to take full advantage of the more advanced features.

Deprecated arguments

parallelBy, mc.cores, and mc.preschedule are deprecated and will be removed in subsequent releases of bsseq. These arguments were necessary when BSmooth() used the parallel package to implement parallelization, but this functionality is superseded by the aforementioned use of BiocParallel. We recommend that users who previously relied on these arguments switch to BPPARAM = MulticoreParam(workers = mc.cores, progressbar = TRUE).

Progress monitoring

A useful feature of BiocParallel are progress bars to monitor the status of long-running jobs, such as BSmooth(). Progress bars are controlled via the progressbar argument in the BiocParallelParam constructor. Progress bars replace the use of the deprecated verbose argument to print out information on the status of BSmooth().

BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.

Author(s)

Method and original implementation by Kasper Daniel Hansen [email protected]. Updated implementation to support disk-backed BSseq objects and more general parallelization by Peter Francis Hickey.

References

KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.

Examples

## Not run: 
  # Run BSmooth() on a matrix-backed BSseq object using an in-memory realization
  # backend with serial evaluation.
  data(BS.chr22)
  # This is a matrix-backed BSseq object.
  sapply(assays(BS.chr22, withDimnames = FALSE), class)
  BS.fit <- BSmooth(BS.chr22, BPPARAM = SerialParam(progressbar = TRUE))
  # The new 'coef' assay is an ordinary matrix.
  sapply(assays(BS.fit, withDimnames = FALSE), class)
  BS.fit

  # Run BSmooth() on a disk-backed BSseq object using the HDF5Array realization
  # backend (with data written to the file 'BSmooth_example.h5') with
  # multi-core parallel evaluation.
  BS.chr22 <- realize(BS.chr22, "HDF5Array")
  # This is a disk-backed BSseq object.
  sapply(assays(BS.chr22, withDimnames = FALSE), class)
  BS.fit <- BSmooth(BS.chr22,
              BPPARAM = MulticoreParam(workers = 2, progressbar = TRUE),
              BACKEND = "HDF5Array",
              filepath = "BSmooth_example.h5")
  # The new 'coef' assay is an HDF5Matrix.
  sapply(assays(BS.fit, withDimnames = FALSE), class)
  BS.fit
  # The new 'coef' assay is in the HDF5 file 'BSmooth_example.h5' (in the
  # current working directory).
  sapply(assays(BS.fit, withDimnames = FALSE), path)

## End(Not run)
## Not run: 
  # Run BSmooth() on a matrix-backed BSseq object using an in-memory realization
  # backend with serial evaluation.
  data(BS.chr22)
  # This is a matrix-backed BSseq object.
  sapply(assays(BS.chr22, withDimnames = FALSE), class)
  BS.fit <- BSmooth(BS.chr22, BPPARAM = SerialParam(progressbar = TRUE))
  # The new 'coef' assay is an ordinary matrix.
  sapply(assays(BS.fit, withDimnames = FALSE), class)
  BS.fit

  # Run BSmooth() on a disk-backed BSseq object using the HDF5Array realization
  # backend (with data written to the file 'BSmooth_example.h5') with
  # multi-core parallel evaluation.
  BS.chr22 <- realize(BS.chr22, "HDF5Array")
  # This is a disk-backed BSseq object.
  sapply(assays(BS.chr22, withDimnames = FALSE), class)
  BS.fit <- BSmooth(BS.chr22,
              BPPARAM = MulticoreParam(workers = 2, progressbar = TRUE),
              BACKEND = "HDF5Array",
              filepath = "BSmooth_example.h5")
  # The new 'coef' assay is an HDF5Matrix.
  sapply(assays(BS.fit, withDimnames = FALSE), class)
  BS.fit
  # The new 'coef' assay is in the HDF5 file 'BSmooth_example.h5' (in the
  # current working directory).
  sapply(assays(BS.fit, withDimnames = FALSE), path)

## End(Not run)

Compute t-statistics based on smoothed whole-genome bisulfite sequencing data.

Description

Compute t-statistics based on smoothed whole-genome bisulfite sequencing data.

Usage

BSmooth.tstat(BSseq, group1, group2,
  estimate.var = c("same", "paired", "group2"), local.correct = TRUE,
  maxGap = NULL, qSd = 0.75, k = 21, mc.cores = 1, verbose = TRUE)
BSmooth.tstat(BSseq, group1, group2,
  estimate.var = c("same", "paired", "group2"), local.correct = TRUE,
  maxGap = NULL, qSd = 0.75, k = 21, mc.cores = 1, verbose = TRUE)

Arguments

`BSseq`	An object of class `BSseq`.
`group1`	A vector of sample names or indexes for the ‘treatment’ group.
`group2`	A vector of sample names or indexes for the ‘control’ group.
`estimate.var`	How is the variance estimated, see details.
`local.correct`	A logical; should local correction be used, see details.
`maxGap`	A scalar greater than 0, see details.
`qSd`	A scalar between 0 and 1, see details.
`k`	A positive scalar, see details.
`mc.cores`	The number of cores used. Note that setting `mc.cores` to a value greater than 1 is not supported on MS Windows, see the help page for `mclapply`.
`verbose`	Should the function be verbose?

Details

T-statistics are formed as the difference in means between group 1 and group 2 divided by an estimate of the standard deviation, assuming that the variance in the two groups are the same (same), that we have paired samples (paired) or only estimate the variance based on group 2 (group2). The standard deviation estimates are then smoothed (using a running mean with a width of k) and thresholded (using qSd which sets the minimum standard deviation to be the qSd-quantile). Optionally, the t-statistics are corrected for low-frequency patterns.

It is sometimes useful to use local.correct even if no large scale changes in methylation have been found; it makes the marginal distribution of the t-statistics more symmetric.

Additional details in the reference.

Value

An object of class BSseqTstat.

Author(s)

Kasper Daniel Hansen [email protected]

References

KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.

Examples

if(require(bsseqData)) {
  data(keepLoci.ex)
  data(BS.cancer.ex.fit)
  BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
  ## Remember to subset the BSseq object, see vignette for explanation
  BS.tstat <- BSmooth.tstat(BS.cancer.ex.fit[keepLoci.ex,],
                            group1 = c("C1", "C2", "C3"),
                            group2 = c("N1", "N2", "N3"),
                            estimate.var = "group2")
  BS.tstat
  ## This object is also stored as BS.cancer.ex.tstat in the
  ## bsseqData package

  #---------------------------------------------------------------------------
  # An example using a HDF5Array-backed BSseq object
  #
  
    library(HDF5Array)
    # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
    hdf5_BS.cancer.ex.fit <- saveHDF5SummarizedExperiment(
        x = BS.cancer.ex.fit[keepLoci.ex, ],
        dir = tempfile())
    hdf5_BS.tstat <- BSmooth.tstat(hdf5_BS.cancer.ex.fit,
                                   group1 = c("C1", "C2", "C3"),
                                   group2 = c("N1", "N2", "N3"),
                                   estimate.var = "group2")
  
}
if(require(bsseqData)) {
  data(keepLoci.ex)
  data(BS.cancer.ex.fit)
  BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
  ## Remember to subset the BSseq object, see vignette for explanation
  BS.tstat <- BSmooth.tstat(BS.cancer.ex.fit[keepLoci.ex,],
                            group1 = c("C1", "C2", "C3"),
                            group2 = c("N1", "N2", "N3"),
                            estimate.var = "group2")
  BS.tstat
  ## This object is also stored as BS.cancer.ex.tstat in the
  ## bsseqData package

  #---------------------------------------------------------------------------
  # An example using a HDF5Array-backed BSseq object
  #
  
    library(HDF5Array)
    # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
    hdf5_BS.cancer.ex.fit <- saveHDF5SummarizedExperiment(
        x = BS.cancer.ex.fit[keepLoci.ex, ],
        dir = tempfile())
    hdf5_BS.tstat <- BSmooth.tstat(hdf5_BS.cancer.ex.fit,
                                   group1 = c("C1", "C2", "C3"),
                                   group2 = c("N1", "N2", "N3"),
                                   estimate.var = "group2")
  
}

The constructor function for BSseq objects.

Description

The constructor function for BSseq objects.

Usage

BSseq(M = NULL, Cov = NULL, Filtered = NULL, coef = NULL, se.coef = NULL,
  trans = NULL, parameters = NULL, pData = NULL, gr = NULL,
  pos = NULL, chr = NULL, sampleNames = NULL, rmZeroCov = FALSE)
BSseq(M = NULL, Cov = NULL, Filtered = NULL, coef = NULL, se.coef = NULL,
  trans = NULL, parameters = NULL, pData = NULL, gr = NULL,
  pos = NULL, chr = NULL, sampleNames = NULL, rmZeroCov = FALSE)

Arguments

`M`	A matrix-like object of methylation evidence (see 'Details' below).
`Cov`	A matrix-like object of coverage (see 'Details' below)).
`Filtered`	A matrix-like object of ambiguous modification bases obtained from modbam2bed.
`coef`	A matrix-like object of smoothing estimates (see 'Details' below).
`se.coef`	A matrix-like object of smoothing standard errors (see 'Details' below).
`trans`	A smoothing transformation.
`parameters`	A list of smoothing parameters.
`pData`	An `data.frame` or DataFrame.
`sampleNames`	A vector of sample names.
`gr`	An object of type GRanges.
`pos`	A vector of locations.
`chr`	A vector of chromosomes.
`rmZeroCov`	Should genomic locations with zero coverage in all samples be removed.

Details

The 'M', 'Cov', 'coef', and 'se.coef' matrix-like objects will be coerced to DelayedMatrix objects; see ?DelayedArray::DelayedMatrix for the full list of supported matrix-like objects. We recommend using matrix objects for in-memory storage of data and HDF5Matrix for on-disk storage of data.

Genomic locations are specified either through gr or through chr and pos but not both. There should be the same number of genomic locations as there are rows in the M and Cov matrix.

The argument rmZeroCov may be useful in order to reduce the size of the return object be removing methylation loci with zero coverage.

In case one or more methylation loci appears multiple times, the M and Cov matrices are summed over rows linked to the same methylation loci. See the example below.

Users should never have to specify coef, se.coef, trans, and parameters, this is for internal use (they are added by BSmooth).

phenoData is a way to specify pheno data (as known from the ExpressionSet and eSet classes), at a minimum sampleNames should be given (if they are not present, the function uses col.names(M)).

Value

An object of class BSseq.

Author(s)

Kasper Daniel Hansen [email protected]

Examples

  M <- matrix(0:8, 3, 3)
  Cov <- matrix(1:9, 3, 3)
  BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1,2,3),
               M = M, Cov = Cov, sampleNames = c("A","B", "C"))
  BS1
  BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1,1,1),
               M = M, Cov = Cov, sampleNames = c("A","B", "C"))
  BS2

  #-------------------------------------------------------------------------------
  # An example using a HDF5Array-backed BSseq object
  #

  library(HDF5Array)
  hdf5_M <- realize(M, "HDF5Array")
  hdf5_Cov <- realize(Cov, "HDF5Array")
  hdf5_BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"),
                    pos = c(1, 2, 3),
                    M = hdf5_M,
                    Cov = hdf5_Cov,
                    sampleNames = c("A", "B", "C"))
  hdf5_BS1
  hdf5_BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"),
                    pos = c(1, 1, 1),
                    M = hdf5_M,
                    Cov = hdf5_Cov,
                    sampleNames = c("A", "B", "C"))
  hdf5_BS2
M <- matrix(0:8, 3, 3)
  Cov <- matrix(1:9, 3, 3)
  BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1,2,3),
               M = M, Cov = Cov, sampleNames = c("A","B", "C"))
  BS1
  BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1,1,1),
               M = M, Cov = Cov, sampleNames = c("A","B", "C"))
  BS2

  #-------------------------------------------------------------------------------
  # An example using a HDF5Array-backed BSseq object
  #

  library(HDF5Array)
  hdf5_M <- realize(M, "HDF5Array")
  hdf5_Cov <- realize(Cov, "HDF5Array")
  hdf5_BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"),
                    pos = c(1, 2, 3),
                    M = hdf5_M,
                    Cov = hdf5_Cov,
                    sampleNames = c("A", "B", "C"))
  hdf5_BS1
  hdf5_BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"),
                    pos = c(1, 1, 1),
                    M = hdf5_M,
                    Cov = hdf5_Cov,
                    sampleNames = c("A", "B", "C"))
  hdf5_BS2

Class BSseq

Description

A class for representing whole-genome or capture bisulfite sequencing data.

Objects from the Class

An object from the class links together several pieces of information. (1) genomic locations stored as a GRanges object, a location by samples matrix of M values, a location by samples matrix of Cov (coverage) values, a location by samples matrix of Filtered (ambiguous modification status) values, and phenodata information.In addition, there are slots for representing smoothed data. This class is an extension of RangedSummarizedExperiment.

Slots

trans:: Object of class function. This function transforms the coef slot from the scale the smoothing was done to the 0-1 methylation scale.
parameters:: Object of class list. A list of parameters representing for example how the data was smoothed.

Methods

[: signature(x = "BSseq"): Subsetting by location (using integer indices) or sample (using integers or sample names).
length: Unlike for RangedSummarizedExperiment, length() is the number of methylation loci (equal to length(granges(x))).
sampleNames,sampleNames<-: Sample names and its replacement function for the object. This is an alias for colnames.
pData,pData<-: Obtain and replace the pData slot of the phenoData slot. This is an alias for colData.
show: The show method.
combine: This function combines two BSSeq objects. The genomic locations of the new object is the union of the genomic locations of the individual objects. In addition, the methylation data matrices are placed next to each other (as appropriate wrt. the new genomic locations) and zeros are entered into the matrices as needed.

Utilities

This class extends RangedSummarizedExperiment and therefore inherits a number of useful GRanges methods that operate on the rowRanges slot, used for accessing and setting the genomic locations and also do subsetByOverlaps.

There are a number of almost methods-like functions for operating on objects of class BSseq, including getBSseq, collapseBSseq, and orderBSseq. They are detailed below.

collapseBSseq(BSseq, columns)

is used to collapse an object of class BSseq. By collapsing we simply mean that certain columns (samples) are merge together by summing up the methylation evidence and coverage. This is a useful function if you start by reading in a dataset based on say flowcells and you (after QC) want to simply add a number of flowcells into one sample. The argument columns specify which samples are to be merged, in the following way: it is a character vector of new sample names, and the names of the column vector indicates which samples in the BSseq object are to be collapsed. If columns have the same length as the number of rows of BSseq (and has no names) it is assumed that the ordering corresponds to the sample ordering in BSseq.

orderBSseq(BSseq, seqOrder = NULL)

simply orders an object of class BSseq according to (increasing) genomic locations. The seqOrder vector is a character vector of seqnames(BSseq) describing the order of the chromosomes. This is useful for ordering chr1 before chr10.

chrSelectBSseq(BSseq, seqnames = NULL, order = FALSE)

subsets and optionally reorders an object of class BSseq. The seqnames vector is a character vector of seqnames(BSseq) describing which chromosomes should be retained. If order is TRUE, the chromosomes are also re-ordered using orderBSseq.

getBSseq(BSseq, type = c("Cov", "M", "gr", "coef", "se.coef", "trans", "parameters"))

is a general accessor: is used to obtain a specific slot of an object of class BSseq. It is primarily intended for internal use in the package, for users we recommend granges to get the genomic locations, getCoverage to get the coverage slots and getMeth to get the smoothed values (if they exist).

hasBeenSmoothed(BSseq)

This function returns a logical depending on whether or not the BSseq object has been smoothed using BSmooth.

combineList(list, BACKEND = NULL)

This function function is a faster way of using combine on multiple BSseq objects. The input is a list, with each component an object of class BSseq. The (slower) alternative is to use Reduce(combine, list).

The BACKEND argument determines which backend should be used for the 'M' and 'Cov' matrices and, if present, the 'coef' and 'se.coef' matrices (the latter two can only be combined if all objects have the same rowRanges). The default, BACKEND = NULL, corresponds to using matrix objects. See ?DelayedArray::setAutoRealizationBackend for alternative backends.

strandCollapse(BSseq, shift = TRUE)

This function operates on a BSseq objects which has stranded loci (i.e. loci where the strand is one of ‘+’ or ‘-’). It will collapse the methylation and coverage information across the two strands, unstranding the loci in the process and potentially re-ordering them.

The argument shift indicates whether the positions for the loci on the reverse strand should be shifted one (i.e. the positions for these loci are the positions of the ‘G’ in the ‘CpG’; this is the case for Bismark output for example).

Coercion

Package versions 1.5.2 and 1.11.1 introduced a new version of representing ‘BSseq’ objects. You can update old serialized (saved) objects by invoking x <- updateObject(x).

Assays

This class overrides the default implementation of assays to make it faster. Per default, no names are added to the returned data matrices.

Assay names can conveniently be obtained by the function assayNames(x)

Author(s)

Kasper Daniel Hansen [email protected]

Examples

M <- matrix(1:9, 3,3)
colnames(M) <- c("A1", "A2", "A3")
BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2)
chrSelectBSseq(BStest, seqnames = "chr1", order = TRUE)
collapseBSseq(BStest, group = c("A", "A", "B"))

#-------------------------------------------------------------------------------
# An example using a HDF5-backed BSseq object
#
hdf5_BStest <- realize(BStest, "HDF5Array")
chrSelectBSseq(hdf5_BStest, seqnames = "chr1", order = TRUE)
collapseBSseq(
    BSseq = hdf5_BStest,
    group = c("A", "A", "B"),
    BACKEND = "HDF5Array",
    type = "integer")
M <- matrix(1:9, 3,3)
colnames(M) <- c("A1", "A2", "A3")
BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2)
chrSelectBSseq(BStest, seqnames = "chr1", order = TRUE)
collapseBSseq(BStest, group = c("A", "A", "B"))

#-------------------------------------------------------------------------------
# An example using a HDF5-backed BSseq object
#
hdf5_BStest <- realize(BStest, "HDF5Array")
chrSelectBSseq(hdf5_BStest, seqnames = "chr1", order = TRUE)
collapseBSseq(
    BSseq = hdf5_BStest,
    group = c("A", "A", "B"),
    BACKEND = "HDF5Array",
    type = "integer")

Class BSseqStat

Description

A class for representing statistics for smoothed whole-genome bisulfite sequencing data.

Usage

    BSseqStat(gr = NULL, stats = NULL, parameters = NULL)
BSseqStat(gr = NULL, stats = NULL, parameters = NULL)

Arguments

`gr`	The genomic locations as an object of class `GRanges`.
`stats`	The statistics, as a list of matrix-like objects (see 'Details' below).
`parameters`	A list of parameters.

Details

The matrix-like elements of the list in the 'stats' slot will be coerced to DelayedMatrix objects; see ?DelayedArray::DelayedMatrix for the full list of supported matrix-like objects. We recommend using matrix objects for in-memory storage of data and HDF5Matrix for on-disk storage of data.

Objects from the Class

Objects can be created by calls of the form BSseqStat(...). However, usually objects are returned by BSmooth.fstat(...) and not constructed by the user.

Slots

stats:: This is a list of DelayedMatrix objects with list elements representing various statistics for methylation loci along the genome.
parameters:: Object of class list. A list of parameters representing how the statistics were computed.
gr:: Object of class GRanges giving genomic locations.

Extends

Class hasGRanges, directly.

Methods

[: The subsetting operator; one may only subset in one dimension, corresponding to methylation loci.
show: The show method.

Utilities

This class extends hasGRanges and therefore inherits a number of useful GRanges methods that operate on the gr slot, used for accessing and setting the genomic locations and also do subsetByOverlaps.

Coercion

Package version 1.11.1 introduced a new version of representing ‘BSseqStat’ objects. You can update old serialized (saved) objects by invoking x <- updateObject(x).

Author(s)

Kasper Daniel Hansen [email protected]

Class BSseqTstat

Description

A class for representing t-statistics for smoothed whole-genome bisulfite sequencing data.

Usage

  BSseqTstat(gr = NULL, stats = NULL, parameters = NULL)
BSseqTstat(gr = NULL, stats = NULL, parameters = NULL)

Arguments

`gr`	The genomic locations as an object of class `GRanges`.
`stats`	The statistics, as a matrix-like object (see 'Details' below).
`parameters`	A list of parameters.

Details

The 'stats' matrix-like object will be coerced to a DelayedMatrix object; see ?DelayedArray::DelayedMatrix for the full list of supported matrix-like objects. We recommend using matrix objects for in-memory storage of data and HDF5Matrix for on-disk storage of data.

Objects from the Class

Objects can be created by calls of the form BSseqTstat(...). However, usually objects are returned by BSmooth.tstat(...) and not constructed by the user..

Slots

stats:: This is a DelayedMatrix object with columns representing various statistics for methylation loci along the genome.
parameters:: Object of class list. A list of parameters representing how the t-statistics were computed.
gr:: Object of class GRanges giving genomic locations.

Extends

Class hasGRanges, directly.

Methods

[: The subsetting operator; one may only subset in one dimension, corresponding to methylation loci.
show: The show method.

Utilities

Coercion

Package version 1.11.1 introduced a new version of representing ‘BSseqTstat’ objects. You can update old serialized (saved) objects by invoking x <- updateObject(x).

Author(s)

Kasper Daniel Hansen [email protected]

Examples

if(require(bsseqData)) {
    data(BS.cancer.ex.tstat)
    dmrFinder(BS.cancer.ex.tstat)
}
if(require(bsseqData)) {
    data(BS.cancer.ex.tstat)
    dmrFinder(BS.cancer.ex.tstat)
}

Converts a data frame to a GRanges.

Description

Converting a data.frame to a GRanges object. The data.frame needs columns like chr, start and end (strand is optional). Additional columns may be kept in the GRanges object.

Usage

data.frame2GRanges(df, keepColumns = FALSE, ignoreStrand = FALSE)
data.frame2GRanges(df, keepColumns = FALSE, ignoreStrand = FALSE)

Arguments

`df`	A `data.frame` with columns `chr` or `seqnames`, `start`, `end` and optionally a `strand column`.
`keepColumns`	In case `df` has additional columns, should these columns be stored as metadata columns on the return GRanges or should they be discarded.
`ignoreStrand`	In case `df` has a `strand` column, should this column be ignored.

Value

An object of class GRanges

Note

In case df has rownames, they will be used as names for the return object.

Author(s)

Kasper Daniel Hansen [email protected]

Examples

df <- data.frame(chr = "chr1", start = 1:3, end = 2:4,
                 strand = c("+","-","+"))
data.frame2GRanges(df, ignoreStrand = TRUE)
df <- data.frame(chr = "chr1", start = 1:3, end = 2:4,
                 strand = c("+","-","+"))
data.frame2GRanges(df, ignoreStrand = TRUE)

Finds differentially methylated regions for whole genome bisulfite sequencing data.

Description

Finds differentially methylated regions for whole genome bisulfite sequencing data. Essentially identifies regions of the genome where all methylation loci have an associated t-statistic that is beyond a (low, high) cutoff.

Usage

dmrFinder(bstat, cutoff = NULL, qcutoff = c(0.025, 0.975),
  maxGap=300, stat = "tstat.corrected", verbose = TRUE)
dmrFinder(bstat, cutoff = NULL, qcutoff = c(0.025, 0.975),
  maxGap=300, stat = "tstat.corrected", verbose = TRUE)

Arguments

`bstat`	An object of class `BSseqStat` or `BSseqTstat`.
`cutoff`	The cutoff of the t-statistics. This should be a vector of length two giving the (low, high) cutoff. If `NULL`, see `qcutoff`.
`qcutoff`	In case `cutoff` is `NULL`, compute the cutoff using these quantiles of the t-statistic.
`maxGap`	If two methylation loci are separated by this distance, break a possible DMR. This guarantees that the return DMRs have CpGs that are this distance from each other.
`stat`	Which statistic should be used?
`verbose`	Should the function be verbose?

Details

The workhorse function is BSmooth.tstat which sets up a t-statistic for a comparison between two groups.

Note that post-processing of these DMRs are likely to be necessary, filtering for example for length (or number of loci).

Value

A data.frame with columns

`start`, `end`, `width`, `chr`	genomic locations and width.
`n`	The number of methylation loci.
`invdensity`	Average length per loci.
`group1.mean`	The mean methylation level across samples and loci in 'group1'.
`group2.mean`	The mean methylation level across samples and loci in 'group2'.
`meanDiff`	The mean difference in methylation level; the difference between `group1.mean` and `group2.mean`.
`idxStart`, `idxEnd`, `cluster`	Internal use.
`areaStat`	The 'area' of the t-statistic; equal to the sum of the t-statistics for the individual methylation loci.
`direction`	either ‘hyper’ or ‘hypo’.
`areaStat.corrected`	Only present if `column = "tstat.corrected"`, contains the area of the corrected t-statistics.

Author(s)

Kasper Daniel Hansen [email protected].

References

KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.

Examples

if(require(bsseqData)) {
  dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6), verbose = TRUE)
  dmrs <- subset(dmrs0, abs(meanDiff) > 0.1 & n >= 3)
}
if(require(bsseqData)) {
  dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6), verbose = TRUE)
  dmrs <- subset(dmrs0, abs(meanDiff) > 0.1 & n >= 3)
}

Find methylation loci in a genome

Description

This is a convenience function to find methylation loci, such as CpGs, in a reference genome. The result is useful as the value of the loci argument for read.bismark().

Usage

findLoci(pattern,
         subject,
         include = seqlevels(subject),
         strand = c("*", "+", "-"),
         fixed = "subject",
         resize = TRUE)
findLoci(pattern,
         subject,
         include = seqlevels(subject),
         strand = c("*", "+", "-"),
         fixed = "subject",
         resize = TRUE)

Arguments

`pattern`	A string specifying the pattern to search for, e.g. `"CG"`. Can contain IUPAC ambiguity codes, e.g., `"CH"`.
`subject`	A string containing a file path to the genome sequence, in FASTA or 2bit format, to be searched. Alternatively, a BSgenome or DNAStringSet object storing the genome sequence to be searched.
`include`	A character vector indicating the seqlevels of `subject` to be used.
`strand`	A character scaler specifying the strand of `subject` to be used. If `strand = "*"`, then both the positive (`strand = "+"`) and negative (`strand = "-"` strands will be searched.) It is assumed that `subject` contains the sequence with respect to the `+` strand.
`fixed`	If `"subject"` (the default), IUPAC ambiguity codes in the pattern only are interpreted as wildcards, e.g., a `pattern` containing `CH` matches a `subject` containing `CA` but not vice versa. See ?Biostrings::`lowlevel-matching` for more information
`resize`	A logical scalar specifying whether the ranges should be shifted to have width 1 and anchored by the start of the locus, e.g., resize a CpG dinucleotide to give the co-ordinates of the cytosine.

Details

This function provides a convenience wrapper for finding methylation loci in a genome, based on running vmatchPattern(). Users requiring finer-grained control should directly use the vmatchPattern() function and coerce the result to a GRanges object.

Value

A GRanges object storing the found loci.

Author(s)

Peter F. Hickey

Examples

  library(Biostrings)
  # Find CpG dinucleotides on the both strands of an artificial sequence
  my_seq <- DNAStringSet("ATCAGTCGC")
  names(my_seq) <- "test"
  findLoci(pattern = "CG", subject = my_seq)
  # Find CHG trinucleotides on the both strands of an artificial sequence
  findLoci(pattern = "CHG", subject = my_seq)

  # Find CpG dinucleotides on the + strand of chr17 from the human genome (hg38)
  if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) {
    findLoci(
        pattern = "CG",
        subject = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
        include = "chr17",
        strand = "+")
  }
library(Biostrings)
  # Find CpG dinucleotides on the both strands of an artificial sequence
  my_seq <- DNAStringSet("ATCAGTCGC")
  names(my_seq) <- "test"
  findLoci(pattern = "CG", subject = my_seq)
  # Find CHG trinucleotides on the both strands of an artificial sequence
  findLoci(pattern = "CHG", subject = my_seq)

  # Find CpG dinucleotides on the + strand of chr17 from the human genome (hg38)
  if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) {
    findLoci(
        pattern = "CG",
        subject = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
        include = "chr17",
        strand = "+")
  }

Compute Fisher-tests for a BSseq object

Description

A function to compute Fisher-tests for an object of class BSseq.

Usage

fisherTests(BSseq, group1, group2, lookup = NULL,
  returnLookup = TRUE, mc.cores = 1, verbose = TRUE)
fisherTests(BSseq, group1, group2, lookup = NULL,
  returnLookup = TRUE, mc.cores = 1, verbose = TRUE)

Arguments

`BSseq`	An object of class `BSseq`.
`group1`	A vector of sample names or indexes for the ‘treatment’ group.
`group2`	A vector of sample names or indexes for the ‘control’ group.
`lookup`	A ‘lookup’ object, see details.
`returnLookup`	Should a ‘lookup’ object be returned, see details.
`mc.cores`	The number of cores used. Note that setting `mc.cores` to a value greater than 1 is not supported on MS Windows, see the help page for `mclapply`.
`verbose`	Should the function be verbose.

Details

This function computes row-wise Fisher's exact tests. It uses an internal lookup table so rows which forms equivalent 2x2 tables are group together and only a single test is computed. If returnLookup is TRUE the return object contains the lookup table which may be feed to another call to the function using the lookup argument.

If group1, group2 designates more than 1 sample, the samples are added together before testing.

This function can use multiple cores on the same computer.

This test cannot model biological variability.

Value

if returnLookup is TRUE, a list with components results and lookup, otherwise just the results component. The results (component) is a matrix with the same number of rows as the BSseq argument and 2 columns p.value (the unadjusted p-values) and log2OR (log2 transformation of the odds ratio).

Author(s)

Kasper Daniel Hansen [email protected]

Examples

M <- matrix(1:9, 3,3)
colnames(M) <- c("A1", "A2", "A3")
BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"),
                M = M, Cov = M + 2)
results <- fisherTests(BStest, group1 = "A1", group2 = "A2",
                       returnLookup = TRUE)
results

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#
library(HDF5Array)
# See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
hdf5_BStest <- saveHDF5SummarizedExperiment(x = BStest,
                                            dir = tempfile())
results <- fisherTests(hdf5_BStest,
                       group1 = "A1",
                       group2 = "A2",
                       returnLookup = TRUE)
results
M <- matrix(1:9, 3,3)
colnames(M) <- c("A1", "A2", "A3")
BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"),
                M = M, Cov = M + 2)
results <- fisherTests(BStest, group1 = "A1", group2 = "A2",
                       returnLookup = TRUE)
results

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#
library(HDF5Array)
# See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
hdf5_BStest <- saveHDF5SummarizedExperiment(x = BStest,
                                            dir = tempfile())
results <- fisherTests(hdf5_BStest,
                       group1 = "A1",
                       group2 = "A2",
                       returnLookup = TRUE)
results

Classes FWIRanges and FWGRanges

Description

Classes for fixed-width IRanges and GRanges, ie. objects where all ranges have the same width. The intention is for these classes to be added to GenomicRanges. Documented here temporarily.

Details

See description. Otherwise works like IRanges and GRanges, except there are many unimplemented methods.

This is really a private class, with private methods, but R's NAMESPACE handling means they get unintentionally exported. Hence this documentation.

Examples

showClass("FWIRanges")
showClass("FWIRanges")

Obtain coverage for BSseq objects.

Description

Obtain coverage for BSseq objects.

Usage

getCoverage(BSseq, regions = NULL, type = c("Cov", "M"),
  what = c("perBase", "perRegionAverage", "perRegionTotal"),
  withDimnames = TRUE)
getCoverage(BSseq, regions = NULL, type = c("Cov", "M"),
  what = c("perBase", "perRegionAverage", "perRegionTotal"),
  withDimnames = TRUE)

Arguments

`BSseq`	An object of class `BSseq`.
`regions`	An optional `data.frame` or `GenomicRanges` object specifying a number of genomic regions.
`type`	This returns either coverage or the total evidence for methylation at the loci.
`what`	The type of return object, see details.
`withDimnames`	A `logical(1)`, indicating whether dimnames should be applied to extracted coverage elements. Setting `withDimnames = FALSE` increases the speed and memory efficiency with which coverage is extracted.

Value

NOTE: The return type of getCoverage varies depending on its arguments.

If regions are not specified (regions = NULL) a DelayedMatrix object (what = "perBase") is returned. This will either contain the per-base coverage, the average coverage, or the genome total coverage (depending on value of what).

If what = "perBase" and regions are specified, a list is returned. Each element of the list is a DelayedMatrix object corresponding to the genomic loci inside the region. It is conceptually the same as splitting the coverage by region.

If what = "perRegionAverage" or what = "perRegionTotal" and regions are specified the return value is a DelayedMatrix object. Each row of the DelayedMatrix corresponds to a region and contains either the average coverage or the total coverage in the region.

Author(s)

Kasper Daniel Hansen [email protected].

Examples

data(BS.chr22)
head(getCoverage(BS.chr22, type = "M"))
reg <- GRanges(seqnames = c("chr22", "chr22"),
  ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7)))
getCoverage(BS.chr22, regions = reg, what = "perRegionAverage")
  cList <- getCoverage(BS.chr22, regions = reg)
length(cList)
head(cList[[1]])

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#

library(HDF5Array)
# See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22,
                                              dir = tempfile())
head(getCoverage(hdf5_BS.chr22, type = "M"))
reg <- GRanges(seqnames = c("chr22", "chr22"),
               ranges = IRanges(start = c(1, 2 * 10 ^ 7),
               end = c(2 * 10 ^ 7 + 1, 4 * 10 ^ 7)))
getCoverage(hdf5_BS.chr22, regions = reg, what = "perRegionAverage")
hdf5_cList <- getCoverage(hdf5_BS.chr22, regions = reg)
length(hdf5_cList)
head(hdf5_cList[[1]])
data(BS.chr22)
head(getCoverage(BS.chr22, type = "M"))
reg <- GRanges(seqnames = c("chr22", "chr22"),
  ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7)))
getCoverage(BS.chr22, regions = reg, what = "perRegionAverage")
  cList <- getCoverage(BS.chr22, regions = reg)
length(cList)
head(cList[[1]])

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#

library(HDF5Array)
# See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22,
                                              dir = tempfile())
head(getCoverage(hdf5_BS.chr22, type = "M"))
reg <- GRanges(seqnames = c("chr22", "chr22"),
               ranges = IRanges(start = c(1, 2 * 10 ^ 7),
               end = c(2 * 10 ^ 7 + 1, 4 * 10 ^ 7)))
getCoverage(hdf5_BS.chr22, regions = reg, what = "perRegionAverage")
hdf5_cList <- getCoverage(hdf5_BS.chr22, regions = reg)
length(hdf5_cList)
head(hdf5_cList[[1]])

Obtain methylation estimates for BSseq objects.

Description

Obtain methylation estimates for BSseq objects, both smoothed and raw.

Usage

getMeth(BSseq, regions = NULL, type = c("smooth", "raw"),
  what = c("perBase", "perRegion"), confint = FALSE, alpha = 0.95,
  withDimnames = TRUE)
getMeth(BSseq, regions = NULL, type = c("smooth", "raw"),
  what = c("perBase", "perRegion"), confint = FALSE, alpha = 0.95,
  withDimnames = TRUE)

Arguments

`BSseq`	An object of class `BSseq`.
`regions`	An optional `data.frame` or `GenomicRanges` object specifying a number of genomic regions.
`type`	This returns either smoothed or raw estimates of the methylation level.
`what`	The type of return object, see details.
`confint`	Should a confidence interval be return for the methylation estimates (see below). This is only supported if `what` is equal to `perBase`.
`alpha`	alpha value for the confidence interval.
`withDimnames`	A `logical(1)`, indicating whether dimnames should be applied to extracted coverage elements. Setting `withDimnames = FALSE` increases the speed and memory efficiency with which coverage is extracted.

Value

NOTE: The return type of getMeth varies depending on its arguments.

If region = NULL the what argument is ignored. This is also the only situation in which confint = TRUE is supported. The return value is either a DelayedMatrix (confint = FALSE or a list with three DelayedMatrix components confint = TRUE (meth, upper and lower), giving the methylation estimates and (optionally) confidence intervals.

Confidence intervals for type = "smooth" is based on standard errors from the smoothing algorithm (if present). Otherwise it is based on pointwise confidence intervals for binomial distributions described in Agresti (see below), specifically the score confidence interval.

If regions are specified, what = "perBase" will make the function return a list, each element of the list being a DelayedMatrix corresponding to a genomic region (and each row of the DelayedMatrix being a loci inside the region). If what = "perRegion" the function returns a DelayedMatrix, with each row corresponding to a region and containing the average methylation level in that region.

Note

A BSseq object needs to be smoothed by the function BSmooth in order to support type = "smooth".

Author(s)

Kasper Daniel Hansen [email protected].

References

A Agresti and B Coull. Approximate Is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician (1998) 52:119-126.

Examples

data(BS.chr22)
head(getMeth(BS.chr22, type = "raw"))
reg <- GRanges(seqnames = c("chr22", "chr22"),
  ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7)))
head(getMeth(BS.chr22, regions = reg, type = "raw", what = "perBase"))

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#

library(HDF5Array)
# See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22,
                                              dir = tempfile())
head(getMeth(hdf5_BS.chr22, type = "raw"))
head(getMeth(hdf5_BS.chr22, regions = reg, type = "raw", what = "perBase"))
data(BS.chr22)
head(getMeth(BS.chr22, type = "raw"))
reg <- GRanges(seqnames = c("chr22", "chr22"),
  ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7)))
head(getMeth(BS.chr22, regions = reg, type = "raw", what = "perBase"))

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#

library(HDF5Array)
# See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22,
                                              dir = tempfile())
head(getMeth(hdf5_BS.chr22, type = "raw"))
head(getMeth(hdf5_BS.chr22, regions = reg, type = "raw", what = "perBase"))

Obtain statistics from a BSseqTstat object

Description

Essentially an accessor function for the statistics of a BSseqTstat object.

Usage

getStats(bstat, regions = NULL, ...)
getStats(bstat, regions = NULL, ...)

Arguments

`bstat`	An object of class `BSseqStat` or `BSseqTstat`.
`regions`	An optional `data.frame` or `GenomicRanges` object specifying a number of genomic regions.
`...`	Additional arguments passed to the different backends based on the class of `bstat`; see Details.

Details

Additional argument when the bstat object is of class BSseqTstat:

stat: Which statistics column should be obtained.

Value

An object of class data.frame possible restricted to the regions specified.

Author(s)

Kasper Daniel Hansen [email protected]

Examples

if(require(bsseqData)) {
  data(BS.cancer.ex.tstat)
  head(getStats(BS.cancer.ex.tstat))
  reg <- GRanges(seqnames = c("chr22", "chr22"),
     ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7)))
  head(getStats(BS.cancer.ex.tstat, regions = reg))
}
if(require(bsseqData)) {
  data(BS.cancer.ex.tstat)
  head(getStats(BS.cancer.ex.tstat))
  reg <- GRanges(seqnames = c("chr22", "chr22"),
     ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7)))
  head(getStats(BS.cancer.ex.tstat, regions = reg))
}

Binomial and poisson goodness of fit statistics for BSSeq objects

Description

Binomial and poisson goodness of fit statistics for BSSeq objects, including plotting capability.

Usage

poissonGoodnessOfFit(BSseq, nQuantiles = 10^5)
binomialGoodnessOfFit(BSseq, method = c("MLE"), nQuantiles = 10^5)
## S3 method for class 'chisqGoodnessOfFit'
print(x, ...)
## S3 method for class 'chisqGoodnessOfFit'
plot(x, type = c("chisq", "pvalue"), plotCol = TRUE, qqline = TRUE,
  pch = 16, cex = 0.75, ...)
poissonGoodnessOfFit(BSseq, nQuantiles = 10^5)
binomialGoodnessOfFit(BSseq, method = c("MLE"), nQuantiles = 10^5)
## S3 method for class 'chisqGoodnessOfFit'
print(x, ...)
## S3 method for class 'chisqGoodnessOfFit'
plot(x, type = c("chisq", "pvalue"), plotCol = TRUE, qqline = TRUE,
  pch = 16, cex = 0.75, ...)

Arguments

`BSseq`	An object of class `BSseq`.
`x`	A chisqGoodnessOfFit object (as produced by `poissonGoodnessOfFit` or `binomialGoodnessOfFit`).
`nQuantiles`	The number of (evenly-spaced) quantiles stored in the return object.
`method`	How is the parameter estimated.
`type`	Are the chisq or the p-values being plotted.
`plotCol`	Should the extreme quantiles be colored.
`qqline`	Add a `qqline`.
`pch`, `cex`	Plotting symbols and size.
`...`	Additional arguments being passed to `qqplot` (for `plot`) or ignored (for `print`).

Details

These functions compute and plot goodness of fit statistics for BSseq objects. For each methylation loci, the Poisson goodness of fit statistic tests whether the coverage (at that loci) is independent and identically Poisson distributed across the samples. In a similar fashion, the Binomial goodness of fit statistic tests whether the number of reads supporting methylation are independent and identically binomial distributed across samples (with different size parameters given by the coverage vector).

These functions do not handle NA values.

Value

The plotting method is invoked for its side effect. Both poissonGoodnessOfFit and binomialGoodnessOfFit returns an object of class chisqGoodnessOfFit which is a list with components

`chisq`	a vector of Chisq values.
`quantiles`	a vector of quantiles (of the chisq values).
`df`	degress of freedom

Author(s)

Kasper Daniel Hansen [email protected]

Examples

if(require(bsseqData)) {
  data(BS.cancer.ex)
  BS.cancer.ex <- updateObject(BS.cancer.ex)
  gof <- poissonGoodnessOfFit(BS.cancer.ex)
  plot(gof)

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#

    library(HDF5Array)

    # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
    hdf5_BS.cancer.ex <- saveHDF5SummarizedExperiment(x = BS.cancer.ex,
                                                      dir = tempfile())
    hdf5_gof <- poissonGoodnessOfFit(hdf5_BS.cancer.ex)
    plot(hdf5_gof)

}
if(require(bsseqData)) {
  data(BS.cancer.ex)
  BS.cancer.ex <- updateObject(BS.cancer.ex)
  gof <- poissonGoodnessOfFit(BS.cancer.ex)
  plot(gof)

#-------------------------------------------------------------------------------
# An example using a HDF5Array-backed BSseq object
#

    library(HDF5Array)

    # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details
    hdf5_BS.cancer.ex <- saveHDF5SummarizedExperiment(x = BS.cancer.ex,
                                                      dir = tempfile())
    hdf5_gof <- poissonGoodnessOfFit(hdf5_BS.cancer.ex)
    plot(hdf5_gof)

}

Class hasGRanges

Description

A class with a GRanges slot, used as a building block for other classes. Provides basic accessor functions etc.

Objects from the Class

Objects can be created by calls of the form new("hasGRanges", ...).

Slots

gr:: Object of class GRanges.

Methods

"[": Subsets a single dimension.
granges: Get the GRanges object representing genomic locations.
start,start<-,end,end<-,width,width<-: Start, end and width for the genomic locations of the object, also replacement functions. This accessor functions operate directly on the gr slot.
strand,strand<-: Getting and setting the strand of the genomic locations (the gr slot).
seqlengths,seqlengths<-: Getting and setting the seqlengths of the genomic locations (the gr slot).
seqlevels,seqlevels<-: Getting and setting the seqlevels of the genomic locations (the gr slot).
seqnames,seqnames<-: Getting and setting the seqnames of the genomic locations (the gr slot).
show: The show method.
findOverlaps: (query = "hasGRanges", subject = "hasGRanges"): finds overlaps between the granges() of the two objects.
findOverlaps: (query = "GenomicRanges", subject = "hasGRanges"): finds overlaps between query and the granges() of the subject.
findOverlaps: (query = "hasGRanges", subject = "GenomicRanges"): finds overlaps between the granges() of the query and the subject.
subsetByOverlaps: (query = "hasGRanges", subject = "hasGRanges"): Subset the query, keeping the genomic locations that overlaps the subject.
subsetByOverlaps: (query = "hasGRanges", subject = "GenomicRanges"): Subset the query, keeping the genomic locations that overlaps the subject.
subsetByOverlaps: (query = "GenomicRanges", subject = "hasGRanges"): Subset the query, keeping the genomic locations that overlaps the subject.

Note

If you extend the hasGRanges class, you should consider writing a subset method ([), and a show method. If the new class supports single index subsetting, the subsetByOverlaps methods will automatically extend.

Author(s)

Kasper Daniel Hansen [email protected]

Examples

showClass("hasGRanges")
showClass("hasGRanges")

Plotting BSmooth methylation estimates

Description

Functions for plotting BSmooth methylation estimates. Typically used to display differentially methylated regions.

Usage

plotRegion(BSseq, region = NULL, extend = 0, main = "",
  addRegions = NULL, annoTrack = NULL, cex.anno = 1,
  geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL,
  lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected",
  stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8),
  mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE,
  addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE)

plotManyRegions(BSseq, regions = NULL, extend = 0, main = "",
  addRegions = NULL, annoTrack = NULL, cex.anno = 1,
  geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL,
  lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected",
  stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8),
  mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE,
  addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE,
  verbose = TRUE)
plotRegion(BSseq, region = NULL, extend = 0, main = "",
  addRegions = NULL, annoTrack = NULL, cex.anno = 1,
  geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL,
  lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected",
  stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8),
  mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE,
  addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE)

plotManyRegions(BSseq, regions = NULL, extend = 0, main = "",
  addRegions = NULL, annoTrack = NULL, cex.anno = 1,
  geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL,
  lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected",
  stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8),
  mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE,
  addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE,
  verbose = TRUE)

Arguments

`BSseq`	An object of class `BSseq`.
`region`	A `data.frame` (with start, end and chr columns) with 1 row or `GRanges` of length 1. If `region` is `NULL` the entire `BSseq` argument is plotted.
`regions`	A `data.frame` (with start, end and chr columns) or `GRanges`.
`extend`	Describes how much the plotting region should be extended in either direction. The total width of the plot is equal to the width of the region plus twice `extend`.
`main`	The plot title. The default is to construct a title with information about which genomic region is being plotted.
`addRegions`	A set of additional regions to be highlighted on the plots. As the `regions` argument.
`annoTrack`	A named list of `GRanges` objects. Each component is a track and the names of the list are the track names. Each track will be plotted as solid bars, and we routinely display information such as CpG islands, exons, etc.
`cex.anno`	`cex` argument when plotting `annoTrack`.
`geneTrack`	EXPERIMENTAL: A `data.frame` with columns: `chr`, `start`, `end`, `gene_ID`, `exon_number`, `strand`, `gene_name`, `isoforms`. This interface is under active development and subject to change.
`cex.gene`	`cex` argument when plotting `geneTrack`.
`col`	The color of the methylation estimates, see details.
`lty`	The line type of the methylation estimates, see details.
`lwd`	The line width of the methylation estimates, see details.
`BSseqStat`	An object of class `BSseqStat`. If present, a new panel will be shown with the t-statistics.
`stat`	Which statistics will be plotted (only used is `BSseqStat` is not `NULL`.)
`stat.col`	color for the statistics plot.
`stat.lwd`	line width for the statistics plot.
`stat.lty`	line type for the statistics plot.
`stat.ylim`	y-limits for the statistics plot.
`mainWithWidth`	Should the default title include information about width of the plot region.
`regionCol`	The color used for highlighting the region.
`addTicks`	Should tick marks showing the location of methylation loci, be added?
`addPoints`	Should the individual unsmoothed methylation estimates be plotted. This usually leads to a very confusing plot, but may be useful for diagnostic purposes.
`pointsMinCov`	The minimum coverage a methylation loci need in order for the raw methylation estimates to be plotted. Useful for filtering out low coverage loci. Only used if `addPoints = TRUE`.
`highlightMain`	Should the plot region be highlighted?
`verbose`	Should the function be verbose?

Details

The correct choice of aspect ratio depends on the width of the plotting region. We tend to use width = 10, height = 5.

plotManyRegions is used to plot many regions (hundreds or thousands), and is substantially quicker than repeated calls to plotRegion.

This function has grown to be rather complicated over time. For custom plotting, it is sometimes useful to use the function definition as a skeleton and directly modify the code.

Value

This function is invoked for its side effect: producing a plot.

Author(s)

Kasper Daniel Hansen [email protected]

Parsing output from the Bismark alignment suite.

Description

Parsing output from the Bismark alignment suite.

Usage

read.bismark(files,
             loci = NULL,
             colData = NULL,
             rmZeroCov = FALSE,
             strandCollapse = TRUE,
             BPPARAM = bpparam(),
             BACKEND = NULL,
             dir = tempfile("BSseq"),
             replace = FALSE,
             chunkdim = NULL,
             level = NULL,
             nThread = 1L,
             verbose = getOption("verbose"))
read.bismark(files,
             loci = NULL,
             colData = NULL,
             rmZeroCov = FALSE,
             strandCollapse = TRUE,
             BPPARAM = bpparam(),
             BACKEND = NULL,
             dir = tempfile("BSseq"),
             replace = FALSE,
             chunkdim = NULL,
             level = NULL,
             nThread = 1L,
             verbose = getOption("verbose"))

Arguments

`files`	The path to the files created by running Bismark's methylation extractor, one sample per file. Files ending in `.gz` or `.bz2` will be automatically decompressed to `tempfile()`. We strongly recommend you use the 'genome wide cytosine report' output files. See section 'File formats' for further details.
`loci`	`NULL` (default) or a GenomicRanges instance containing methylation loci (all with width equal to 1). If `loci = NULL`, then `read.bismark()` will perform a first pass over the Bismark file to identify candidate loci. If `loci` is a GenomicRanges instance, then these form the candidate loci. In either case, the candidate loci will be filtered if `rmZeroCov = TRUE` and collapsed if `strandCollapse = TRUE` to form the final set of methylation loci that form the `rowRanges` of the returned BSseq object. See section 'Efficient use of `read.bismark()`' for further details.
`colData`	An optional DataFrame describing the samples. Row names, if present, become the column names of the BSseq object. If `NULL`, then a DataFrame will be created with `files` used as the row names.
`rmZeroCov`	A `logical(1)` indicating whether methylation loci that have zero coverage in all samples be removed. For several reasons, the default `rmZeroCov = FALSE` is recommended even in cases where you ultimately want to remove such loci. See section 'Efficient use of `read.bismark()`' for further details.
`strandCollapse`	A `logical(1)` indicating whether strand-symmetric methylation loci (i.e. CpGs) should be collapsed across strands. This is only applicable for stranded methylation loci, e.g., loci extracted from 'genome wide cytosine reports' (see section 'File formats' for further details).
`BPPARAM`	An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. Currently supported are SerialParam (Unix, Mac, Windows), MulticoreParam (Unix and Mac), SnowParam (Unix, Mac, and Windows, limited to single-machine clusters), and BatchtoolsParam (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details.
`BACKEND`	`NULL` or a single string specifying the name of the realization backend. When the backend is set to `NULL`, the `M` and `Cov` assays are realized in memory as ordinary matrices, otherwise these are realized with the given `BACKEND`. See section 'Realization backends' for further details.
`dir`	Only applicable if `BACKEND == "HDF5Array"`. The path (as a single string) to the directory where to save the HDF5-based BSseq object. The directory will be created so should not already exist, unless `replace` is set to `TRUE`.
`replace`	Only applicable if `BACKEND == "HDF5Array"`. If directory `dir` already exists, should it be replaced with a new one? The content of the existing directory will be lost!
`chunkdim`	Only applicable if `BACKEND == "HDF5Array"`. The dimensions of the chunks to use for writing the data to disk. By default, `getHDF5DumpChunkDim()` using the dimensions of the returned BSseq object will be used. See `?getHDF5DumpChunkDim` for more information.
`level`	Only applicable if `BACKEND == "HDF5Array"`. The compression level to use for writing the data to disk. By default, `getHDF5DumpCompressionLevel()` will be used. See `?getHDF5DumpCompressionLevel` for more information.
`nThread`	The number of threads used by `fread` when reading the `files`. Be careful when combining a parallel backend specified with `BPPARAM` with `nThread` > 1 because each worker will use `nThread`.
`verbose`	A `logical(1)` indicating whether progress messages should be printed (default `TRUE`).

Value

A BSseq object.

File formats

The format of each file is automatically detected using the internal function bsseq:::.guessBismarkFileType(). Files ending in .gz, .bz2, .xz, or .zip will be automatically decompressed to tempdir().

Supported file formats

Bismark's 'genome wide cytosine report' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords) and 'coverage' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords) formats are both supported. If setting loci = NULL, then we strongly recommend using the 'genome wide cytosine report' output format because this includes strand information for each locus. The 'coverage' output does not contain strand information and so the strand of the returned BSseq object will be set to * unless stranded loci are supplied.

Unsupported file formats

Neither the 'bedGraph' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords) nor the 'bismark_methylation_extractor' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords) are supported. The former does not include the required counts of methylated and unmethylated reads hile the is an intermediate file containing read-level, rather than locus-level, data on methylation.

One-based vs. zero-based genomic co-ordinates

The genomic co-ordinates of the Bismark output files may be zero-based or one-based depending on whether the --zero_based argument was used when running Bismark's methylation extractor. Furthermore, the default co-ordinate counting system varies by version of Bismark. bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.

Since Bioconductor packages typically use one-based co-ordinates, we strongly recommend that your Bismark files are also one-based.

Efficient use of `read.bismark()`

We recommend the following to achieve fast and efficient importing of Bismark files:

Specify the set of methylation loci via the loci argument.
Use Bismark files in the 'coverage' output format.
Leave rmZeroCov = FALSE.
Use a BPPARAM with a moderate number of workers (cores).
Use BACKEND = "HDF5Array".
Use multiple threads per worker (i.e. nThread > 1).

Each point is discussed below.

Specifying `loci`

Specifying the set of methylation loci via the loci argument means that read.bismark() does not need to first parse all files to identify the set of candidate loci. Provided that rmZeroCov = FALSE, this means that each file is only read once. This may be a considerable saving when there are a large number of files.

If you are unsure whether the below-described shortcuts apply to your data, leave loci = NULL and let read.bismark() identify the set of candidate loci from files.

You may wish to use the findLoci() function to find all methylation loci of interest in your reference genome (e.g., all CpGs) and then pass the result via the loci argument.

Alternatively, if all files are 'genome wide cytosine reports' for samples aligned to the same reference genome, then all files contain the exact same set of methylation loci. In this case, you may wish to first construct loci using the internal function bsseq:::.readBismarkAsFWGRanges() applied to a single file, e.g., loci = bsseq:::.readBismarkAsFWGRanges(files[1], rmZeroCov, strandCollapse).

Using the 'coverage' Bismark files

It will generally be faster to parse Bismark files in the 'coverage' output format than those in the 'genome wide cytosine report' format This is because the former only includes loci with non-zero coverage and so the file size is often considerably smaller, particularly for shallowly sequenced samples (e.g., those from single-cell bisulfite sequencing).

Leaving `rmZeroCov = FALSE`

If you set rmZeroCov = TRUE, then read.bismark() must first parse all the files to identify which loci have zero coverage in all samples and then filter these out from the set of candidate loci. This will happen even if you supply loci with a GenomicRanges of candidate loci.

Furthermore, any coverage-based filtering of methylation loci is best left until you have constructed your final BSseq object. In our experience, the final BSseq object is often the product of combining multiple BSseq objects, each constructed with a separate call to read.bismark(). In such cases, it is premature to use rmZeroCov = TRUE when running each read.bismark(); regretably, combining these objects will often then lead to an inefficiently stored BSseq object.

Using a `BPPARAM` with a moderate number of workers (cores)

Each file can be processed on its own, so you can process in parallel as many files as you have workers. However, if using the HDF5Array backend, then writing to the HDF5 file cannot be performed in parallel and so becomes the bottleneck. Nonetheless, by using a moderate number of workers (2 - 10), we can ensure there is processed data available to write to disk as soon as the current write is completed.

Using `BACKEND = "HDF5Array"`

By using the HDF5Array realization backend from HDF5Array, we reduce the amount of data that is kept in-memory at any one time. Once each file is parsed, the data are written to the HDF5 file and are no longer needed in-memory. When combined with multiple workers (cores), this means that each file will only need to read and retain in-memory 1 sample's worth of data at a time.

Conversely, if you opt for all data to be in-memory (via BACKEND = NULL), then each worker will pass each file's data back to the main process and memory usage will steadily accumulate to often unreasonable levels.

Using `nThread` > 1

read.bismark uses data.table::fread to read each file, which supports threaded-parallisation. Depending on the system, increasing nThread can achieve near-linear speed-ups in the number of threads for reading each file. Care needs to be taken when nThread > 1 is used in conjunction with a parallel backend via BPPARAM to ensure the system isn't overloaded. For example, using BPPARAM = MulticoreParam(workers = 10) with nThread = 4 may use up to 40 workers simultaneously.

Realization backends

The read.bismark() function creates a BSseq object with two assays, M and Cov. The choice of realization backend controls whether these assays are stored in-memory as an ordinary matrix or on-disk as a HDF5Array, for example. The choice of realization backend is controlled by the BACKEND argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend().

read.bismark() supports the following realization backends:

NULL (in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array (on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix from HDF5Array.

Parallelization, progress monitoring, and logging

read.bismark() now uses the BiocParallel package to implement parallelization. This brings some notable improvements:

Imported files can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a SnowParam object as the value of BPPARAM.
Detailed and extensive job logging facilities.

BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.

Author(s)

Peter Hickey [email protected]

Examples

  # Run read.bismark() on a single sample to construct a matrix-backed BSseq
  # object.
  infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
                        package = "bsseq")
  bsseq <- read.bismark(files = infile,
                        colData = DataFrame(row.names = "test_data"),
                        rmZeroCov = FALSE,
                        strandCollapse = FALSE,
                        verbose = TRUE)
  # This is a matrix-backed BSseq object.
  sapply(assays(bsseq, withDimnames = FALSE), class)
  bsseq

  ## Not run: 
  # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq
  # object (with data written to 'test_dir')
  test_dir <- tempfile("BSseq")
  bsseq <- read.bismark(files = infile,
                        colData = DataFrame(row.names = "test_data"),
                        rmZeroCov = FALSE,
                        strandCollapse = FALSE,
                        BACKEND = "HDF5Array",
                        dir = test_dir,
                        verbose = TRUE)
  # This is a HDF5Array-backed BSseq object.
  sapply(assays(bsseq, withDimnames = FALSE), class)
  # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir').
  sapply(assays(bsseq, withDimnames = FALSE), path)
  
## End(Not run)
# Run read.bismark() on a single sample to construct a matrix-backed BSseq
  # object.
  infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
                        package = "bsseq")
  bsseq <- read.bismark(files = infile,
                        colData = DataFrame(row.names = "test_data"),
                        rmZeroCov = FALSE,
                        strandCollapse = FALSE,
                        verbose = TRUE)
  # This is a matrix-backed BSseq object.
  sapply(assays(bsseq, withDimnames = FALSE), class)
  bsseq

  ## Not run: 
  # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq
  # object (with data written to 'test_dir')
  test_dir <- tempfile("BSseq")
  bsseq <- read.bismark(files = infile,
                        colData = DataFrame(row.names = "test_data"),
                        rmZeroCov = FALSE,
                        strandCollapse = FALSE,
                        BACKEND = "HDF5Array",
                        dir = test_dir,
                        verbose = TRUE)
  # This is a HDF5Array-backed BSseq object.
  sapply(assays(bsseq, withDimnames = FALSE), class)
  # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir').
  sapply(assays(bsseq, withDimnames = FALSE), path)
  
## End(Not run)

Construct BSseq objects from nanopore BED files

Description

Construct BSseq objects from nanopore BED files

Usage

read.modbam2bed(
    files,
    colData = NULL,
    rmZeroCov = FALSE,
    strandCollapse = TRUE
)
read.modbam2bed(
    files,
    colData = NULL,
    rmZeroCov = FALSE,
    strandCollapse = TRUE
)

Arguments

`files`	vector, BED files
`colData`	data frame, phenotypic data with samples as rows and variables as columns
`rmZeroCov`	A logical (1) indicating whether methylation loci that have zero coverage in all samples be removed
`strandCollapse`	A logical (1) indicating whether stand-symmetric methylation loci (i.e. CpGs) should be collapsed across strands

Details

This function reads in nanopore sequencing modified BED files to Bsseq objects. Nanopore sequencing data (i.e. aggregated modified base counts) is stored in modified-base BAM files. These modified-base BAM files are converted to bedMethyl (BED) files using modbam2bed.

Details for using modbam2bed

After installing modbam2bed, a conda environment is activated. Index files for BAM files are created using samtools index. The code requires aligned reads with the Mm and Ml tags (MM and ML also supported), and the reference sequence used for alignment (<reference.fasta>).

-e, -- extended to output canonical, modified, and filtered bases;
-m, -- mod_base=BASE to output modified base of interest, one of: 5mC, 5hmC, 5fC, 5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao. (Or modA, modC, modG, modT, modU, modN for generic modified base);
-r, --region=chr:start-end to output chromosome or genomic region of interest;
-f, --threshold=THRESHOLD to output filtered bases for probability lower than threshold (default = 0.66)

modbam2bed to Bsseq object

After creating BED files using modbam2bed, the BED files are read in and the Bsseq object is constructed via read.modbam2bed() function. The function reads in BED files, extract genomic regions, methylation, coverage, ambiguous modification status data and sample information and then construct Bsseq object using BSseq function within the package.

Value

BSseq object

Examples

files <- c(system.file("extdata/modbam2bed/ctr1.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/ctr2.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/ctr3.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/tret1.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/tret2.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/tret3.chr10.chr11.bed.gz", package = "bsseq"))
pd <- data.frame(condition = rep(c("control", "treatment"), each = 3),
    replicate = rep(c("rep1", "rep2", "rep3"), times = 2))
bsseq_nano <- bsseq::read.modbam2bed(files,colData=pd,rmZeroCov = FALSE,
    strandCollapse=TRUE)
files <- c(system.file("extdata/modbam2bed/ctr1.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/ctr2.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/ctr3.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/tret1.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/tret2.chr10.chr11.bed.gz", package = "bsseq"),
    system.file("extdata/modbam2bed/tret3.chr10.chr11.bed.gz", package = "bsseq"))
pd <- data.frame(condition = rep(c("control", "treatment"), each = 3),
    replicate = rep(c("rep1", "rep2", "rep3"), times = 2))
bsseq_nano <- bsseq::read.modbam2bed(files,colData=pd,rmZeroCov = FALSE,
    strandCollapse=TRUE)

Construct BSseq objects from nanopore BED files

Description

Construct BSseq objects from nanopore BED files

Usage

read.modkit(
    files,
    colData = NULL,
    rmZeroCov = FALSE,
    strandCollapse = TRUE
)
read.modkit(
    files,
    colData = NULL,
    rmZeroCov = FALSE,
    strandCollapse = TRUE
)

Arguments

`files`	vector, BED files
`colData`	data frame, phenotypic data with samples as rows and variables as columns
`rmZeroCov`	A logical (1) indicating whether methylation loci that have zero coverage in all samples be removed
`strandCollapse`	A logical (1) indicating whether stand-symmetric methylation loci (i.e. CpGs) should be collapsed across strands

Details

Details for modkit

Modkit outputs modified reads, unmodified reads, ambiguous modification reads (reads where the probability was below the threshold and usually failing the lowest 10th percentile), and other modified reads.

modkit to Bsseq object

After creating BED files using modkit, the BED files are read in and the Bsseq object is constructed via read.modkit() function. The function reads in BED files, extract genomic regions, methylation, coverage, ambiguous modification status data and sample information and then construct Bsseq object using BSseq function within the package. Other modification bases such as hydroxymethylation are extracted and added to the methylation matrix when present.

Value

BSseq objects

Examples

# No other modification present
files <- c(system.file("extdata/modkit/chr21.chr22.HG002.top1000.bed.gz", package = "bsseq"))
bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE)

# Other modification present
files <- c(system.file("extdata/modkit/Hypo1.first50Bed.txt",package = "bsseq"))
bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE)
# No other modification present
files <- c(system.file("extdata/modkit/chr21.chr22.HG002.top1000.bed.gz", package = "bsseq"))
bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE)

# Other modification present
files <- c(system.file("extdata/modkit/Hypo1.first50Bed.txt",package = "bsseq"))
bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE)

Package 'bsseq'

Help Index

Whole-genome bisulfite sequencing for chromosome 22 from Lister et al.

Description

Usage

Format

Details

Source

References

Examples

BSmooth, smoothing bisulfite sequence data

Description

Usage

Arguments

Details

Value

Realization backends

Parallelization and progress monitoring

Deprecated arguments

Progress monitoring

Author(s)

References

See Also

Examples

Compute t-statistics based on smoothed whole-genome bisulfite sequencing data.

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

The constructor function for BSseq objects.

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Class BSseq

Description

Objects from the Class

Slots

Methods

Utilities

Coercion

Assays

Author(s)

See Also

Examples

Class BSseqStat

Description

Usage

Arguments

Details

Objects from the Class

Slots

Extends

Methods

Utilities

Coercion

Author(s)

See Also

Class BSseqTstat

Description

Usage

Arguments

Details

Objects from the Class

Slots

Extends

Methods

Utilities

Coercion

Author(s)

See Also