| Title: | Analyze, manage and store whole-genome methylation data |
|---|---|
| Description: | A collection of tools for analyzing and visualizing whole-genome methylation data from sequencing. This includes whole-genome bisulfite sequencing and Oxford nanopore data. |
| Authors: | Kasper Daniel Hansen [aut, cre] (ORCID: <https://orcid.org/0000-0003-0086-0687>), Peter Hickey [aut] (ORCID: <https://orcid.org/0000-0002-8153-6258>), Hervé Pagès [ctb], Aaron Lun [ctb] |
| Maintainer: | Kasper Daniel Hansen <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 1.49.0 |
| Built: | 2026-05-30 09:48:09 UTC |
| Source: | https://github.com/bioc/bsseq |
This dataset represents chromosome 22 from the IMR90 cell line sequenced in Lister et al. Only CpG methylation are included (there were very few non-CpG loci). The two samples are two different extractions from the same cell line (ie. technical replicates), and are pooled in the analysis in the original paper.
data(BS.chr22)data(BS.chr22)
An object of class BSseq.
All coordinates are in hg18.
Obtained from http://neomorph.salk.edu/human_methylome/data.html
specifically the files mc_imr90_r1.tar.gz and
mc_imr90_r2.tar.gz. A script which downloads these files and
constructs the BS.chr22 object may be found in
‘inst/scripts/get_BS.chr22.R’, see the example.
R Lister et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature (2009) 462, 315-322.
data(BS.chr22) BS.chr22 script <- system.file("scripts", "get_BS.chr22.R", package = "bsseq") script readLines(script)data(BS.chr22) BS.chr22 script <- system.file("scripts", "get_BS.chr22.R", package = "bsseq") script readLines(script)
This implements the BSmooth algorithm for estimating methylation levels from bisulfite sequencing data.
BSmooth(BSseq, ns = 70, h = 1000, maxGap = 10^8, keep.se = FALSE, BPPARAM = bpparam(), chunkdim = NULL, level = NULL, verbose = getOption("verbose"))BSmooth(BSseq, ns = 70, h = 1000, maxGap = 10^8, keep.se = FALSE, BPPARAM = bpparam(), chunkdim = NULL, level = NULL, verbose = getOption("verbose"))
BSseq |
An object of class |
ns |
The minimum number of methylation loci in a smoothing window. |
h |
The minimum smoothing window, in bases. |
maxGap |
The maximum gap between two methylation loci, before the smoothing is broken across the gap. The default smoothes each chromosome separately. |
keep.se |
Should the estimated standard errors from the smoothing algorithm be kept. This will make the return object roughly 30 percent bigger and is currently not be used for anything in bsseq. |
BPPARAM |
An optional
|
chunkdim |
Only applicable if |
level |
Only applicable if |
verbose |
A |
ns and h are passed to the locfit function. The
bandwidth used is the maximum (in genomic distance) of the h
and a width big enough to contain ns number of methylation
loci.
An object of class BSseq, containing coefficients used to fit smoothed
methylation values and optionally standard errors for these.
The BSmooth() function creates a new assay to store the coefficients
used to construct the smoothed methylation estimates ((coef). An
additional assay is also created if keep.se == TRUE (se.coef).
The choice of realization backend controls whether these assay(s) are
stored in-memory as an ordinary matrix or on-disk as a
HDF5Array, for example.
The choice of realization backend is controlled by the BACKEND
argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend().
BSmooth supports the following realization backends:
NULL (in-memory): This stores each new assay in-memory using
an ordinary matrix.
HDF5Array (on-disk): This stores each new assay on-disk in a
HDF5 file using an HDF5Matrix from HDF5Array.
Please note that certain combinations of realization backend and
parallelization backend are currently not supported. For example, the
HDF5Array realization backend is currently only compatible when
used with a single-machine parallelization backend (i.e. it is not compatible
with a SnowParam that specifies an ad hoc cluster of
multiple machines). BSmooth() will issue an error when given
such incompatible realization and parallelization backends. Furthermore, to
avoid memory usage blow-ups, BSmooth() will issue an error if an
in-memory realization backend is used when smoothing a disk-backed
BSseq object.
Additional arguments related to the realization backend can be passed via the
... argument. These arguments must be named and are passed to the
relevant RealizationSink constructor. For example, the
... argument can be used to specify the path to the HDF5 file to be
used by BSmooth(). Please see the examples at the bottom of the page.
BSmooth() now uses the BiocParallel package to implement
parallelization. This brings some notable improvements:
Smoothed results can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a
SnowParam object as the value of BPPARAM.
Detailed and extensive job logging facilities.
All parallelization options are controlled via the BPPARAM argument.
In general, we recommend that users combine multicore (single-machine)
parallelization with an on-disk realization backend (see section,
'Realization backend'). For Unix and Mac users, this means using
a MulticoreParam. For Windows users, this means using a
single-machine SnowParam. Please consult the BiocParallel
documentation to take full advantage of the more advanced features.
parallelBy, mc.cores, and mc.preschedule are
deprecated and will be removed in subsequent releases of bsseq. These
arguments were necessary when BSmooth() used the parallel
package to implement parallelization, but this functionality is superseded
by the aforementioned use of BiocParallel. We recommend that users
who previously relied on these arguments switch to
BPPARAM = MulticoreParam(workers = mc.cores, progressbar = TRUE).
A useful feature of BiocParallel are progress bars to monitor the
status of long-running jobs, such as BSmooth(). Progress bars are
controlled via the progressbar argument in the
BiocParallelParam constructor. Progress bars replace the
use of the deprecated verbose argument to print out information on
the status of BSmooth().
BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.
Method and original implementation by Kasper Daniel Hansen [email protected]. Updated implementation to support disk-backed BSseq objects and more general parallelization by Peter Francis Hickey.
KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.
locfit in the locfit package, as well as
BSseq.
## Not run: # Run BSmooth() on a matrix-backed BSseq object using an in-memory realization # backend with serial evaluation. data(BS.chr22) # This is a matrix-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = SerialParam(progressbar = TRUE)) # The new 'coef' assay is an ordinary matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # Run BSmooth() on a disk-backed BSseq object using the HDF5Array realization # backend (with data written to the file 'BSmooth_example.h5') with # multi-core parallel evaluation. BS.chr22 <- realize(BS.chr22, "HDF5Array") # This is a disk-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = MulticoreParam(workers = 2, progressbar = TRUE), BACKEND = "HDF5Array", filepath = "BSmooth_example.h5") # The new 'coef' assay is an HDF5Matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # The new 'coef' assay is in the HDF5 file 'BSmooth_example.h5' (in the # current working directory). sapply(assays(BS.fit, withDimnames = FALSE), path) ## End(Not run)## Not run: # Run BSmooth() on a matrix-backed BSseq object using an in-memory realization # backend with serial evaluation. data(BS.chr22) # This is a matrix-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = SerialParam(progressbar = TRUE)) # The new 'coef' assay is an ordinary matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # Run BSmooth() on a disk-backed BSseq object using the HDF5Array realization # backend (with data written to the file 'BSmooth_example.h5') with # multi-core parallel evaluation. BS.chr22 <- realize(BS.chr22, "HDF5Array") # This is a disk-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = MulticoreParam(workers = 2, progressbar = TRUE), BACKEND = "HDF5Array", filepath = "BSmooth_example.h5") # The new 'coef' assay is an HDF5Matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # The new 'coef' assay is in the HDF5 file 'BSmooth_example.h5' (in the # current working directory). sapply(assays(BS.fit, withDimnames = FALSE), path) ## End(Not run)
Compute t-statistics based on smoothed whole-genome bisulfite sequencing data.
BSmooth.tstat(BSseq, group1, group2, estimate.var = c("same", "paired", "group2"), local.correct = TRUE, maxGap = NULL, qSd = 0.75, k = 21, mc.cores = 1, verbose = TRUE)BSmooth.tstat(BSseq, group1, group2, estimate.var = c("same", "paired", "group2"), local.correct = TRUE, maxGap = NULL, qSd = 0.75, k = 21, mc.cores = 1, verbose = TRUE)
BSseq |
An object of class |
group1 |
A vector of sample names or indexes for the ‘treatment’ group. |
group2 |
A vector of sample names or indexes for the ‘control’ group. |
estimate.var |
How is the variance estimated, see details. |
local.correct |
A logical; should local correction be used, see details. |
maxGap |
A scalar greater than 0, see details. |
qSd |
A scalar between 0 and 1, see details. |
k |
A positive scalar, see details. |
mc.cores |
The number of cores used. Note that setting
|
verbose |
Should the function be verbose? |
T-statistics are formed as the difference in means between group 1 and
group 2 divided by an estimate of the standard deviation, assuming
that the variance in the two groups are the same (same), that
we have paired samples (paired) or only estimate the variance
based on group 2 (group2). The standard deviation estimates
are then smoothed (using a running mean with a width of k) and
thresholded (using qSd which sets the minimum standard
deviation to be the qSd-quantile). Optionally, the
t-statistics are corrected for low-frequency patterns.
It is sometimes useful to use local.correct even if no large
scale changes in methylation have been found; it makes the marginal
distribution of the t-statistics more symmetric.
Additional details in the reference.
An object of class BSseqTstat.
Kasper Daniel Hansen [email protected]
KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.
BSmooth for the input object and
BSseq for its class.
BSseqTstat describes the return class. This
function is likely to be followed by the use of
dmrFinder. And finally, see the package vignette(s) for
more information on how to use it.
if(require(bsseqData)) { data(keepLoci.ex) data(BS.cancer.ex.fit) BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit) ## Remember to subset the BSseq object, see vignette for explanation BS.tstat <- BSmooth.tstat(BS.cancer.ex.fit[keepLoci.ex,], group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") BS.tstat ## This object is also stored as BS.cancer.ex.tstat in the ## bsseqData package #--------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex.fit <- saveHDF5SummarizedExperiment( x = BS.cancer.ex.fit[keepLoci.ex, ], dir = tempfile()) hdf5_BS.tstat <- BSmooth.tstat(hdf5_BS.cancer.ex.fit, group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") }if(require(bsseqData)) { data(keepLoci.ex) data(BS.cancer.ex.fit) BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit) ## Remember to subset the BSseq object, see vignette for explanation BS.tstat <- BSmooth.tstat(BS.cancer.ex.fit[keepLoci.ex,], group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") BS.tstat ## This object is also stored as BS.cancer.ex.tstat in the ## bsseqData package #--------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex.fit <- saveHDF5SummarizedExperiment( x = BS.cancer.ex.fit[keepLoci.ex, ], dir = tempfile()) hdf5_BS.tstat <- BSmooth.tstat(hdf5_BS.cancer.ex.fit, group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") }
The constructor function for BSseq objects.
BSseq(M = NULL, Cov = NULL, Filtered = NULL, coef = NULL, se.coef = NULL, trans = NULL, parameters = NULL, pData = NULL, gr = NULL, pos = NULL, chr = NULL, sampleNames = NULL, rmZeroCov = FALSE)BSseq(M = NULL, Cov = NULL, Filtered = NULL, coef = NULL, se.coef = NULL, trans = NULL, parameters = NULL, pData = NULL, gr = NULL, pos = NULL, chr = NULL, sampleNames = NULL, rmZeroCov = FALSE)
M |
A matrix-like object of methylation evidence (see 'Details' below). |
Cov |
A matrix-like object of coverage (see 'Details' below)). |
Filtered |
A matrix-like object of ambiguous modification bases obtained from modbam2bed. |
coef |
A matrix-like object of smoothing estimates (see 'Details' below). |
se.coef |
A matrix-like object of smoothing standard errors (see 'Details' below). |
trans |
A smoothing transformation. |
parameters |
A list of smoothing parameters. |
pData |
An |
sampleNames |
A vector of sample names. |
gr |
An object of type |
pos |
A vector of locations. |
chr |
A vector of chromosomes. |
rmZeroCov |
Should genomic locations with zero coverage in all samples be removed. |
The 'M', 'Cov', 'coef', and 'se.coef' matrix-like objects will be
coerced to DelayedMatrix objects; see
DelayedMatrix in the DelayedArray
package for the full list of supported matrix-like objects. We
recommend using matrix objects for in-memory storage of
data and HDF5Matrix for on-disk storage of
data.
Genomic locations are specified either through gr or through
chr and pos but not both. There should be the same
number of genomic locations as there are rows in the M and
Cov matrix.
The argument rmZeroCov may be useful in order to reduce the
size of the return object be removing methylation loci with zero
coverage.
In case one or more methylation loci appears multiple times, the
M and Cov matrices are summed over rows linked to the
same methylation loci. See the example below.
Users should never have to specify coef, se.coef,
trans, and parameters, this is for internal use (they
are added by BSmooth).
phenoData is a way to specify pheno data (as known from the
ExpressionSet and eSet classes), at a minimum
sampleNames should be given (if they are not present, the
function uses col.names(M)).
An object of class BSseq.
Kasper Daniel Hansen [email protected]
M <- matrix(0:8, 3, 3) Cov <- matrix(1:9, 3, 3) BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1,2,3), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS1 BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1,1,1), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS2 #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) hdf5_M <- realize(M, "HDF5Array") hdf5_Cov <- realize(Cov, "HDF5Array") hdf5_BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1, 2, 3), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS1 hdf5_BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1, 1, 1), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS2M <- matrix(0:8, 3, 3) Cov <- matrix(1:9, 3, 3) BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1,2,3), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS1 BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1,1,1), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS2 #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) hdf5_M <- realize(M, "HDF5Array") hdf5_Cov <- realize(Cov, "HDF5Array") hdf5_BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1, 2, 3), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS1 hdf5_BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1, 1, 1), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS2
A class for representing whole-genome or capture bisulfite sequencing data.
An object from the class links together several pieces of information.
(1) genomic locations stored as a GRanges object, a location by
samples matrix of M values, a location by samples matrix of Cov
(coverage) values, a location by samples matrix of Filtered (ambiguous
modification status) values, and phenodata information.In addition,
there are slots for representing smoothed data. This class is an
extension of RangedSummarizedExperiment
from the SummarizedExperiment package.
trans:Object of class function. This
function transforms the coef slot from the scale the
smoothing was done to the 0-1 methylation scale.
parameters:Object of class list. A list of
parameters representing for example how the data was smoothed.
signature(x = "BSseq"): Subsetting by location
(using integer indices) or sample (using integers or sample
names).
Unlike for RangedSummarizedExperiment,
length() is the number of methylation loci (equal to
length(granges(x))).
Sample names and its replacement
function for the object. This is an alias for colnames.
Obtain and replace the pData slot of the
phenoData slot. This is an alias for colData.
The show method.
This function combines two BSSeq objects.
The genomic locations of the new object is the union of the
genomic locations of the individual objects. In addition, the
methylation data matrices are placed next to each other (as
appropriate wrt. the new genomic locations) and zeros are entered
into the matrices as needed.
This class extends
RangedSummarizedExperiment
from the SummarizedExperiment package and therefore
inherits a number of useful GRanges methods that operate on the
rowRanges slot, used for accessing and setting the genomic locations
and also do subsetByOverlaps.
There are a number of almost methods-like functions for operating on
objects of class BSseq, including getBSseq,
collapseBSseq, and orderBSseq. They are detailed below.
collapseBSseq(BSseq, columns)is used to collapse an object of class BSseq. By
collapsing we simply mean that certain columns (samples) are merge
together by summing up the methylation evidence and coverage.
This is a useful function if you start by reading in a dataset
based on say flowcells and you (after QC) want to simply add a
number of flowcells into one sample. The argument columns
specify which samples are to be merged, in the following way: it
is a character vector of new sample names, and the names of the
column vector indicates which samples in the BSseq object
are to be collapsed. If columns have the same length as
the number of rows of BSseq (and has no names) it is
assumed that the ordering corresponds to the sample ordering in
BSseq.
orderBSseq(BSseq, seqOrder = NULL)simply orders an object of class BSseq according to
(increasing) genomic locations. The seqOrder vector is a
character vector of seqnames(BSseq) describing the order of
the chromosomes. This is useful for ordering chr1 before
chr10.
chrSelectBSseq(BSseq, seqnames = NULL, order = FALSE)subsets and optionally reorders an object of class BSseq.
The seqnames vector is a character vector of
seqnames(BSseq) describing which chromosomes should be
retained. If order is TRUE, the chromosomes are
also re-ordered using orderBSseq.
getBSseq(BSseq, type = c("Cov", "M", "gr", "coef",
"se.coef", "trans", "parameters"))is a general accessor: is used to obtain a specific slot of an
object of class BSseq. It is primarily intended for
internal use in the package, for users we recommend granges
to get the genomic locations, getCoverage to get the
coverage slots and getMeth to get the smoothed values (if
they exist).
hasBeenSmoothed(BSseq)This function returns a logical depending on whether or not the
BSseq object has been smoothed using BSmooth.
combineList(list, BACKEND = NULL)This function function is a faster way of using combine on
multiple BSseq objects. The input is a list, with each
component an object of class BSseq. The (slower)
alternative is to use Reduce(combine, list).
The BACKEND argument determines which backend should be used for
the 'M' and 'Cov' matrices and, if present, the 'coef' and 'se.coef'
matrices (the latter two can only be combined if all objects have the
same rowRanges). The default, BACKEND = NULL, corresponds to using
matrix objects. See
setAutoRealizationBackend (in the
DelayedArray package) for alternative backends.
strandCollapse(BSseq, shift = TRUE)This function operates on a BSseq objects which has
stranded loci (i.e. loci where the strand is one of ‘+’ or
‘-’). It will collapse the methylation and coverage
information across the two strands, unstranding the loci in the process
and potentially re-ordering them.
The argument shift indicates whether the positions for the loci
on the reverse strand should be shifted one (i.e. the positions for
these loci are the positions of the ‘G’ in the
‘CpG’; this is the case for Bismark output for example).
Package versions 1.5.2 and 1.11.1 introduced a new version of representing
‘BSseq’ objects. You can update old serialized (saved)
objects by invoking x <- updateObject(x).
This class overrides the default implementation of assays to
make it faster. Per default, no names are added to the returned data
matrices.
Assay names can conveniently be obtained by the function
assayNames(x)
Kasper Daniel Hansen [email protected]
The package vignette. BSseq for the constructor
function.
RangedSummarizedExperiment
(in the SummarizedExperiment package) for the underlying class.
getBSseq, getCoverage, and
getMeth for accessing the data stored in the object and
finally BSmooth for smoothing the bisulfite sequence
data.
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) chrSelectBSseq(BStest, seqnames = "chr1", order = TRUE) collapseBSseq(BStest, group = c("A", "A", "B")) #------------------------------------------------------------------------------- # An example using a HDF5-backed BSseq object # hdf5_BStest <- realize(BStest, "HDF5Array") chrSelectBSseq(hdf5_BStest, seqnames = "chr1", order = TRUE) collapseBSseq( BSseq = hdf5_BStest, group = c("A", "A", "B"), BACKEND = "HDF5Array", type = "integer")M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) chrSelectBSseq(BStest, seqnames = "chr1", order = TRUE) collapseBSseq(BStest, group = c("A", "A", "B")) #------------------------------------------------------------------------------- # An example using a HDF5-backed BSseq object # hdf5_BStest <- realize(BStest, "HDF5Array") chrSelectBSseq(hdf5_BStest, seqnames = "chr1", order = TRUE) collapseBSseq( BSseq = hdf5_BStest, group = c("A", "A", "B"), BACKEND = "HDF5Array", type = "integer")
These functions are provided for compatibility with older versions of ‘bsseq’ only, and will be defunct at the next release.
The following functions are deprecated and will be made defunct; use the replacement indicated below:
read.modbam2bed: read.bedMethyl
read.modkit: read.bedMethyl
A class for representing statistics for smoothed whole-genome bisulfite sequencing data.
BSseqStat(gr = NULL, stats = NULL, parameters = NULL)BSseqStat(gr = NULL, stats = NULL, parameters = NULL)
gr |
The genomic locations as an object of class |
stats |
The statistics, as a list of matrix-like objects (see 'Details' below). |
parameters |
A list of parameters. |
The matrix-like elements of the list in the 'stats' slot will be
coerced to DelayedMatrix objects; see
DelayedMatrix in the DelayedArray
package for the full list of supported matrix-like objects. We
recommend using matrix objects for in-memory
storage of data and HDF5Matrix for on-disk
storage of data.
Objects can be created by calls of the form BSseqStat(...).
However, usually objects are returned by BSmooth.fstat(...) and
not constructed by the user.
stats:This is a list of DelayedMatrix
objects with list elements representing various statistics for
methylation loci along the genome.
parameters:Object of class list. A list of
parameters representing how the statistics were computed.
gr:Object of class GRanges giving genomic
locations.
Class hasGRanges, directly.
The subsetting operator; one may only subset in one dimension, corresponding to methylation loci.
The show method.
This class extends hasGRanges and therefore inherits a number
of useful GRanges methods that operate on the gr slot,
used for accessing and setting the genomic locations and also do
subsetByOverlaps.
Package version 1.11.1 introduced a new version of representing
‘BSseqStat’ objects. You can update old serialized (saved)
objects by invoking x <- updateObject(x).
Kasper Daniel Hansen [email protected]
hasGRanges for accessing the genomic locations.
BSmooth.fstat for a function
that returns objects of class BSseqStat, and smoothSds,
computeStat and dmrFinder
for functions that operate based on these statistics. Also see
the more specialised BSseqTstat.
A class for representing t-statistics for smoothed whole-genome bisulfite sequencing data.
BSseqTstat(gr = NULL, stats = NULL, parameters = NULL)BSseqTstat(gr = NULL, stats = NULL, parameters = NULL)
gr |
The genomic locations as an object of class |
stats |
The statistics, as a matrix-like object (see 'Details' below). |
parameters |
A list of parameters. |
The 'stats' matrix-like object will be coerced to a
DelayedMatrix objects; see
DelayedMatrix in the DelayedArray
package for the full list of supported matrix-like objects. We
recommend using matrix objects for in-memory
storage of data and HDF5Matrix for on-disk
storage of data.
Objects can be created by calls of the form BSseqTstat(...).
However, usually objects are returned by BSmooth.tstat(...) and
not constructed by the user..
stats:This is a
DelayedMatrix object with columns
representing various statistics for methylation loci along the
genome.
parameters:Object of class list. A list of
parameters representing how the t-statistics were computed.
gr:Object of class GRanges giving genomic
locations.
Class hasGRanges, directly.
The subsetting operator; one may only subset in one dimension, corresponding to methylation loci.
The show method.
This class extends hasGRanges and therefore inherits a number
of useful GRanges methods that operate on the gr slot,
used for accessing and setting the genomic locations and also do
subsetByOverlaps.
Package version 1.11.1 introduced a new version of representing
‘BSseqTstat’ objects. You can update old serialized (saved)
objects by invoking x <- updateObject(x).
Kasper Daniel Hansen [email protected]
The package vignette(s). hasGRanges for accessing
the genomic locations. BSmooth.tstat for a function
that returns objects of class BSseqTstat, and dmrFinder
for a function that computes DMRs based on the t-statistics. Also see
BS.cancer.ex.tstat for an example of the
class in the bsseqData package.
if(require(bsseqData)) { data(BS.cancer.ex.tstat) dmrFinder(BS.cancer.ex.tstat) }if(require(bsseqData)) { data(BS.cancer.ex.tstat) dmrFinder(BS.cancer.ex.tstat) }
Converting a data.frame to a GRanges object. The data.frame needs columns like chr, start and end (strand is optional). Additional columns may be kept in the GRanges object.
data.frame2GRanges(df, keepColumns = FALSE, ignoreStrand = FALSE)data.frame2GRanges(df, keepColumns = FALSE, ignoreStrand = FALSE)
df |
A |
keepColumns |
In case |
ignoreStrand |
In case |
An object of class GRanges
In case df has rownames, they will be used as
names for the return object.
Kasper Daniel Hansen [email protected]
df <- data.frame(chr = "chr1", start = 1:3, end = 2:4, strand = c("+","-","+")) data.frame2GRanges(df, ignoreStrand = TRUE)df <- data.frame(chr = "chr1", start = 1:3, end = 2:4, strand = c("+","-","+")) data.frame2GRanges(df, ignoreStrand = TRUE)
Finds differentially methylated regions for whole genome bisulfite sequencing data. Essentially identifies regions of the genome where all methylation loci have an associated t-statistic that is beyond a (low, high) cutoff.
dmrFinder(bstat, cutoff = NULL, qcutoff = c(0.025, 0.975), maxGap=300, stat = "tstat.corrected", verbose = TRUE)dmrFinder(bstat, cutoff = NULL, qcutoff = c(0.025, 0.975), maxGap=300, stat = "tstat.corrected", verbose = TRUE)
bstat |
An object of class |
cutoff |
The cutoff of the t-statistics. This should be a vector
of length two giving the (low, high) cutoff. If |
qcutoff |
In case |
maxGap |
If two methylation loci are separated by this distance, break a possible DMR. This guarantees that the return DMRs have CpGs that are this distance from each other. |
stat |
Which statistic should be used? |
verbose |
Should the function be verbose? |
The workhorse function is BSmooth.tstat which sets up a
t-statistic for a comparison between two groups.
Note that post-processing of these DMRs are likely to be necessary, filtering for example for length (or number of loci).
A data.frame with columns
start, end, width, chr
|
genomic locations and width. |
n |
The number of methylation loci. |
invdensity |
Average length per loci. |
group1.mean |
The mean methylation level across samples and loci in 'group1'. |
group2.mean |
The mean methylation level across samples and loci in 'group2'. |
meanDiff |
The mean difference in methylation level; the
difference between |
idxStart, idxEnd, cluster
|
Internal use. |
areaStat |
The 'area' of the t-statistic; equal to the sum of the t-statistics for the individual methylation loci. |
direction |
either ‘hyper’ or ‘hypo’. |
areaStat.corrected |
Only present if |
Kasper Daniel Hansen [email protected].
KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.
BSmooth.tstat for the function constructing the input
object, and BSseqTstat for its class. In the
example below, we use BS.cancer.ex.tstat as
the actual input object. Also see the package vignette(s) for a
detailed example.
if(require(bsseqData)) { dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6), verbose = TRUE) dmrs <- subset(dmrs0, abs(meanDiff) > 0.1 & n >= 3) }if(require(bsseqData)) { dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6), verbose = TRUE) dmrs <- subset(dmrs0, abs(meanDiff) > 0.1 & n >= 3) }
This function estimates the CpG-specific error rate from a single sample BSseq object generated using read.bedMethyl.
estimateErrorRate(BSseq, minCov = 10, maxCov = 100, minRatio = 0.8, plotErrorProfile = FALSE)estimateErrorRate(BSseq, minCov = 10, maxCov = 100, minRatio = 0.8, plotErrorProfile = FALSE)
BSseq |
A single sample object of class |
minCov |
A non-negative integer specifying the minimum coverage required for CpG loci to be considered. |
maxCov |
A non-negative integer specifying the maximum coverage allowed for CpG loci to be considered. |
minRatio |
A numeric value between 0 and 1 specifying the minimum ratio of CpG sites to non-CpG sites for a loci to be considered. |
plotErrorProfile |
A logical value indicating whether to plot the CpG to non-CpG ratio distribution for the filtered sites. |
A numeric value representing the estimated CpG-specific error rate the BSseq object.
Søren Blikdal Hansen ([email protected])
BSseq for the BSseq class,
read.bedMethyl for details on reading data into a BSseq object.
# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Estimate error rate estimateErrorRate(bsseq[, 1], plotErrorProfile = FALSE)# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Estimate error rate estimateErrorRate(bsseq[, 1], plotErrorProfile = FALSE)
This is a convenience function to find methylation loci, such as CpGs, in a reference genome.
The result is useful as the value of the loci argument for read.bismark().
findLoci(pattern, subject, include = seqlevels(subject), strand = c("*", "+", "-"), fixed = "subject", resize = TRUE)findLoci(pattern, subject, include = seqlevels(subject), strand = c("*", "+", "-"), fixed = "subject", resize = TRUE)
pattern |
A string specifying the pattern to search for, e.g. |
subject |
A string containing a file path to the genome sequence, in FASTA or 2bit format, to be searched.
Alternatively, a |
include |
A character vector indicating the seqlevels of |
strand |
A character scaler specifying the strand of |
fixed |
If |
resize |
A logical scalar specifying whether the ranges should be shifted to have width 1 and anchored by the start of the locus, e.g., resize a CpG dinucleotide to give the co-ordinates of the cytosine. |
This function provides a convenience wrapper for finding methylation loci in
a genome, based on running
vmatchPattern().
Users requiring finer-grained control should directly use the
vmatchPattern() function and coerce
the result to a GRanges object.
A GRanges object storing the found loci.
Peter F. Hickey
Biostrings::vmatchPattern()
?BSgenome::`BSgenome-utils`
library(Biostrings) # Find CpG dinucleotides on the both strands of an artificial sequence my_seq <- DNAStringSet("ATCAGTCGC") names(my_seq) <- "test" findLoci(pattern = "CG", subject = my_seq) # Find CHG trinucleotides on the both strands of an artificial sequence findLoci(pattern = "CHG", subject = my_seq) # Find CpG dinucleotides on the + strand of chr17 from the human genome (hg38) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { findLoci( pattern = "CG", subject = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, include = "chr17", strand = "+") }library(Biostrings) # Find CpG dinucleotides on the both strands of an artificial sequence my_seq <- DNAStringSet("ATCAGTCGC") names(my_seq) <- "test" findLoci(pattern = "CG", subject = my_seq) # Find CHG trinucleotides on the both strands of an artificial sequence findLoci(pattern = "CHG", subject = my_seq) # Find CpG dinucleotides on the + strand of chr17 from the human genome (hg38) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { findLoci( pattern = "CG", subject = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, include = "chr17", strand = "+") }
A function to compute Fisher-tests for an object of class BSseq.
fisherTests(BSseq, group1, group2, lookup = NULL, returnLookup = TRUE, mc.cores = 1, verbose = TRUE)fisherTests(BSseq, group1, group2, lookup = NULL, returnLookup = TRUE, mc.cores = 1, verbose = TRUE)
BSseq |
An object of class |
group1 |
A vector of sample names or indexes for the ‘treatment’ group. |
group2 |
A vector of sample names or indexes for the ‘control’ group. |
lookup |
A ‘lookup’ object, see details. |
returnLookup |
Should a ‘lookup’ object be returned, see details. |
mc.cores |
The number of cores used. Note that setting
|
verbose |
Should the function be verbose. |
This function computes row-wise Fisher's exact tests. It uses an
internal lookup table so rows which forms equivalent 2x2 tables are
group together and only a single test is computed. If
returnLookup is TRUE the return object contains the
lookup table which may be feed to another call to the function using
the lookup argument.
If group1, group2 designates more than 1 sample, the
samples are added together before testing.
This function can use multiple cores on the same computer.
This test cannot model biological variability.
if returnLookup is TRUE, a list with components
results and lookup, otherwise just the results
component. The results (component) is a matrix with the same
number of rows as the BSseq argument and 2 columns
p.value (the unadjusted p-values) and log2OR (log2
transformation of the odds ratio).
Kasper Daniel Hansen [email protected]
fisher.test for information about Fisher's test.
mclapply for the mc.cores argument.
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) results <- fisherTests(BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) results #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BStest <- saveHDF5SummarizedExperiment(x = BStest, dir = tempfile()) results <- fisherTests(hdf5_BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) resultsM <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) results <- fisherTests(BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) results #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BStest <- saveHDF5SummarizedExperiment(x = BStest, dir = tempfile()) results <- fisherTests(hdf5_BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) results
Classes for fixed-width IRanges and GRanges, ie. objects where all ranges have the same width. The intention is for these classes to be added to GenomicRanges. Documented here temporarily.
See description. Otherwise works like IRanges and GRanges, except there are many unimplemented methods.
This is really a private class, with private methods, but R's NAMESPACE handling means they get unintentionally exported. Hence this documentation.
showClass("FWIRanges")showClass("FWIRanges")
Obtain coverage for BSseq objects.
getCoverage(BSseq, regions = NULL, type = c("Cov", "M"), what = c("perBase", "perRegionAverage", "perRegionTotal"), withDimnames = TRUE)getCoverage(BSseq, regions = NULL, type = c("Cov", "M"), what = c("perBase", "perRegionAverage", "perRegionTotal"), withDimnames = TRUE)
BSseq |
An object of class |
regions |
An optional |
type |
This returns either coverage or the total evidence for methylation at the loci. |
what |
The type of return object, see details. |
withDimnames |
A |
NOTE: The return type of getCoverage varies depending on its
arguments.
If regions are not specified (regions = NULL) a
DelayedMatrix object (what =
"perBase") is returned. This will either contain the per-base
coverage, the average coverage, or the genome total coverage
(depending on value of what).
If what = "perBase" and regions are specified, a list is
returned. Each element of the list is a
DelayedMatrix object corresponding to the
genomic loci inside the region. It is conceptually the same as
splitting the coverage by region.
If what = "perRegionAverage" or what = "perRegionTotal"
and regions are specified the return value is a
DelayedMatrix object. Each row of the
DelayedMatrix corresponds to a region and
contains either the average coverage or the total coverage in the
region.
Kasper Daniel Hansen [email protected].
BSseq for the BSseq class.
data(BS.chr22) head(getCoverage(BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) getCoverage(BS.chr22, regions = reg, what = "perRegionAverage") cList <- getCoverage(BS.chr22, regions = reg) length(cList) head(cList[[1]]) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getCoverage(hdf5_BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2 * 10 ^ 7), end = c(2 * 10 ^ 7 + 1, 4 * 10 ^ 7))) getCoverage(hdf5_BS.chr22, regions = reg, what = "perRegionAverage") hdf5_cList <- getCoverage(hdf5_BS.chr22, regions = reg) length(hdf5_cList) head(hdf5_cList[[1]])data(BS.chr22) head(getCoverage(BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) getCoverage(BS.chr22, regions = reg, what = "perRegionAverage") cList <- getCoverage(BS.chr22, regions = reg) length(cList) head(cList[[1]]) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getCoverage(hdf5_BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2 * 10 ^ 7), end = c(2 * 10 ^ 7 + 1, 4 * 10 ^ 7))) getCoverage(hdf5_BS.chr22, regions = reg, what = "perRegionAverage") hdf5_cList <- getCoverage(hdf5_BS.chr22, regions = reg) length(hdf5_cList) head(hdf5_cList[[1]])
This function generates a matrix of the most likely CpG status for all loci and samples in a BSseq object. Each element of the matrix represents the most likely CpG status (0 for homozygous CpG, 1 for heterozygous CpG, and 2 for non-CpG or if allCpG = TRUE 0 for homozygous or heterozygous CpG, and 2 for non-CpG) for a specific locus and sample.
getCpGMatrix(BSseq, e = NULL, allCpG = FALSE)getCpGMatrix(BSseq, e = NULL, allCpG = FALSE)
BSseq |
An object of class |
e |
An optional numeric vector representing error rates for each sample. If |
allCpG |
A logical value indicating whether to classify loci as allCpG (i.e. combine homozygous or heterozygous CpG) and non-CpG based on their likelihoods. Should be the same for |
A numeric matrix where each row represents a locus, and each column represents a sample, and the values correspond to the CpG status (same order as the BSseq object in input).
Søren Blikdal Hansen ([email protected])
BSseq for the BSseq class,
read.bedMethyl for details on reading data into a BSseq object,
estimateErrorRate for estimating the CpG-specific error rate.
getCpGs for filtering a single-sample BSseg object.
getMaxLikelihoodMatrix for generating a matrix with the maximum scaled likelihoods matching the CpGMatrix.
# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Single samples can be filtered using the getCpGs function bsseq_nano <- bsseq[, 1] bsseq_nano_99All_filtered <- bsseq[getCpGs(bsseq_nano, type = "allCpG", threshold = 0.99)] bsseq_pacbio <- bsseq[, 2] bsseq_pacbio_99All_filtered <- bsseq[getCpGs(bsseq_pacbio, type = "allCpG", threshold = 0.99)] # For filtering multiple samples, we can use a CpGMatrix and a MaxLikelihoodMatrix # Construct the CpGMatrix and getMaxLikelihoodMatrix for the bsseq object CpGMatrix <- getCpGMatrix(bsseq, allCpG = TRUE) MaxLikelihoodMatrix <- getMaxLikelihoodMatrix(bsseq, allCpG = TRUE) # Filter for allCpG loci with a likelihood > 0.99 in both samples bsseq_combined_99All_filtered <- bsseq[which(rowAlls(CpGMatrix == 0) & rowMins(MaxLikelihoodMatrix) > 0.99)]# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Single samples can be filtered using the getCpGs function bsseq_nano <- bsseq[, 1] bsseq_nano_99All_filtered <- bsseq[getCpGs(bsseq_nano, type = "allCpG", threshold = 0.99)] bsseq_pacbio <- bsseq[, 2] bsseq_pacbio_99All_filtered <- bsseq[getCpGs(bsseq_pacbio, type = "allCpG", threshold = 0.99)] # For filtering multiple samples, we can use a CpGMatrix and a MaxLikelihoodMatrix # Construct the CpGMatrix and getMaxLikelihoodMatrix for the bsseq object CpGMatrix <- getCpGMatrix(bsseq, allCpG = TRUE) MaxLikelihoodMatrix <- getMaxLikelihoodMatrix(bsseq, allCpG = TRUE) # Filter for allCpG loci with a likelihood > 0.99 in both samples bsseq_combined_99All_filtered <- bsseq[which(rowAlls(CpGMatrix == 0) & rowMins(MaxLikelihoodMatrix) > 0.99)]
This function identifies CpG (and non-CpG) loci from a single sample BSseq object using scaled likelihoods computed from the count of CpG sites and Non-CpG sites mapped to the loci, based on the specified type, and minimum scaled likelihood (threshold). The function only work for BSseq objects generated using read.bedMethyl.
getCpGs(BSseq, type = c("homozygous", "heterozygous", "allCpG", "nonCpG"), threshold = 0.99, e = NULL)getCpGs(BSseq, type = c("homozygous", "heterozygous", "allCpG", "nonCpG"), threshold = 0.99, e = NULL)
BSseq |
An single sample object of class |
type |
A character string specifying the type of loci to extract. Must be one of |
threshold |
A numeric value between 0 and 1 specifying the minimum likelihood threshold required for loci to be included. |
e |
An optional numeric value representing the error rate. If |
An integer vector of indices representing the loci that match the criteria.
Søren Blikdal Hansen ([email protected])
BSseq for the BSseq class,
read.bedMethyl for details on reading data into a BSseq object,
estimateErrorRate for estimating the CpG-specific error rate.
# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Filter CpG sites for the Nanopore dataset bsseq_nano <- bsseq[, 1] bsseq_nano_99All_filtered <- bsseq[getCpGs(bsseq_nano, type = "allCpG", threshold = 0.99)] # Filter CpG sites for the PacBio dataset bsseq_pacbio <- bsseq[, 2] bsseq_pacbio_99All_filtered <- bsseq[getCpGs(bsseq_pacbio, type = "allCpG", threshold = 0.99)]# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Filter CpG sites for the Nanopore dataset bsseq_nano <- bsseq[, 1] bsseq_nano_99All_filtered <- bsseq[getCpGs(bsseq_nano, type = "allCpG", threshold = 0.99)] # Filter CpG sites for the PacBio dataset bsseq_pacbio <- bsseq[, 2] bsseq_pacbio_99All_filtered <- bsseq[getCpGs(bsseq_pacbio, type = "allCpG", threshold = 0.99)]
This function generates a matrix of the scaled likelihoods for most likely CpG status for a multi-sample BSseq object. Each element of the matrix represents the scaled likelihood of the most likely CpG status for the locus in the sample. If no data is available for a locus in a sample, the entry in the CpGMatrix is 2 (nonCpG) and the corresponding MaxLikelihood is 1/3.
getMaxLikelihoodMatrix(BSseq, e = NULL, allCpG = FALSE)getMaxLikelihoodMatrix(BSseq, e = NULL, allCpG = FALSE)
BSseq |
An object of class |
e |
An optional numeric vector representing error rates for each sample. If |
allCpG |
A logical value indicating whether to classify loci as allCpG and non-CpG loci and sum the scaled likelihood of homozygous CpG and heterozygous CpG. Should be the same for |
A numeric matrix where each row represents a locus, each column represents a sample, and the values correspond to the quality scores.
Søren Blikdal Hansen ([email protected])
BSseq for the BSseq class,
read.bedMethyl for details on reading data into a BSseq object,
estimateErrorRate for estimating the CpG-specific error rate.
getCpGs for filtering a single-sample BSseg object.
getCpGMatrix for generating a matrix with the most likely CpG status matching the getMaxLikelihoodMatrix.
# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Single samples can be filtered using the getCpGs function bsseq_nano <- bsseq[, 1] bsseq_nano_99All_filtered <- bsseq[getCpGs(bsseq_nano, type = "allCpG", threshold = 0.99)] bsseq_pacbio <- bsseq[, 2] bsseq_pacbio_99All_filtered <- bsseq[getCpGs(bsseq_pacbio, type = "allCpG", threshold = 0.99)] # For filtering multiple samples, we can use a CpGMatrix and a MaxLikelihoodMatrix # Construct the CpGMatrix and QualityMatrix for the bsseq object CpGMatrix <- getCpGMatrix(bsseq, allCpG = TRUE) MaxLikelihoodMatrix <- getMaxLikelihoodMatrix(bsseq, allCpG = TRUE) # Filter for allCpG loci with a likelihood > 0.99 in both samples bsseq_combined_99All_filtered <- bsseq[which(rowAlls(CpGMatrix == 0) & rowMins(MaxLikelihoodMatrix) > 0.99)]# Example input files infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), strandCollapse = TRUE, verbose = TRUE) # Single samples can be filtered using the getCpGs function bsseq_nano <- bsseq[, 1] bsseq_nano_99All_filtered <- bsseq[getCpGs(bsseq_nano, type = "allCpG", threshold = 0.99)] bsseq_pacbio <- bsseq[, 2] bsseq_pacbio_99All_filtered <- bsseq[getCpGs(bsseq_pacbio, type = "allCpG", threshold = 0.99)] # For filtering multiple samples, we can use a CpGMatrix and a MaxLikelihoodMatrix # Construct the CpGMatrix and QualityMatrix for the bsseq object CpGMatrix <- getCpGMatrix(bsseq, allCpG = TRUE) MaxLikelihoodMatrix <- getMaxLikelihoodMatrix(bsseq, allCpG = TRUE) # Filter for allCpG loci with a likelihood > 0.99 in both samples bsseq_combined_99All_filtered <- bsseq[which(rowAlls(CpGMatrix == 0) & rowMins(MaxLikelihoodMatrix) > 0.99)]
Obtain methylation estimates for BSseq objects, both smoothed and raw.
getMeth(BSseq, regions = NULL, type = c("smooth", "raw"), what = c("perBase", "perRegion"), confint = FALSE, alpha = 0.95, withDimnames = TRUE)getMeth(BSseq, regions = NULL, type = c("smooth", "raw"), what = c("perBase", "perRegion"), confint = FALSE, alpha = 0.95, withDimnames = TRUE)
BSseq |
An object of class |
regions |
An optional |
type |
This returns either smoothed or raw estimates of the methylation level. |
what |
The type of return object, see details. |
confint |
Should a confidence interval be return for the
methylation estimates (see below). This is only supported if
|
alpha |
alpha value for the confidence interval. |
withDimnames |
A |
NOTE: The return type of getMeth varies depending on its
arguments.
If region = NULL the what argument is ignored. This is
also the only situation in which confint = TRUE is supported.
The return value is either a DelayedMatrix
(confint = FALSE or a list with three
DelayedMatrix components confint =
TRUE (meth, upper and lower), giving the
methylation estimates and (optionally) confidence intervals.
Confidence intervals for type = "smooth" is based on standard
errors from the smoothing algorithm (if present). Otherwise it is
based on pointwise confidence intervals for binomial distributions
described in Agresti (see below), specifically the score confidence
interval.
If regions are specified, what = "perBase" will make the
function return a list, each element of the list being a
DelayedMatrix corresponding to a genomic
region (and each row of the DelayedMatrix
being a loci inside the region). If what = "perRegion" the
function returns a DelayedMatrix, with
each row corresponding to a region and containing the average
methylation level in that region.
A BSseq object needs to be smoothed by the function
BSmooth in order to support type = "smooth".
Kasper Daniel Hansen [email protected].
A Agresti and B Coull. Approximate Is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician (1998) 52:119-126.
BSseq for the BSseq class and
BSmooth for smoothing such an object.
data(BS.chr22) head(getMeth(BS.chr22, type = "raw")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getMeth(BS.chr22, regions = reg, type = "raw", what = "perBase")) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getMeth(hdf5_BS.chr22, type = "raw")) head(getMeth(hdf5_BS.chr22, regions = reg, type = "raw", what = "perBase"))data(BS.chr22) head(getMeth(BS.chr22, type = "raw")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getMeth(BS.chr22, regions = reg, type = "raw", what = "perBase")) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getMeth(hdf5_BS.chr22, type = "raw")) head(getMeth(hdf5_BS.chr22, regions = reg, type = "raw", what = "perBase"))
Essentially an accessor function for the statistics of a BSseqTstat
object.
getStats(bstat, regions = NULL, ...)getStats(bstat, regions = NULL, ...)
bstat |
An object of class |
regions |
An optional |
... |
Additional arguments passed to the different backends based
on the class of |
Additional argument when the bstat object is of class BSseqTstat:
Which statistics column should be obtained.
An object of class data.frame possible restricted to the
regions specified.
Kasper Daniel Hansen [email protected]
BSseqTstat for the BSseqTstat class, and
getCoverage and getMeth for similar
functions, operating on objects of class BSseq.
if(require(bsseqData)) { data(BS.cancer.ex.tstat) head(getStats(BS.cancer.ex.tstat)) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getStats(BS.cancer.ex.tstat, regions = reg)) }if(require(bsseqData)) { data(BS.cancer.ex.tstat) head(getStats(BS.cancer.ex.tstat)) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getStats(BS.cancer.ex.tstat, regions = reg)) }
Binomial and poisson goodness of fit statistics for BSSeq objects, including plotting capability.
poissonGoodnessOfFit(BSseq, nQuantiles = 10^5) binomialGoodnessOfFit(BSseq, method = c("MLE"), nQuantiles = 10^5) ## S3 method for class 'chisqGoodnessOfFit' print(x, ...) ## S3 method for class 'chisqGoodnessOfFit' plot(x, type = c("chisq", "pvalue"), plotCol = TRUE, qqline = TRUE, pch = 16, cex = 0.75, ...)poissonGoodnessOfFit(BSseq, nQuantiles = 10^5) binomialGoodnessOfFit(BSseq, method = c("MLE"), nQuantiles = 10^5) ## S3 method for class 'chisqGoodnessOfFit' print(x, ...) ## S3 method for class 'chisqGoodnessOfFit' plot(x, type = c("chisq", "pvalue"), plotCol = TRUE, qqline = TRUE, pch = 16, cex = 0.75, ...)
BSseq |
An object of class |
x |
A chisqGoodnessOfFit object (as produced by
|
nQuantiles |
The number of (evenly-spaced) quantiles stored in the return object. |
method |
How is the parameter estimated. |
type |
Are the chisq or the p-values being plotted. |
plotCol |
Should the extreme quantiles be colored. |
qqline |
Add a |
pch, cex
|
Plotting symbols and size. |
... |
Additional arguments being passed to |
These functions compute and plot goodness of fit statistics for
BSseq objects. For each methylation loci, the Poisson
goodness of fit statistic tests whether the coverage (at that loci) is
independent and identically Poisson distributed across the samples.
In a similar fashion, the Binomial goodness of fit statistic tests
whether the number of reads supporting methylation are independent and
identically binomial distributed across samples (with different size
parameters given by the coverage vector).
These functions do not handle NA values.
The plotting method is invoked for its side effect. Both
poissonGoodnessOfFit and binomialGoodnessOfFit returns
an object of class chisqGoodnessOfFit which is a list with components
chisq |
a vector of Chisq values. |
quantiles |
a vector of quantiles (of the chisq values). |
df |
degress of freedom |
Kasper Daniel Hansen [email protected]
For the plotting method, see qqplot.
if(require(bsseqData)) { data(BS.cancer.ex) BS.cancer.ex <- updateObject(BS.cancer.ex) gof <- poissonGoodnessOfFit(BS.cancer.ex) plot(gof) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex <- saveHDF5SummarizedExperiment(x = BS.cancer.ex, dir = tempfile()) hdf5_gof <- poissonGoodnessOfFit(hdf5_BS.cancer.ex) plot(hdf5_gof) }if(require(bsseqData)) { data(BS.cancer.ex) BS.cancer.ex <- updateObject(BS.cancer.ex) gof <- poissonGoodnessOfFit(BS.cancer.ex) plot(gof) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex <- saveHDF5SummarizedExperiment(x = BS.cancer.ex, dir = tempfile()) hdf5_gof <- poissonGoodnessOfFit(hdf5_BS.cancer.ex) plot(hdf5_gof) }
A class with a GRanges slot, used as a building block for other classes. Provides basic accessor functions etc.
Objects can be created by calls of the form new("hasGRanges", ...).
gr:Object of class GRanges.
Subsets a single dimension.
Get the GRanges object representing
genomic locations.
Start, end and width
for the genomic locations of the object, also replacement
functions. This accessor functions operate directly on the
gr slot.
Getting and setting the
strand of the genomic locations (the gr slot).
Getting and setting the
seqlengths of the genomic locations (the gr slot).
Getting and setting the
seqlevels of the genomic locations (the gr slot).
Getting and setting the
seqnames of the genomic locations (the gr slot).
The show method.
(query = "hasGRanges", subject =
"hasGRanges"): finds overlaps between the granges() of
the two objects.
(query = "GenomicRanges", subject =
"hasGRanges"): finds overlaps between query and the granges() of
the subject.
(query = "hasGRanges", subject =
"GenomicRanges"): finds overlaps between the granges()
of the query and the subject.
(query = "hasGRanges", subject =
"hasGRanges"): Subset the query, keeping the genomic locations that
overlaps the subject.
(query = "hasGRanges", subject =
"GenomicRanges"): Subset the query, keeping the genomic
locations that overlaps the subject.
(query = "GenomicRanges",
subject = "hasGRanges"): Subset the query, keeping the genomic
locations that overlaps the subject.
If you extend the hasGRanges class, you should consider writing
a subset method ([), and a show method. If the new
class supports single index subsetting, the subsetByOverlaps
methods will automatically extend.
Kasper Daniel Hansen [email protected]
showClass("hasGRanges")showClass("hasGRanges")
Functions for plotting BSmooth methylation estimates. Typically used to display differentially methylated regions.
plotRegion(BSseq, region = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE) plotManyRegions(BSseq, regions = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE, verbose = TRUE)plotRegion(BSseq, region = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE) plotManyRegions(BSseq, regions = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE, verbose = TRUE)
BSseq |
An object of class |
region |
A |
regions |
A |
extend |
Describes how much the plotting region should be
extended in either direction. The total width of the plot is equal to
the width of the region plus twice |
main |
The plot title. The default is to construct a title with information about which genomic region is being plotted. |
addRegions |
A set of additional regions to be highlighted on the
plots. As the |
annoTrack |
A named list of |
cex.anno |
|
geneTrack |
EXPERIMENTAL: A |
cex.gene |
|
col |
The color of the methylation estimates, see details. |
lty |
The line type of the methylation estimates, see details. |
lwd |
The line width of the methylation estimates, see details. |
BSseqStat |
An object of class |
stat |
Which statistics will be plotted (only used is
|
stat.col |
color for the statistics plot. |
stat.lwd |
line width for the statistics plot. |
stat.lty |
line type for the statistics plot. |
stat.ylim |
y-limits for the statistics plot. |
mainWithWidth |
Should the default title include information about width of the plot region. |
regionCol |
The color used for highlighting the region. |
addTicks |
Should tick marks showing the location of methylation loci, be added? |
addPoints |
Should the individual unsmoothed methylation estimates be plotted. This usually leads to a very confusing plot, but may be useful for diagnostic purposes. |
pointsMinCov |
The minimum coverage a methylation loci need in
order for the raw methylation estimates to be plotted. Useful for
filtering out low coverage loci. Only used if |
highlightMain |
Should the plot region be highlighted? |
verbose |
Should the function be verbose? |
The correct choice of aspect ratio depends on the width of the
plotting region. We tend to use width = 10, height = 5.
plotManyRegions is used to plot many regions (hundreds or
thousands), and is substantially quicker than repeated calls to
plotRegion.
This function has grown to be rather complicated over time. For custom plotting, it is sometimes useful to use the function definition as a skeleton and directly modify the code.
This function is invoked for its side effect: producing a plot.
Kasper Daniel Hansen [email protected]
The package vignette has an extended example.
Parsing bedMethyl output from modkit pileup.
read.bedMethyl(files, loci = NULL, colData = NULL, rmZeroCov = TRUE, strandCollapse = TRUE, BPPARAM = bpparam(), BACKEND = NULL, dir = tempfile("BSseq"), replace = FALSE, chunkdim = NULL, level = NULL, nThread = 1L, verbose = getOption("verbose"))read.bedMethyl(files, loci = NULL, colData = NULL, rmZeroCov = TRUE, strandCollapse = TRUE, BPPARAM = bpparam(), BACKEND = NULL, dir = tempfile("BSseq"), replace = FALSE, chunkdim = NULL, level = NULL, nThread = 1L, verbose = getOption("verbose"))
files |
The path to the files created by running modkit pileup, one sample per file. See the methods section of [link to preprint] for validated output. |
loci |
|
colData |
An optional |
rmZeroCov |
A |
strandCollapse |
A |
BPPARAM |
An optional |
BACKEND |
|
dir |
Only applicable if |
replace |
Only applicable if |
chunkdim |
Only applicable if |
level |
The compression level to use for writing the data to disk. |
nThread |
The number of threads used by |
verbose |
A |
The format of each file should be similar to the examples in [link to preprint]. Files ending in .gz, .bz2, .xz, or .zip will be automatically decompressed to tempdir().
Modkit bedMethyl files from modkit pileup. For downstream likelihood functions we recommend running modkit pileup on output from bam files modification/basecalled using a CG context model and not using a reference genome for pileup.
Other types of output.
The genomic co-ordinates of bedMethyl files are zero-based. Since Bioconductor packages typically use one-based co-ordinates, the co-ordinates from the bedMethyl files are converted to one-based in the BSseq object.
Søren Blikdal Hansen ([email protected])
# Example: Reading bedMethyl files included in the bsseq package # Paths to example bedMethyl files in the package's extdata directory infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), rmZeroCov = FALSE, strandCollapse = TRUE, verbose = TRUE) # View the resulting BSseq object bsseq# Example: Reading bedMethyl files included in the bsseq package # Paths to example bedMethyl files in the package's extdata directory infiles <- c(system.file("extdata/HG002_nanopore_test.bedMethyl.gz", package = "bsseq"), system.file("extdata/HG002_pacbio_test.bedMethyl.gz", package = "bsseq")) # Run the function to import data bsseq <- read.bedMethyl(files = infiles, colData = DataFrame(row.names = c("test_nanopore", "test_pacbio")), rmZeroCov = FALSE, strandCollapse = TRUE, verbose = TRUE) # View the resulting BSseq object bsseq
Parsing output from the Bismark alignment suite.
read.bismark(files, loci = NULL, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE, BPPARAM = bpparam(), BACKEND = NULL, dir = tempfile("BSseq"), replace = FALSE, chunkdim = NULL, level = NULL, nThread = 1L, verbose = getOption("verbose"))read.bismark(files, loci = NULL, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE, BPPARAM = bpparam(), BACKEND = NULL, dir = tempfile("BSseq"), replace = FALSE, chunkdim = NULL, level = NULL, nThread = 1L, verbose = getOption("verbose"))
files |
The path to the files created by running Bismark's methylation
extractor, one sample per file.
Files ending in |
loci |
|
colData |
An optional |
rmZeroCov |
A |
strandCollapse |
A |
BPPARAM |
An optional
|
BACKEND |
|
dir |
Only applicable if |
replace |
Only applicable if |
chunkdim |
Only applicable if |
level |
Only applicable if |
nThread |
The number of threads used by |
verbose |
A |
A BSseq object.
The format of each file is automatically detected using the internal function bsseq:::.guessBismarkFileType().
Files ending in .gz, .bz2, .xz, or .zip will be automatically decompressed to tempdir().
Bismark's 'genome wide cytosine report' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords) and 'coverage' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords) formats are both supported.
If setting loci = NULL, then we strongly recommend using the 'genome wide cytosine report' output format because this includes strand information for each locus.
The 'coverage' output does not contain strand information and so the strand of the returned BSseq object will be set to * unless stranded loci are supplied.
Neither the 'bedGraph' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords) nor the 'bismark_methylation_extractor' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords) are supported. The former does not include the required counts of methylated and unmethylated reads hile the is an intermediate file containing read-level, rather than locus-level, data on methylation.
The genomic co-ordinates of the Bismark output files may be zero-based or one-based depending on whether the --zero_based argument was used when running Bismark's methylation extractor.
Furthermore, the default co-ordinate counting system varies by version of Bismark.
bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.
Since Bioconductor packages typically use one-based co-ordinates, we strongly recommend that your Bismark files are also one-based.
read.bismark()
We recommend the following to achieve fast and efficient importing of Bismark files:
Specify the set of methylation loci via the loci argument.
Use Bismark files in the 'coverage' output format.
Leave rmZeroCov = FALSE.
Use a BPPARAM with a moderate number of workers (cores).
Use BACKEND = "HDF5Array".
Use multiple threads per worker (i.e. nThread > 1).
Each point is discussed below.
loci
Specifying the set of methylation loci via the loci argument means that read.bismark() does not need to first parse all files to identify the set of candidate loci.
Provided that rmZeroCov = FALSE, this means that each file is only read once.
This may be a considerable saving when there are a large number of files.
If you are unsure whether the below-described shortcuts apply to your data, leave loci = NULL and let read.bismark() identify the set of candidate loci from files.
You may wish to use the findLoci() function to find all methylation loci of interest in your reference genome (e.g., all CpGs) and then pass the result via the loci argument.
Alternatively, if all files are 'genome wide cytosine reports' for samples aligned to the same reference genome, then all files contain the exact same set of methylation loci.
In this case, you may wish to first construct loci using the internal function bsseq:::.readBismarkAsFWGRanges() applied to a single file, e.g., loci = bsseq:::.readBismarkAsFWGRanges(files[1], rmZeroCov, strandCollapse).
It will generally be faster to parse Bismark files in the 'coverage' output format than those in the 'genome wide cytosine report' format This is because the former only includes loci with non-zero coverage and so the file size is often considerably smaller, particularly for shallowly sequenced samples (e.g., those from single-cell bisulfite sequencing).
rmZeroCov = FALSE
If you set rmZeroCov = TRUE, then read.bismark() must first parse all the files to identify which loci have zero coverage in all samples and then filter these out from the set of candidate loci.
This will happen even if you supply loci with a
GenomicRanges of candidate loci.
Furthermore, any coverage-based filtering of methylation loci is best left until you have constructed your final BSseq object.
In our experience, the final BSseq object is often the product of combining multiple BSseq objects, each constructed with a separate call to read.bismark().
In such cases, it is premature to use rmZeroCov = TRUE when running each read.bismark(); regretably, combining these objects will often then lead to an inefficiently stored BSseq object.
BPPARAM with a moderate number of workers (cores)Each file can be processed on its own, so you can process in parallel as many files as you have workers. However, if using the HDF5Array backend, then writing to the HDF5 file cannot be performed in parallel and so becomes the bottleneck. Nonetheless, by using a moderate number of workers (2 - 10), we can ensure there is processed data available to write to disk as soon as the current write is completed.
BACKEND = "HDF5Array"
By using the HDF5Array realization backend from HDF5Array, we reduce the amount of data that is kept in-memory at any one time. Once each file is parsed, the data are written to the HDF5 file and are no longer needed in-memory. When combined with multiple workers (cores), this means that each file will only need to read and retain in-memory 1 sample's worth of data at a time.
Conversely, if you opt for all data to be in-memory (via
BACKEND = NULL), then each worker will pass each file's data
back to the main process and memory usage will steadily accumulate
to often unreasonable levels.
nThread > 1read.bismark uses data.table::fread to
read each file, which supports threaded-parallisation. Depending on the
system, increasing nThread can achieve near-linear speed-ups in the
number of threads for reading each file. Care needs to be taken when
nThread > 1 is used in conjunction with a parallel backend via
BPPARAM to ensure the system isn't overloaded. For example, using
BPPARAM = MulticoreParam(workers = 10) with nThread = 4 may
use up to 40 workers simultaneously.
The read.bismark() function creates a BSseq
object with two assays, M and Cov. The choice of
realization backend controls whether these assays are stored
in-memory as an ordinary matrix or on-disk as a
HDF5Array, for example. The choice of
realization backend is controlled by the BACKEND argument,
which defaults to the current value of
DelayedArray::getAutoRealizationBackend().
read.bismark() supports the following realization backends:
NULL (in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array (on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix from HDF5Array.
Please note that certain combinations of realization backend and
parallelization backend are currently not supported. For example, the
HDF5Array realization backend is currently only
compatible when used with a single-machine parallelization backend
(i.e. it is not compatible with a SnowParam that
specifies an ad hoc cluster of multiple machines).
BSmooth() will issue an error when given such
incompatible realization and parallelization backends.
Additional arguments related to the realization backend can be passed
via the ... argument. These arguments must be named and are
passed to the relevant RealizationSink constructor. For
example, the ... argument can be used to specify the path to
the HDF5 file to be used by BSmooth(). Please see the examples
at the bottom of the page.
read.bismark() now uses the BiocParallel package to implement
parallelization. This brings some notable improvements:
Imported files can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of
a SnowParam object as the value of
BPPARAM.
Detailed and extensive job logging facilities.
All parallelization options are controlled via the BPPARAM
argument. In general, we recommend that users combine multicore
(single-machine) parallelization with an on-disk realization backend
(see section, 'Realization backend'). For Unix and Mac users, this
means using a MulticoreParam. For Windows
users, this means using a single-machine
SnowParam. Please consult the
BiocParallel documentation to take full advantage of the more
advanced features.
A useful feature of BiocParallel are progress bars to monitor the
status of long-running jobs, such as BSmooth(). Progress bars are
controlled via the progressbar argument in the
BiocParallelParam constructor.
BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.
Peter Hickey [email protected]
collapseBSseq() for collapsing (aggregating) data from sample's with multiple Bismark methylation extractor files (e.g., technical replicates).
# Run read.bismark() on a single sample to construct a matrix-backed BSseq # object. infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz", package = "bsseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, verbose = TRUE) # This is a matrix-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) bsseq ## Not run: # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq # object (with data written to 'test_dir') test_dir <- tempfile("BSseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, BACKEND = "HDF5Array", dir = test_dir, verbose = TRUE) # This is a HDF5Array-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir'). sapply(assays(bsseq, withDimnames = FALSE), path) ## End(Not run)# Run read.bismark() on a single sample to construct a matrix-backed BSseq # object. infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz", package = "bsseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, verbose = TRUE) # This is a matrix-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) bsseq ## Not run: # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq # object (with data written to 'test_dir') test_dir <- tempfile("BSseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, BACKEND = "HDF5Array", dir = test_dir, verbose = TRUE) # This is a HDF5Array-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir'). sapply(assays(bsseq, withDimnames = FALSE), path) ## End(Not run)