Title: | Analyze, manage and store whole-genome methylation data |
---|---|
Description: | A collection of tools for analyzing and visualizing whole-genome methylation data from sequencing. This includes whole-genome bisulfite sequencing and Oxford nanopore data. |
Authors: | Kasper Daniel Hansen [aut, cre] , Peter Hickey [aut] , Hervé Pagès [ctb], Aaron Lun [ctb] |
Maintainer: | Kasper Daniel Hansen <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.43.1 |
Built: | 2024-12-17 03:46:23 UTC |
Source: | https://github.com/bioc/bsseq |
This dataset represents chromosome 22 from the IMR90 cell line sequenced in Lister et al. Only CpG methylation are included (there were very few non-CpG loci). The two samples are two different extractions from the same cell line (ie. technical replicates), and are pooled in the analysis in the original paper.
data(BS.chr22)
data(BS.chr22)
An object of class BSseq
.
All coordinates are in hg18.
Obtained from http://neomorph.salk.edu/human_methylome/data.html
specifically the files mc_imr90_r1.tar.gz and
mc_imr90_r2.tar.gz. A script which downloads these files and
constructs the BS.chr22
object may be found in
‘inst/scripts/get_BS.chr22.R’, see the example.
R Lister et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature (2009) 462, 315-322.
data(BS.chr22) BS.chr22 script <- system.file("scripts", "get_BS.chr22.R", package = "bsseq") script readLines(script)
data(BS.chr22) BS.chr22 script <- system.file("scripts", "get_BS.chr22.R", package = "bsseq") script readLines(script)
This implements the BSmooth algorithm for estimating methylation levels from bisulfite sequencing data.
BSmooth(BSseq, ns = 70, h = 1000, maxGap = 10^8, keep.se = FALSE, BPPARAM = bpparam(), chunkdim = NULL, level = NULL, verbose = getOption("verbose"))
BSmooth(BSseq, ns = 70, h = 1000, maxGap = 10^8, keep.se = FALSE, BPPARAM = bpparam(), chunkdim = NULL, level = NULL, verbose = getOption("verbose"))
BSseq |
An object of class |
ns |
The minimum number of methylation loci in a smoothing window. |
h |
The minimum smoothing window, in bases. |
maxGap |
The maximum gap between two methylation loci, before the smoothing is broken across the gap. The default smoothes each chromosome separately. |
keep.se |
Should the estimated standard errors from the smoothing algorithm be kept. This will make the return object roughly 30 percent bigger and is currently not be used for anything in bsseq. |
BPPARAM |
An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. Currently supported are SerialParam (Unix, Mac, Windows), MulticoreParam (Unix and Mac), SnowParam (Unix, Mac, and Windows, limited to single-machine clusters), and BatchtoolsParam (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details. |
chunkdim |
Only applicable if |
level |
Only applicable if |
verbose |
A |
ns
and h
are passed to the locfit
function. The
bandwidth used is the maximum (in genomic distance) of the h
and a width big enough to contain ns
number of methylation
loci.
An object of class BSseq
, containing coefficients used to fit smoothed
methylation values and optionally standard errors for these.
The BSmooth()
function creates a new assay to store the coefficients
used to construct the smoothed methylation estimates ((coef
). An
additional assay is also created if keep.se == TRUE
(se.coef
).
The choice of realization backend controls whether these assay(s) are stored in-memory as an ordinary matrix or on-disk as a HDF5Array, for example.
The choice of realization backend is controlled by the BACKEND
argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend()
.
BSmooth
supports the following realization backends:
NULL
(in-memory): This stores each new assay in-memory using
an ordinary matrix.
HDF5Array
(on-disk): This stores each new assay on-disk in a
HDF5 file using an HDF5Matrix from HDF5Array.
Please note that certain combinations of realization backend and
parallelization backend are currently not supported. For example, the
HDF5Array realization backend is currently only compatible when
used with a single-machine parallelization backend (i.e. it is not compatible
with a SnowParam that specifies an ad hoc cluster of
multiple machines). BSmooth()
will issue an error when given
such incompatible realization and parallelization backends. Furthermore, to
avoid memory usage blow-ups, BSmooth()
will issue an error if an
in-memory realization backend is used when smoothing a disk-backed
BSseq object.
Additional arguments related to the realization backend can be passed via the
...
argument. These arguments must be named and are passed to the
relevant RealizationSink constructor. For example, the
...
argument can be used to specify the path to the HDF5 file to be
used by BSmooth()
. Please see the examples at the bottom of the page.
BSmooth()
now uses the BiocParallel package to implement
parallelization. This brings some notable improvements:
Smoothed results can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a
SnowParam object as the value of BPPARAM
.
Detailed and extensive job logging facilities.
All parallelization options are controlled via the BPPARAM
argument.
In general, we recommend that users combine multicore (single-machine)
parallelization with an on-disk realization backend (see section,
'Realization backend'). For Unix and Mac users, this means using
a MulticoreParam. For Windows users, this means using a
single-machine SnowParam. Please consult the BiocParallel
documentation to take full advantage of the more advanced features.
parallelBy
, mc.cores
, and mc.preschedule
are
deprecated and will be removed in subsequent releases of bsseq. These
arguments were necessary when BSmooth()
used the parallel
package to implement parallelization, but this functionality is superseded
by the aforementioned use of BiocParallel. We recommend that users
who previously relied on these arguments switch to
BPPARAM = MulticoreParam(workers = mc.cores, progressbar = TRUE)
.
A useful feature of BiocParallel are progress bars to monitor the
status of long-running jobs, such as BSmooth()
. Progress bars are
controlled via the progressbar
argument in the
BiocParallelParam constructor. Progress bars replace the
use of the deprecated verbose
argument to print out information on
the status of BSmooth()
.
BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.
Method and original implementation by Kasper Daniel Hansen [email protected]. Updated implementation to support disk-backed BSseq objects and more general parallelization by Peter Francis Hickey.
KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.
locfit
in the locfit package, as well as
BSseq
.
## Not run: # Run BSmooth() on a matrix-backed BSseq object using an in-memory realization # backend with serial evaluation. data(BS.chr22) # This is a matrix-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = SerialParam(progressbar = TRUE)) # The new 'coef' assay is an ordinary matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # Run BSmooth() on a disk-backed BSseq object using the HDF5Array realization # backend (with data written to the file 'BSmooth_example.h5') with # multi-core parallel evaluation. BS.chr22 <- realize(BS.chr22, "HDF5Array") # This is a disk-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = MulticoreParam(workers = 2, progressbar = TRUE), BACKEND = "HDF5Array", filepath = "BSmooth_example.h5") # The new 'coef' assay is an HDF5Matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # The new 'coef' assay is in the HDF5 file 'BSmooth_example.h5' (in the # current working directory). sapply(assays(BS.fit, withDimnames = FALSE), path) ## End(Not run)
## Not run: # Run BSmooth() on a matrix-backed BSseq object using an in-memory realization # backend with serial evaluation. data(BS.chr22) # This is a matrix-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = SerialParam(progressbar = TRUE)) # The new 'coef' assay is an ordinary matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # Run BSmooth() on a disk-backed BSseq object using the HDF5Array realization # backend (with data written to the file 'BSmooth_example.h5') with # multi-core parallel evaluation. BS.chr22 <- realize(BS.chr22, "HDF5Array") # This is a disk-backed BSseq object. sapply(assays(BS.chr22, withDimnames = FALSE), class) BS.fit <- BSmooth(BS.chr22, BPPARAM = MulticoreParam(workers = 2, progressbar = TRUE), BACKEND = "HDF5Array", filepath = "BSmooth_example.h5") # The new 'coef' assay is an HDF5Matrix. sapply(assays(BS.fit, withDimnames = FALSE), class) BS.fit # The new 'coef' assay is in the HDF5 file 'BSmooth_example.h5' (in the # current working directory). sapply(assays(BS.fit, withDimnames = FALSE), path) ## End(Not run)
Compute t-statistics based on smoothed whole-genome bisulfite sequencing data.
BSmooth.tstat(BSseq, group1, group2, estimate.var = c("same", "paired", "group2"), local.correct = TRUE, maxGap = NULL, qSd = 0.75, k = 21, mc.cores = 1, verbose = TRUE)
BSmooth.tstat(BSseq, group1, group2, estimate.var = c("same", "paired", "group2"), local.correct = TRUE, maxGap = NULL, qSd = 0.75, k = 21, mc.cores = 1, verbose = TRUE)
BSseq |
An object of class |
group1 |
A vector of sample names or indexes for the ‘treatment’ group. |
group2 |
A vector of sample names or indexes for the ‘control’ group. |
estimate.var |
How is the variance estimated, see details. |
local.correct |
A logical; should local correction be used, see details. |
maxGap |
A scalar greater than 0, see details. |
qSd |
A scalar between 0 and 1, see details. |
k |
A positive scalar, see details. |
mc.cores |
The number of cores used. Note that setting
|
verbose |
Should the function be verbose? |
T-statistics are formed as the difference in means between group 1 and
group 2 divided by an estimate of the standard deviation, assuming
that the variance in the two groups are the same (same
), that
we have paired samples (paired
) or only estimate the variance
based on group 2 (group2
). The standard deviation estimates
are then smoothed (using a running mean with a width of k
) and
thresholded (using qSd
which sets the minimum standard
deviation to be the qSd
-quantile). Optionally, the
t-statistics are corrected for low-frequency patterns.
It is sometimes useful to use local.correct
even if no large
scale changes in methylation have been found; it makes the marginal
distribution of the t-statistics more symmetric.
Additional details in the reference.
An object of class BSseqTstat
.
Kasper Daniel Hansen [email protected]
KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.
BSmooth
for the input object and
BSseq
for its class.
BSseqTstat
describes the return class. This
function is likely to be followed by the use of
dmrFinder
. And finally, see the package vignette(s) for
more information on how to use it.
if(require(bsseqData)) { data(keepLoci.ex) data(BS.cancer.ex.fit) BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit) ## Remember to subset the BSseq object, see vignette for explanation BS.tstat <- BSmooth.tstat(BS.cancer.ex.fit[keepLoci.ex,], group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") BS.tstat ## This object is also stored as BS.cancer.ex.tstat in the ## bsseqData package #--------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex.fit <- saveHDF5SummarizedExperiment( x = BS.cancer.ex.fit[keepLoci.ex, ], dir = tempfile()) hdf5_BS.tstat <- BSmooth.tstat(hdf5_BS.cancer.ex.fit, group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") }
if(require(bsseqData)) { data(keepLoci.ex) data(BS.cancer.ex.fit) BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit) ## Remember to subset the BSseq object, see vignette for explanation BS.tstat <- BSmooth.tstat(BS.cancer.ex.fit[keepLoci.ex,], group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") BS.tstat ## This object is also stored as BS.cancer.ex.tstat in the ## bsseqData package #--------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex.fit <- saveHDF5SummarizedExperiment( x = BS.cancer.ex.fit[keepLoci.ex, ], dir = tempfile()) hdf5_BS.tstat <- BSmooth.tstat(hdf5_BS.cancer.ex.fit, group1 = c("C1", "C2", "C3"), group2 = c("N1", "N2", "N3"), estimate.var = "group2") }
The constructor function for BSseq objects.
BSseq(M = NULL, Cov = NULL, Filtered = NULL, coef = NULL, se.coef = NULL, trans = NULL, parameters = NULL, pData = NULL, gr = NULL, pos = NULL, chr = NULL, sampleNames = NULL, rmZeroCov = FALSE)
BSseq(M = NULL, Cov = NULL, Filtered = NULL, coef = NULL, se.coef = NULL, trans = NULL, parameters = NULL, pData = NULL, gr = NULL, pos = NULL, chr = NULL, sampleNames = NULL, rmZeroCov = FALSE)
M |
A matrix-like object of methylation evidence (see 'Details' below). |
Cov |
A matrix-like object of coverage (see 'Details' below)). |
Filtered |
A matrix-like object of ambiguous modification bases obtained from modbam2bed. |
coef |
A matrix-like object of smoothing estimates (see 'Details' below). |
se.coef |
A matrix-like object of smoothing standard errors (see 'Details' below). |
trans |
A smoothing transformation. |
parameters |
A list of smoothing parameters. |
pData |
An |
sampleNames |
A vector of sample names. |
gr |
An object of type GRanges. |
pos |
A vector of locations. |
chr |
A vector of chromosomes. |
rmZeroCov |
Should genomic locations with zero coverage in all samples be removed. |
The 'M', 'Cov', 'coef', and 'se.coef' matrix-like objects will be coerced to
DelayedMatrix objects; see
?DelayedArray::DelayedMatrix
for the full list of
supported matrix-like objects. We recommend using matrix objects
for in-memory storage of data and HDF5Matrix for on-disk
storage of data.
Genomic locations are specified either through gr
or through
chr
and pos
but not both. There should be the same
number of genomic locations as there are rows in the M
and
Cov
matrix.
The argument rmZeroCov
may be useful in order to reduce the
size of the return object be removing methylation loci with zero
coverage.
In case one or more methylation loci appears multiple times, the
M
and Cov
matrices are summed over rows linked to the
same methylation loci. See the example below.
Users should never have to specify coef
, se.coef
,
trans
, and parameters
, this is for internal use (they
are added by BSmooth
).
phenoData
is a way to specify pheno data (as known from the
ExpressionSet
and eSet
classes), at a minimum
sampleNames
should be given (if they are not present, the
function uses col.names(M)
).
An object of class BSseq
.
Kasper Daniel Hansen [email protected]
M <- matrix(0:8, 3, 3) Cov <- matrix(1:9, 3, 3) BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1,2,3), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS1 BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1,1,1), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS2 #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) hdf5_M <- realize(M, "HDF5Array") hdf5_Cov <- realize(Cov, "HDF5Array") hdf5_BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1, 2, 3), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS1 hdf5_BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1, 1, 1), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS2
M <- matrix(0:8, 3, 3) Cov <- matrix(1:9, 3, 3) BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1,2,3), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS1 BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1,1,1), M = M, Cov = Cov, sampleNames = c("A","B", "C")) BS2 #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) hdf5_M <- realize(M, "HDF5Array") hdf5_Cov <- realize(Cov, "HDF5Array") hdf5_BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1, 2, 3), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS1 hdf5_BS2 <- BSseq(chr = c("chr1", "chr1", "chr1"), pos = c(1, 1, 1), M = hdf5_M, Cov = hdf5_Cov, sampleNames = c("A", "B", "C")) hdf5_BS2
A class for representing whole-genome or capture bisulfite sequencing data.
An object from the class links together several pieces of information.
(1) genomic locations stored as a GRanges
object, a location by
samples matrix of M values, a location by samples matrix of Cov
(coverage) values, a location by samples matrix of Filtered (ambiguous modification status) values, and phenodata information.In addition, there are
slots for representing smoothed data. This class is an extension of
RangedSummarizedExperiment.
trans
:Object of class function
. This
function transforms the coef
slot from the scale the
smoothing was done to the 0-1 methylation scale.
parameters
:Object of class list
. A list of
parameters representing for example how the data was smoothed.
signature(x = "BSseq")
: Subsetting by location
(using integer indices) or sample (using integers or sample
names).
Unlike for RangedSummarizedExperiment
,
length()
is the number of methylation loci (equal to
length(granges(x))
).
Sample names and its replacement
function for the object. This is an alias for colnames
.
Obtain and replace the pData
slot of the
phenoData
slot. This is an alias for colData
.
The show method.
This function combines two BSSeq
objects.
The genomic locations of the new object is the union of the
genomic locations of the individual objects. In addition, the
methylation data matrices are placed next to each other (as
appropriate wrt. the new genomic locations) and zeros are entered
into the matrices as needed.
This class extends RangedSummarizedExperiment and therefore
inherits a number of useful GRanges
methods that operate on the
rowRanges
slot, used for accessing and setting the genomic locations
and also do subsetByOverlaps
.
There are a number of almost methods-like functions for operating on
objects of class BSseq
, including getBSseq
,
collapseBSseq
, and orderBSseq
. They are detailed below.
collapseBSseq(BSseq, columns)
is used to collapse an object of class BSseq
. By
collapsing we simply mean that certain columns (samples) are merge
together by summing up the methylation evidence and coverage.
This is a useful function if you start by reading in a dataset
based on say flowcells and you (after QC) want to simply add a
number of flowcells into one sample. The argument columns
specify which samples are to be merged, in the following way: it
is a character vector of new sample names, and the names of the
column vector indicates which samples in the BSseq
object
are to be collapsed. If columns
have the same length as
the number of rows of BSseq
(and has no names) it is
assumed that the ordering corresponds to the sample ordering in
BSseq
.
orderBSseq(BSseq, seqOrder = NULL)
simply orders an object of class BSseq
according to
(increasing) genomic locations. The seqOrder
vector is a
character vector of seqnames(BSseq)
describing the order of
the chromosomes. This is useful for ordering chr1
before
chr10
.
chrSelectBSseq(BSseq, seqnames = NULL, order = FALSE)
subsets and optionally reorders an object of class BSseq
.
The seqnames
vector is a character vector of
seqnames(BSseq)
describing which chromosomes should be
retained. If order
is TRUE
, the chromosomes are
also re-ordered using orderBSseq
.
getBSseq(BSseq, type = c("Cov", "M", "gr", "coef",
"se.coef", "trans", "parameters"))
is a general accessor: is used to obtain a specific slot of an
object of class BSseq
. It is primarily intended for
internal use in the package, for users we recommend granges
to get the genomic locations, getCoverage
to get the
coverage slots and getMeth
to get the smoothed values (if
they exist).
hasBeenSmoothed(BSseq)
This function returns a logical depending on whether or not the
BSseq
object has been smoothed using BSmooth
.
combineList(list, BACKEND = NULL)
This function function is a faster way of using combine
on
multiple BSseq objects. The input is a list, with each
component an object of class BSseq. The (slower)
alternative is to use Reduce(combine, list)
.
The BACKEND
argument determines which backend should be used for
the 'M' and 'Cov' matrices and, if present, the 'coef' and 'se.coef'
matrices (the latter two can only be combined if all objects have the
same rowRanges). The default, BACKEND = NULL
, corresponds to using
matrix objects. See
?DelayedArray::setAutoRealizationBackend
for alternative
backends.
strandCollapse(BSseq, shift = TRUE)
This function operates on a BSseq
objects which has
stranded loci (i.e. loci where the strand is one of ‘+’ or
‘-’). It will collapse the methylation and coverage
information across the two strands, unstranding the loci in the process
and potentially re-ordering them.
The argument shift
indicates whether the positions for the loci
on the reverse strand should be shifted one (i.e. the positions for
these loci are the positions of the ‘G’ in the
‘CpG’; this is the case for Bismark output for example).
Package versions 1.5.2 and 1.11.1 introduced a new version of representing
‘BSseq’ objects. You can update old serialized (saved)
objects by invoking x <- updateObject(x)
.
This class overrides the default implementation of assays
to
make it faster. Per default, no names are added to the returned data
matrices.
Assay names can conveniently be obtained by the function
assayNames(x)
Kasper Daniel Hansen [email protected]
The package vignette. BSseq
for the constructor
function. RangedSummarizedExperiment
for the underlying class. getBSseq
,
getCoverage
, and getMeth
for accessing the
data stored in the object and finally BSmooth
for
smoothing the bisulfite sequence data.
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) chrSelectBSseq(BStest, seqnames = "chr1", order = TRUE) collapseBSseq(BStest, group = c("A", "A", "B")) #------------------------------------------------------------------------------- # An example using a HDF5-backed BSseq object # hdf5_BStest <- realize(BStest, "HDF5Array") chrSelectBSseq(hdf5_BStest, seqnames = "chr1", order = TRUE) collapseBSseq( BSseq = hdf5_BStest, group = c("A", "A", "B"), BACKEND = "HDF5Array", type = "integer")
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) chrSelectBSseq(BStest, seqnames = "chr1", order = TRUE) collapseBSseq(BStest, group = c("A", "A", "B")) #------------------------------------------------------------------------------- # An example using a HDF5-backed BSseq object # hdf5_BStest <- realize(BStest, "HDF5Array") chrSelectBSseq(hdf5_BStest, seqnames = "chr1", order = TRUE) collapseBSseq( BSseq = hdf5_BStest, group = c("A", "A", "B"), BACKEND = "HDF5Array", type = "integer")
A class for representing statistics for smoothed whole-genome bisulfite sequencing data.
BSseqStat(gr = NULL, stats = NULL, parameters = NULL)
BSseqStat(gr = NULL, stats = NULL, parameters = NULL)
gr |
The genomic locations as an object of class |
stats |
The statistics, as a list of matrix-like objects (see 'Details' below). |
parameters |
A list of parameters. |
The matrix-like elements of the list in the 'stats' slot will be coerced to
DelayedMatrix objects; see
?DelayedArray::DelayedMatrix
for the full list of
supported matrix-like objects. We recommend using matrix objects
for in-memory storage of data and HDF5Matrix for on-disk
storage of data.
Objects can be created by calls of the form BSseqStat(...)
.
However, usually objects are returned by BSmooth.fstat(...)
and
not constructed by the user.
stats
:This is a list of DelayedMatrix objects with list elements representing various statistics for methylation loci along the genome.
parameters
:Object of class list
. A list of
parameters representing how the statistics were computed.
gr
:Object of class GRanges
giving genomic
locations.
Class hasGRanges
, directly.
The subsetting operator; one may only subset in one dimension, corresponding to methylation loci.
The show method.
This class extends hasGRanges
and therefore inherits a number
of useful GRanges
methods that operate on the gr
slot,
used for accessing and setting the genomic locations and also do
subsetByOverlaps
.
Package version 1.11.1 introduced a new version of representing
‘BSseqStat’ objects. You can update old serialized (saved)
objects by invoking x <- updateObject(x)
.
Kasper Daniel Hansen [email protected]
hasGRanges
for accessing the genomic locations.
BSmooth.fstat
for a function
that returns objects of class BSseqStat
, and smoothSds
,
computeStat
and dmrFinder
for functions that operate based on these statistics. Also see
the more specialised BSseqTstat
.
A class for representing t-statistics for smoothed whole-genome bisulfite sequencing data.
BSseqTstat(gr = NULL, stats = NULL, parameters = NULL)
BSseqTstat(gr = NULL, stats = NULL, parameters = NULL)
gr |
The genomic locations as an object of class |
stats |
The statistics, as a matrix-like object (see 'Details' below). |
parameters |
A list of parameters. |
The 'stats' matrix-like object will be coerced to a
DelayedMatrix object; see
?DelayedArray::DelayedMatrix
for the full list of
supported matrix-like objects. We recommend using matrix objects
for in-memory storage of data and HDF5Matrix for on-disk
storage of data.
Objects can be created by calls of the form BSseqTstat(...)
.
However, usually objects are returned by BSmooth.tstat(...)
and
not constructed by the user..
stats
:This is a DelayedMatrix object with columns representing various statistics for methylation loci along the genome.
parameters
:Object of class list
. A list of
parameters representing how the t-statistics were computed.
gr
:Object of class GRanges
giving genomic
locations.
Class hasGRanges
, directly.
The subsetting operator; one may only subset in one dimension, corresponding to methylation loci.
The show method.
This class extends hasGRanges
and therefore inherits a number
of useful GRanges
methods that operate on the gr
slot,
used for accessing and setting the genomic locations and also do
subsetByOverlaps
.
Package version 1.11.1 introduced a new version of representing
‘BSseqTstat’ objects. You can update old serialized (saved)
objects by invoking x <- updateObject(x)
.
Kasper Daniel Hansen [email protected]
The package vignette(s). hasGRanges
for accessing
the genomic locations. BSmooth.tstat
for a function
that returns objects of class BSseqTstat
, and dmrFinder
for a function that computes DMRs based on the t-statistics. Also see
BS.cancer.ex.tstat
for an example of the
class in the bsseqData package.
if(require(bsseqData)) { data(BS.cancer.ex.tstat) dmrFinder(BS.cancer.ex.tstat) }
if(require(bsseqData)) { data(BS.cancer.ex.tstat) dmrFinder(BS.cancer.ex.tstat) }
Converting a data.frame to a GRanges object. The data.frame needs columns like chr, start and end (strand is optional). Additional columns may be kept in the GRanges object.
data.frame2GRanges(df, keepColumns = FALSE, ignoreStrand = FALSE)
data.frame2GRanges(df, keepColumns = FALSE, ignoreStrand = FALSE)
df |
A |
keepColumns |
In case |
ignoreStrand |
In case |
An object of class GRanges
In case df
has rownames
, they will be used as
names
for the return object.
Kasper Daniel Hansen [email protected]
df <- data.frame(chr = "chr1", start = 1:3, end = 2:4, strand = c("+","-","+")) data.frame2GRanges(df, ignoreStrand = TRUE)
df <- data.frame(chr = "chr1", start = 1:3, end = 2:4, strand = c("+","-","+")) data.frame2GRanges(df, ignoreStrand = TRUE)
Finds differentially methylated regions for whole genome bisulfite sequencing data. Essentially identifies regions of the genome where all methylation loci have an associated t-statistic that is beyond a (low, high) cutoff.
dmrFinder(bstat, cutoff = NULL, qcutoff = c(0.025, 0.975), maxGap=300, stat = "tstat.corrected", verbose = TRUE)
dmrFinder(bstat, cutoff = NULL, qcutoff = c(0.025, 0.975), maxGap=300, stat = "tstat.corrected", verbose = TRUE)
bstat |
An object of class |
cutoff |
The cutoff of the t-statistics. This should be a vector
of length two giving the (low, high) cutoff. If |
qcutoff |
In case |
maxGap |
If two methylation loci are separated by this distance, break a possible DMR. This guarantees that the return DMRs have CpGs that are this distance from each other. |
stat |
Which statistic should be used? |
verbose |
Should the function be verbose? |
The workhorse function is BSmooth.tstat
which sets up a
t-statistic for a comparison between two groups.
Note that post-processing of these DMRs are likely to be necessary, filtering for example for length (or number of loci).
A data.frame
with columns
start , end , width , chr
|
genomic locations and width. |
n |
The number of methylation loci. |
invdensity |
Average length per loci. |
group1.mean |
The mean methylation level across samples and loci in 'group1'. |
group2.mean |
The mean methylation level across samples and loci in 'group2'. |
meanDiff |
The mean difference in methylation level; the
difference between |
idxStart , idxEnd , cluster
|
Internal use. |
areaStat |
The 'area' of the t-statistic; equal to the sum of the t-statistics for the individual methylation loci. |
direction |
either ‘hyper’ or ‘hypo’. |
areaStat.corrected |
Only present if |
Kasper Daniel Hansen [email protected].
KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.
BSmooth.tstat
for the function constructing the input
object, and BSseqTstat
for its class. In the
example below, we use BS.cancer.ex.tstat
as
the actual input object. Also see the package vignette(s) for a
detailed example.
if(require(bsseqData)) { dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6), verbose = TRUE) dmrs <- subset(dmrs0, abs(meanDiff) > 0.1 & n >= 3) }
if(require(bsseqData)) { dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6), verbose = TRUE) dmrs <- subset(dmrs0, abs(meanDiff) > 0.1 & n >= 3) }
This is a convenience function to find methylation loci, such as CpGs, in a reference genome.
The result is useful as the value of the loci
argument for read.bismark()
.
findLoci(pattern, subject, include = seqlevels(subject), strand = c("*", "+", "-"), fixed = "subject", resize = TRUE)
findLoci(pattern, subject, include = seqlevels(subject), strand = c("*", "+", "-"), fixed = "subject", resize = TRUE)
pattern |
A string specifying the pattern to search for, e.g. |
subject |
A string containing a file path to the genome sequence, in FASTA or 2bit format, to be searched. Alternatively, a BSgenome or DNAStringSet object storing the genome sequence to be searched. |
include |
A character vector indicating the seqlevels of |
strand |
A character scaler specifying the strand of |
fixed |
If |
resize |
A logical scalar specifying whether the ranges should be shifted to have width 1 and anchored by the start of the locus, e.g., resize a CpG dinucleotide to give the co-ordinates of the cytosine. |
This function provides a convenience wrapper for finding methylation loci in
a genome, based on running
vmatchPattern()
.
Users requiring finer-grained control should directly use the
vmatchPattern()
function and coerce
the result to a GRanges object.
A GRanges object storing the found loci.
Peter F. Hickey
Biostrings::vmatchPattern()
?BSgenome::`BSgenome-utils`
library(Biostrings) # Find CpG dinucleotides on the both strands of an artificial sequence my_seq <- DNAStringSet("ATCAGTCGC") names(my_seq) <- "test" findLoci(pattern = "CG", subject = my_seq) # Find CHG trinucleotides on the both strands of an artificial sequence findLoci(pattern = "CHG", subject = my_seq) # Find CpG dinucleotides on the + strand of chr17 from the human genome (hg38) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { findLoci( pattern = "CG", subject = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, include = "chr17", strand = "+") }
library(Biostrings) # Find CpG dinucleotides on the both strands of an artificial sequence my_seq <- DNAStringSet("ATCAGTCGC") names(my_seq) <- "test" findLoci(pattern = "CG", subject = my_seq) # Find CHG trinucleotides on the both strands of an artificial sequence findLoci(pattern = "CHG", subject = my_seq) # Find CpG dinucleotides on the + strand of chr17 from the human genome (hg38) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { findLoci( pattern = "CG", subject = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, include = "chr17", strand = "+") }
A function to compute Fisher-tests for an object of class BSseq
.
fisherTests(BSseq, group1, group2, lookup = NULL, returnLookup = TRUE, mc.cores = 1, verbose = TRUE)
fisherTests(BSseq, group1, group2, lookup = NULL, returnLookup = TRUE, mc.cores = 1, verbose = TRUE)
BSseq |
An object of class |
group1 |
A vector of sample names or indexes for the ‘treatment’ group. |
group2 |
A vector of sample names or indexes for the ‘control’ group. |
lookup |
A ‘lookup’ object, see details. |
returnLookup |
Should a ‘lookup’ object be returned, see details. |
mc.cores |
The number of cores used. Note that setting
|
verbose |
Should the function be verbose. |
This function computes row-wise Fisher's exact tests. It uses an
internal lookup table so rows which forms equivalent 2x2 tables are
group together and only a single test is computed. If
returnLookup
is TRUE
the return object contains the
lookup table which may be feed to another call to the function using
the lookup
argument.
If group1
, group2
designates more than 1 sample, the
samples are added together before testing.
This function can use multiple cores on the same computer.
This test cannot model biological variability.
if returnLookup
is TRUE
, a list with components
results
and lookup
, otherwise just the results
component. The results
(component) is a matrix with the same
number of rows as the BSseq
argument and 2 columns
p.value
(the unadjusted p-values) and log2OR
(log2
transformation of the odds ratio).
Kasper Daniel Hansen [email protected]
fisher.test for information about Fisher's test.
mclapply
for the mc.cores
argument.
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) results <- fisherTests(BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) results #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BStest <- saveHDF5SummarizedExperiment(x = BStest, dir = tempfile()) results <- fisherTests(hdf5_BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) results
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) results <- fisherTests(BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) results #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BStest <- saveHDF5SummarizedExperiment(x = BStest, dir = tempfile()) results <- fisherTests(hdf5_BStest, group1 = "A1", group2 = "A2", returnLookup = TRUE) results
Classes for fixed-width IRanges and GRanges, ie. objects where all ranges have the same width. The intention is for these classes to be added to GenomicRanges. Documented here temporarily.
See description. Otherwise works like IRanges and GRanges, except there are many unimplemented methods.
This is really a private class, with private methods, but R's NAMESPACE handling means they get unintentionally exported. Hence this documentation.
showClass("FWIRanges")
showClass("FWIRanges")
Obtain coverage for BSseq objects.
getCoverage(BSseq, regions = NULL, type = c("Cov", "M"), what = c("perBase", "perRegionAverage", "perRegionTotal"), withDimnames = TRUE)
getCoverage(BSseq, regions = NULL, type = c("Cov", "M"), what = c("perBase", "perRegionAverage", "perRegionTotal"), withDimnames = TRUE)
BSseq |
An object of class |
regions |
An optional |
type |
This returns either coverage or the total evidence for methylation at the loci. |
what |
The type of return object, see details. |
withDimnames |
A |
NOTE: The return type of getCoverage
varies depending on its
arguments.
If regions
are not specified (regions = NULL
) a
DelayedMatrix object (what = "perBase"
) is returned.
This will either contain the per-base coverage, the average coverage, or the
genome total coverage (depending on value of what
).
If what = "perBase"
and regions
are specified, a list is
returned. Each element of the list is a DelayedMatrix object
corresponding to the genomic loci inside the region. It is conceptually the
same as splitting the coverage by region.
If what = "perRegionAverage"
or what = "perRegionTotal"
and regions
are specified the return value is a
DelayedMatrix object. Each row of the
DelayedMatrix corresponds to a region and contains either the
average coverage or the total coverage in the region.
Kasper Daniel Hansen [email protected].
BSseq
for the BSseq
class.
data(BS.chr22) head(getCoverage(BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) getCoverage(BS.chr22, regions = reg, what = "perRegionAverage") cList <- getCoverage(BS.chr22, regions = reg) length(cList) head(cList[[1]]) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getCoverage(hdf5_BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2 * 10 ^ 7), end = c(2 * 10 ^ 7 + 1, 4 * 10 ^ 7))) getCoverage(hdf5_BS.chr22, regions = reg, what = "perRegionAverage") hdf5_cList <- getCoverage(hdf5_BS.chr22, regions = reg) length(hdf5_cList) head(hdf5_cList[[1]])
data(BS.chr22) head(getCoverage(BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) getCoverage(BS.chr22, regions = reg, what = "perRegionAverage") cList <- getCoverage(BS.chr22, regions = reg) length(cList) head(cList[[1]]) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getCoverage(hdf5_BS.chr22, type = "M")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2 * 10 ^ 7), end = c(2 * 10 ^ 7 + 1, 4 * 10 ^ 7))) getCoverage(hdf5_BS.chr22, regions = reg, what = "perRegionAverage") hdf5_cList <- getCoverage(hdf5_BS.chr22, regions = reg) length(hdf5_cList) head(hdf5_cList[[1]])
Obtain methylation estimates for BSseq objects, both smoothed and raw.
getMeth(BSseq, regions = NULL, type = c("smooth", "raw"), what = c("perBase", "perRegion"), confint = FALSE, alpha = 0.95, withDimnames = TRUE)
getMeth(BSseq, regions = NULL, type = c("smooth", "raw"), what = c("perBase", "perRegion"), confint = FALSE, alpha = 0.95, withDimnames = TRUE)
BSseq |
An object of class |
regions |
An optional |
type |
This returns either smoothed or raw estimates of the methylation level. |
what |
The type of return object, see details. |
confint |
Should a confidence interval be return for the
methylation estimates (see below). This is only supported if
|
alpha |
alpha value for the confidence interval. |
withDimnames |
A |
NOTE: The return type of getMeth
varies depending on its
arguments.
If region = NULL
the what
argument is ignored. This is
also the only situation in which confint = TRUE
is supported.
The return value is either a DelayedMatrix
(confint = FALSE
or a list with three DelayedMatrix
components confint = TRUE
(meth
, upper
and
lower
), giving the methylation estimates and (optionally) confidence
intervals.
Confidence intervals for type = "smooth"
is based on standard
errors from the smoothing algorithm (if present). Otherwise it is
based on pointwise confidence intervals for binomial distributions
described in Agresti (see below), specifically the score confidence
interval.
If regions
are specified, what = "perBase"
will make the
function return a list, each element of the list being a
DelayedMatrix corresponding to a genomic region (and each row
of the DelayedMatrix being a loci inside the region). If
what = "perRegion"
the function returns a DelayedMatrix,
with each row corresponding to a region and containing the average
methylation level in that region.
A BSseq
object needs to be smoothed by the function
BSmooth
in order to support type = "smooth"
.
Kasper Daniel Hansen [email protected].
A Agresti and B Coull. Approximate Is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician (1998) 52:119-126.
BSseq
for the BSseq
class and
BSmooth
for smoothing such an object.
data(BS.chr22) head(getMeth(BS.chr22, type = "raw")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getMeth(BS.chr22, regions = reg, type = "raw", what = "perBase")) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getMeth(hdf5_BS.chr22, type = "raw")) head(getMeth(hdf5_BS.chr22, regions = reg, type = "raw", what = "perBase"))
data(BS.chr22) head(getMeth(BS.chr22, type = "raw")) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getMeth(BS.chr22, regions = reg, type = "raw", what = "perBase")) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.chr22 <- saveHDF5SummarizedExperiment(x = BS.chr22, dir = tempfile()) head(getMeth(hdf5_BS.chr22, type = "raw")) head(getMeth(hdf5_BS.chr22, regions = reg, type = "raw", what = "perBase"))
Essentially an accessor function for the statistics of a BSseqTstat
object.
getStats(bstat, regions = NULL, ...)
getStats(bstat, regions = NULL, ...)
bstat |
An object of class |
regions |
An optional |
... |
Additional arguments passed to the different backends based
on the class of |
Additional argument when the bstat
object is of class BSseqTstat
:
Which statistics column should be obtained.
An object of class data.frame
possible restricted to the
regions specified.
Kasper Daniel Hansen [email protected]
BSseqTstat
for the BSseqTstat
class, and
getCoverage
and getMeth
for similar
functions, operating on objects of class BSseq
.
if(require(bsseqData)) { data(BS.cancer.ex.tstat) head(getStats(BS.cancer.ex.tstat)) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getStats(BS.cancer.ex.tstat, regions = reg)) }
if(require(bsseqData)) { data(BS.cancer.ex.tstat) head(getStats(BS.cancer.ex.tstat)) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7))) head(getStats(BS.cancer.ex.tstat, regions = reg)) }
Binomial and poisson goodness of fit statistics for BSSeq objects, including plotting capability.
poissonGoodnessOfFit(BSseq, nQuantiles = 10^5) binomialGoodnessOfFit(BSseq, method = c("MLE"), nQuantiles = 10^5) ## S3 method for class 'chisqGoodnessOfFit' print(x, ...) ## S3 method for class 'chisqGoodnessOfFit' plot(x, type = c("chisq", "pvalue"), plotCol = TRUE, qqline = TRUE, pch = 16, cex = 0.75, ...)
poissonGoodnessOfFit(BSseq, nQuantiles = 10^5) binomialGoodnessOfFit(BSseq, method = c("MLE"), nQuantiles = 10^5) ## S3 method for class 'chisqGoodnessOfFit' print(x, ...) ## S3 method for class 'chisqGoodnessOfFit' plot(x, type = c("chisq", "pvalue"), plotCol = TRUE, qqline = TRUE, pch = 16, cex = 0.75, ...)
BSseq |
An object of class |
x |
A chisqGoodnessOfFit object (as produced by
|
nQuantiles |
The number of (evenly-spaced) quantiles stored in the return object. |
method |
How is the parameter estimated. |
type |
Are the chisq or the p-values being plotted. |
plotCol |
Should the extreme quantiles be colored. |
qqline |
Add a |
pch , cex
|
Plotting symbols and size. |
... |
Additional arguments being passed to |
These functions compute and plot goodness of fit statistics for
BSseq
objects. For each methylation loci, the Poisson
goodness of fit statistic tests whether the coverage (at that loci) is
independent and identically Poisson distributed across the samples.
In a similar fashion, the Binomial goodness of fit statistic tests
whether the number of reads supporting methylation are independent and
identically binomial distributed across samples (with different size
parameters given by the coverage vector).
These functions do not handle NA
values.
The plotting method is invoked for its side effect. Both
poissonGoodnessOfFit
and binomialGoodnessOfFit
returns
an object of class chisqGoodnessOfFit
which is a list with components
chisq |
a vector of Chisq values. |
quantiles |
a vector of quantiles (of the chisq values). |
df |
degress of freedom |
Kasper Daniel Hansen [email protected]
For the plotting method, see qqplot
.
if(require(bsseqData)) { data(BS.cancer.ex) BS.cancer.ex <- updateObject(BS.cancer.ex) gof <- poissonGoodnessOfFit(BS.cancer.ex) plot(gof) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex <- saveHDF5SummarizedExperiment(x = BS.cancer.ex, dir = tempfile()) hdf5_gof <- poissonGoodnessOfFit(hdf5_BS.cancer.ex) plot(hdf5_gof) }
if(require(bsseqData)) { data(BS.cancer.ex) BS.cancer.ex <- updateObject(BS.cancer.ex) gof <- poissonGoodnessOfFit(BS.cancer.ex) plot(gof) #------------------------------------------------------------------------------- # An example using a HDF5Array-backed BSseq object # library(HDF5Array) # See ?SummarizedExperiment::saveHDF5SummarizedExperiment for details hdf5_BS.cancer.ex <- saveHDF5SummarizedExperiment(x = BS.cancer.ex, dir = tempfile()) hdf5_gof <- poissonGoodnessOfFit(hdf5_BS.cancer.ex) plot(hdf5_gof) }
A class with a GRanges slot, used as a building block for other classes. Provides basic accessor functions etc.
Objects can be created by calls of the form new("hasGRanges", ...)
.
gr
:Object of class GRanges
.
Subsets a single dimension.
Get the GRanges
object representing
genomic locations.
Start, end and width
for the genomic locations of the object, also replacement
functions. This accessor functions operate directly on the
gr
slot.
Getting and setting the
strand
of the genomic locations (the gr
slot).
Getting and setting the
seqlengths
of the genomic locations (the gr
slot).
Getting and setting the
seqlevels
of the genomic locations (the gr
slot).
Getting and setting the
seqnames
of the genomic locations (the gr
slot).
The show method.
(query = "hasGRanges", subject =
"hasGRanges")
: finds overlaps between the granges()
of
the two objects.
(query = "GenomicRanges", subject =
"hasGRanges")
: finds overlaps between query and the granges()
of
the subject.
(query = "hasGRanges", subject =
"GenomicRanges")
: finds overlaps between the granges()
of the query and the subject.
(query = "hasGRanges", subject =
"hasGRanges")
: Subset the query, keeping the genomic locations that
overlaps the subject.
(query = "hasGRanges", subject =
"GenomicRanges")
: Subset the query, keeping the genomic
locations that overlaps the subject.
(query = "GenomicRanges",
subject = "hasGRanges")
: Subset the query, keeping the genomic
locations that overlaps the subject.
If you extend the hasGRanges
class, you should consider writing
a subset method ([
), and a show
method. If the new
class supports single index subsetting, the subsetByOverlaps
methods will automatically extend.
Kasper Daniel Hansen [email protected]
showClass("hasGRanges")
showClass("hasGRanges")
Functions for plotting BSmooth methylation estimates. Typically used to display differentially methylated regions.
plotRegion(BSseq, region = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE) plotManyRegions(BSseq, regions = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE, verbose = TRUE)
plotRegion(BSseq, region = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE) plotManyRegions(BSseq, regions = NULL, extend = 0, main = "", addRegions = NULL, annoTrack = NULL, cex.anno = 1, geneTrack = NULL, cex.gene = 1.5, col = NULL, lty = NULL, lwd = NULL, BSseqStat = NULL, stat = "tstat.corrected", stat.col = "black", stat.lwd = 1, stat.lty = 1, stat.ylim = c(-8, 8), mainWithWidth = TRUE, regionCol = alpha("red", 0.1), addTicks = TRUE, addPoints = FALSE, pointsMinCov = 5, highlightMain = FALSE, verbose = TRUE)
BSseq |
An object of class |
region |
A |
regions |
A |
extend |
Describes how much the plotting region should be
extended in either direction. The total width of the plot is equal to
the width of the region plus twice |
main |
The plot title. The default is to construct a title with information about which genomic region is being plotted. |
addRegions |
A set of additional regions to be highlighted on the
plots. As the |
annoTrack |
A named list of |
cex.anno |
|
geneTrack |
EXPERIMENTAL: A |
cex.gene |
|
col |
The color of the methylation estimates, see details. |
lty |
The line type of the methylation estimates, see details. |
lwd |
The line width of the methylation estimates, see details. |
BSseqStat |
An object of class |
stat |
Which statistics will be plotted (only used is
|
stat.col |
color for the statistics plot. |
stat.lwd |
line width for the statistics plot. |
stat.lty |
line type for the statistics plot. |
stat.ylim |
y-limits for the statistics plot. |
mainWithWidth |
Should the default title include information about width of the plot region. |
regionCol |
The color used for highlighting the region. |
addTicks |
Should tick marks showing the location of methylation loci, be added? |
addPoints |
Should the individual unsmoothed methylation estimates be plotted. This usually leads to a very confusing plot, but may be useful for diagnostic purposes. |
pointsMinCov |
The minimum coverage a methylation loci need in
order for the raw methylation estimates to be plotted. Useful for
filtering out low coverage loci. Only used if |
highlightMain |
Should the plot region be highlighted? |
verbose |
Should the function be verbose? |
The correct choice of aspect ratio depends on the width of the
plotting region. We tend to use width = 10, height = 5
.
plotManyRegions
is used to plot many regions (hundreds or
thousands), and is substantially quicker than repeated calls to
plotRegion
.
This function has grown to be rather complicated over time. For custom plotting, it is sometimes useful to use the function definition as a skeleton and directly modify the code.
This function is invoked for its side effect: producing a plot.
Kasper Daniel Hansen [email protected]
The package vignette has an extended example.
Parsing output from the Bismark alignment suite.
read.bismark(files, loci = NULL, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE, BPPARAM = bpparam(), BACKEND = NULL, dir = tempfile("BSseq"), replace = FALSE, chunkdim = NULL, level = NULL, nThread = 1L, verbose = getOption("verbose"))
read.bismark(files, loci = NULL, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE, BPPARAM = bpparam(), BACKEND = NULL, dir = tempfile("BSseq"), replace = FALSE, chunkdim = NULL, level = NULL, nThread = 1L, verbose = getOption("verbose"))
files |
The path to the files created by running Bismark's methylation
extractor, one sample per file.
Files ending in |
loci |
|
colData |
An optional DataFrame describing the samples.
Row names, if present, become the column names of the BSseq
object. If |
rmZeroCov |
A |
strandCollapse |
A |
BPPARAM |
An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. Currently supported are SerialParam (Unix, Mac, Windows), MulticoreParam (Unix and Mac), SnowParam (Unix, Mac, and Windows, limited to single-machine clusters), and BatchtoolsParam (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details. |
BACKEND |
|
dir |
Only applicable if |
replace |
Only applicable if |
chunkdim |
Only applicable if |
level |
Only applicable if |
nThread |
The number of threads used by |
verbose |
A |
A BSseq object.
The format of each file is automatically detected using the internal function bsseq:::.guessBismarkFileType()
.
Files ending in .gz
, .bz2
, .xz
, or .zip
will be automatically decompressed to tempdir()
.
Bismark's 'genome wide cytosine report' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords) and 'coverage' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords) formats are both supported.
If setting loci = NULL
, then we strongly recommend using the 'genome wide cytosine report' output format because this includes strand information for each locus.
The 'coverage' output does not contain strand information and so the strand
of the returned BSseq object will be set to *
unless stranded loci
are supplied.
Neither the 'bedGraph' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords) nor the 'bismark_methylation_extractor' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords) are supported. The former does not include the required counts of methylated and unmethylated reads hile the is an intermediate file containing read-level, rather than locus-level, data on methylation.
The genomic co-ordinates of the Bismark output files may be zero-based or one-based depending on whether the --zero_based
argument was used when running Bismark's methylation extractor.
Furthermore, the default co-ordinate counting system varies by version of Bismark.
bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.
Since Bioconductor packages typically use one-based co-ordinates, we strongly recommend that your Bismark files are also one-based.
read.bismark()
We recommend the following to achieve fast and efficient importing of Bismark files:
Specify the set of methylation loci via the loci
argument.
Use Bismark files in the 'coverage' output format.
Leave rmZeroCov = FALSE
.
Use a BPPARAM
with a moderate number of workers (cores).
Use BACKEND = "HDF5Array"
.
Use multiple threads per worker (i.e. nThread
> 1).
Each point is discussed below.
loci
Specifying the set of methylation loci via the loci
argument means that read.bismark()
does not need to first parse all files
to identify the set of candidate loci.
Provided that rmZeroCov = FALSE
, this means that each file is only read once.
This may be a considerable saving when there are a large number of files
.
If you are unsure whether the below-described shortcuts apply to your data, leave loci = NULL
and let read.bismark()
identify the set of candidate loci from files
.
You may wish to use the findLoci()
function to find all methylation loci of interest in your reference genome (e.g., all CpGs) and then pass the result via the loci
argument.
Alternatively, if all files
are 'genome wide cytosine reports' for samples aligned to the same reference genome, then all files
contain the exact same set of methylation loci.
In this case, you may wish to first construct loci
using the internal function bsseq:::.readBismarkAsFWGRanges()
applied to a single file, e.g., loci = bsseq:::.readBismarkAsFWGRanges(files[1], rmZeroCov, strandCollapse)
.
It will generally be faster to parse Bismark files in the 'coverage' output format than those in the 'genome wide cytosine report' format This is because the former only includes loci with non-zero coverage and so the file size is often considerably smaller, particularly for shallowly sequenced samples (e.g., those from single-cell bisulfite sequencing).
rmZeroCov = FALSE
If you set rmZeroCov = TRUE
, then read.bismark()
must first parse all the files
to identify which loci have zero coverage in all samples and then filter these out from the set of candidate loci.
This will happen even if you supply loci
with a GenomicRanges of candidate loci.
Furthermore, any coverage-based filtering of methylation loci is best left until you have constructed your final BSseq object.
In our experience, the final BSseq object is often the product of combining multiple BSseq objects, each constructed with a separate call to read.bismark()
.
In such cases, it is premature to use rmZeroCov = TRUE
when running each read.bismark()
; regretably, combining these objects will often then lead to an inefficiently stored BSseq object.
BPPARAM
with a moderate number of workers (cores)Each file can be processed on its own, so you can process in parallel as many files as you have workers. However, if using the HDF5Array backend, then writing to the HDF5 file cannot be performed in parallel and so becomes the bottleneck. Nonetheless, by using a moderate number of workers (2 - 10), we can ensure there is processed data available to write to disk as soon as the current write is completed.
BACKEND = "HDF5Array"
By using the HDF5Array realization backend from HDF5Array, we reduce the amount of data that is kept in-memory at any one time. Once each file is parsed, the data are written to the HDF5 file and are no longer needed in-memory. When combined with multiple workers (cores), this means that each file will only need to read and retain in-memory 1 sample's worth of data at a time.
Conversely, if you opt for all data to be in-memory (via BACKEND = NULL
), then each worker will pass each file's data back to the main process and memory usage will steadily accumulate to often unreasonable levels.
nThread
> 1read.bismark
uses data.table::fread
to
read each file, which supports threaded-parallisation. Depending on the
system, increasing nThread
can achieve near-linear speed-ups in the
number of threads for reading each file. Care needs to be taken when
nThread
> 1 is used in conjunction with a parallel backend via
BPPARAM
to ensure the system isn't overloaded. For example, using
BPPARAM = MulticoreParam(workers = 10)
with nThread = 4
may
use up to 40 workers simultaneously.
The read.bismark()
function creates a BSseq object with two assays, M
and Cov
.
The choice of realization backend controls whether these assays are stored in-memory as an ordinary matrix or on-disk as a HDF5Array, for example.
The choice of realization backend is controlled by the BACKEND
argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend()
.
read.bismark()
supports the following realization backends:
NULL
(in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array
(on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix from HDF5Array.
Please note that certain combinations of realization backend and parallelization backend are currently not supported.
For example, the HDF5Array realization backend is currently only compatible when used with a single-machine parallelization backend (i.e. it is not compatible
with a SnowParam that specifies an ad hoc cluster of
multiple machines).
BSmooth()
will issue an error when given such incompatible realization and parallelization backends.
Additional arguments related to the realization backend can be passed via the ...
argument.
These arguments must be named and are passed to the relevant RealizationSink constructor.
For example, the ...
argument can be used to specify the path to the HDF5 file to be
used by BSmooth()
.
Please see the examples at the bottom of the page.
read.bismark()
now uses the BiocParallel package to implement
parallelization. This brings some notable improvements:
Imported files can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a
SnowParam object as the value of BPPARAM
.
Detailed and extensive job logging facilities.
All parallelization options are controlled via the BPPARAM
argument.
In general, we recommend that users combine multicore (single-machine)
parallelization with an on-disk realization backend (see section,
'Realization backend'). For Unix and Mac users, this means using
a MulticoreParam. For Windows users, this means using a
single-machine SnowParam. Please consult the BiocParallel
documentation to take full advantage of the more advanced features.
A useful feature of BiocParallel are progress bars to monitor the
status of long-running jobs, such as BSmooth()
. Progress bars are
controlled via the progressbar
argument in the
BiocParallelParam constructor.
BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.
Peter Hickey [email protected]
collapseBSseq()
for collapsing (aggregating) data from sample's with multiple Bismark methylation extractor files (e.g., technical replicates).
# Run read.bismark() on a single sample to construct a matrix-backed BSseq # object. infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz", package = "bsseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, verbose = TRUE) # This is a matrix-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) bsseq ## Not run: # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq # object (with data written to 'test_dir') test_dir <- tempfile("BSseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, BACKEND = "HDF5Array", dir = test_dir, verbose = TRUE) # This is a HDF5Array-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir'). sapply(assays(bsseq, withDimnames = FALSE), path) ## End(Not run)
# Run read.bismark() on a single sample to construct a matrix-backed BSseq # object. infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz", package = "bsseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, verbose = TRUE) # This is a matrix-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) bsseq ## Not run: # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq # object (with data written to 'test_dir') test_dir <- tempfile("BSseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, BACKEND = "HDF5Array", dir = test_dir, verbose = TRUE) # This is a HDF5Array-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir'). sapply(assays(bsseq, withDimnames = FALSE), path) ## End(Not run)
Construct BSseq objects from nanopore BED files
read.modbam2bed( files, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE )
read.modbam2bed( files, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE )
files |
vector, BED files |
colData |
data frame, phenotypic data with samples as rows and variables as columns |
rmZeroCov |
A logical (1) indicating whether methylation loci that have zero coverage in all samples be removed |
strandCollapse |
A logical (1) indicating whether stand-symmetric methylation loci (i.e. CpGs) should be collapsed across strands |
This function reads in nanopore sequencing modified BED files to Bsseq objects. Nanopore sequencing data (i.e. aggregated modified base counts) is stored in modified-base BAM files. These modified-base BAM files are converted to bedMethyl (BED) files using modbam2bed.
After installing modbam2bed, a conda environment is activated. Index files
for BAM files are created using samtools index
. The code requires aligned reads with the Mm and Ml tags (MM and ML also supported), and the reference sequence used for alignment (<reference.fasta>).
-e, -- extended
to output canonical, modified, and filtered bases;
-m, -- mod_base=BASE
to output modified base of interest, one of: 5mC, 5hmC, 5fC, 5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao. (Or modA, modC, modG, modT, modU, modN for generic modified base);
-r, --region=chr:start-end
to output chromosome or genomic region of interest;
-f, --threshold=THRESHOLD
to output filtered bases for
probability lower than threshold (default = 0.66)
After creating BED files using modbam2bed, the BED files are read in and the Bsseq object is constructed via read.modbam2bed()
function. The function reads in BED files, extract genomic regions, methylation, coverage, ambiguous modification status data and sample information and then construct Bsseq object using BSseq
function within the package.
BSseq object
files <- c(system.file("extdata/modbam2bed/ctr1.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/ctr2.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/ctr3.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/tret1.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/tret2.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/tret3.chr10.chr11.bed.gz", package = "bsseq")) pd <- data.frame(condition = rep(c("control", "treatment"), each = 3), replicate = rep(c("rep1", "rep2", "rep3"), times = 2)) bsseq_nano <- bsseq::read.modbam2bed(files,colData=pd,rmZeroCov = FALSE, strandCollapse=TRUE)
files <- c(system.file("extdata/modbam2bed/ctr1.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/ctr2.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/ctr3.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/tret1.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/tret2.chr10.chr11.bed.gz", package = "bsseq"), system.file("extdata/modbam2bed/tret3.chr10.chr11.bed.gz", package = "bsseq")) pd <- data.frame(condition = rep(c("control", "treatment"), each = 3), replicate = rep(c("rep1", "rep2", "rep3"), times = 2)) bsseq_nano <- bsseq::read.modbam2bed(files,colData=pd,rmZeroCov = FALSE, strandCollapse=TRUE)
Construct BSseq objects from nanopore BED files
read.modkit( files, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE )
read.modkit( files, colData = NULL, rmZeroCov = FALSE, strandCollapse = TRUE )
files |
vector, BED files |
colData |
data frame, phenotypic data with samples as rows and variables as columns |
rmZeroCov |
A logical (1) indicating whether methylation loci that have zero coverage in all samples be removed |
strandCollapse |
A logical (1) indicating whether stand-symmetric methylation loci (i.e. CpGs) should be collapsed across strands |
This function reads in nanopore sequencing modified BED files to Bsseq objects. Nanopore sequencing data (i.e. aggregated modified base counts) is stored in modified-base BAM files. These modified-base BAM files are converted to bedMethyl (BED) files using modkit.
Modkit outputs modified reads, unmodified reads, ambiguous modification reads (reads where the probability was below the threshold and usually failing the lowest 10th percentile), and other modified reads.
After creating BED files using modkit, the BED files are read in and the Bsseq object is constructed via read.modkit()
function. The function reads in BED files, extract genomic regions, methylation, coverage, ambiguous modification status data and sample information and then construct Bsseq object using BSseq
function within the package. Other modification bases such as hydroxymethylation are extracted and added to the methylation matrix when present.
BSseq objects
# No other modification present files <- c(system.file("extdata/modkit/chr21.chr22.HG002.top1000.bed.gz", package = "bsseq")) bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE) # Other modification present files <- c(system.file("extdata/modkit/Hypo1.first50Bed.txt",package = "bsseq")) bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE)
# No other modification present files <- c(system.file("extdata/modkit/chr21.chr22.HG002.top1000.bed.gz", package = "bsseq")) bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE) # Other modification present files <- c(system.file("extdata/modkit/Hypo1.first50Bed.txt",package = "bsseq")) bsseq_nano <- read.modkit(files, rmZeroCov = FALSE, strandCollapse=FALSE)