Title: | Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing |
---|---|
Description: | This package implements an approach for scanning the genome to detect and perform accurate inference on differentially methylated regions from Whole Genome Bisulfite Sequencing data. The method is based on comparing detected regions to a pooled null distribution, that can be implemented even when as few as two samples per population are available. Region-level statistics are obtained by fitting a generalized least squares (GLS) regression model with a nested autoregressive correlated error structure for the effect of interest on transformed methylation proportions. |
Authors: | Keegan Korthauer [cre, aut] , Rafael Irizarry [aut] , Yuval Benjamini [aut], Sutirtha Chakraborty [aut] |
Maintainer: | Keegan Korthauer <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.27.0 |
Built: | 2024-12-29 07:27:29 UTC |
Source: | https://github.com/bioc/dmrseq |
This is the annotation information returned from
getAnnot
, subsetted for chromosome 21 for convenience
in running the examples. The annotation is obtained using the
annotatr
package.
data(annot.chr21)
data(annot.chr21)
a GRangesList
object with two elements returned
by getAnnot
. The first
contains CpG category information in the first element (optional)
coding gene sequence information in the second element (optional).
At least one of these elements needs to be non-null in order for
any annotation to be plotted, but it is not necessary to contain
both.
Obtained from running
annoTrack
function and then subsetting the results to
only include chromosome 21. A script which executes these steps
and constructs the annot.chr21
object may be found in ‘inst/scripts/get_annot.chr21.R’
data(annot.chr21)
data(annot.chr21)
This dataset represents chromosome 21 from the IMR90 and H1 cell lines sequenced in Lister et al. Only CpG methylation are included. The two samples from each cell line are two different extractions (ie. technical replicates), and are pooled in the analysis in the original paper.
data(BS.chr21)
data(BS.chr21)
An object of class BSseq
.
All coordinates are in hg18.
Obtained from
http://neomorph.salk.edu/human_methylome/data.html specifically,
the files mc_h1_r1.tar.gz, mc_h1_r2.tar.gz,
mc_imr90_r1.tar.gz, mc_imr90_r2.tar.gz
A script which downloads these files and constructs the BS.chr21
object may be found in ‘inst/scripts/get_BS.chr21.R’ - this was
based off of and modified from the get_BS.chr22.R script in the
bsseq
package. The object constructed here contains a
different chromosome (22 instead of 21), and two additional samples
(h1 and imr90 instead of just imr90) to enable identification of
cell type-DMRs for examples.
R Lister et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature (2009) 462, 315-322.
data(BS.chr21) BS.chr21
data(BS.chr21) BS.chr21
Function to add visual representation of CpG categories and/or coding sequences to DMR plot
dmrPlotAnnotations(gr, annoTrack)
dmrPlotAnnotations(gr, annoTrack)
gr |
a |
annoTrack |
a |
An internal function that takes an annotation
SimpleGRangesList
object that
contains CpG category information in the first element (optional) and / or
coding gene sequence information in the second element (optional). If neither
of these are present, then nothing will be plotted.
None
Example output from dmrseq
function run on the
example dataset BS.chr21
.
data(dmrs.ex)
data(dmrs.ex)
a data.frame that contains the results of the inference. The data.frame contains one row for each candidate region, and 10 columns, in the following order: 1. chr = chromosome, 2. start = start basepair position of the region, 3. end = end basepair position of the region, 4. indexStart = the index of the region's first CpG, 5. indexEnd = the index of the region's last CpG, 6. L = the number of CpGs contained in the region, 7. area = the sum of the smoothed beta values 8. beta = the coefficient value for the condition difference, 9. stat = the test statistic for the condition difference, 10. pval = the permutation p-value for the significance of the test statistic, and 11. qval = the q-value for the test statistic (adjustment for multiple comparisons to control false discovery rate).
Obtained from running the examples in dmrseq
.
A script which executes these steps
and constructs the dmrs.ex
object may be found in ‘inst/scripts/get_dmrs.ex.R’
data(dmrs.ex)
data(dmrs.ex)
Performs a two-step approach that (1) detects candidate regions, and (2) scores candidate regions with an exchangeable (across the genome) statistic and evaluates statistical significance using a permuation test on the pooled null distribution of scores.
dmrseq( bs, testCovariate, adjustCovariate = NULL, cutoff = 0.1, minNumRegion = 5, smooth = TRUE, bpSpan = 1000, minInSpan = 30, maxGapSmooth = 2500, maxGap = 1000, verbose = TRUE, maxPerms = 10, matchCovariate = NULL, BPPARAM = bpparam(), stat = "stat", block = FALSE, blockSize = 5000, chrsPerChunk = 1 )
dmrseq( bs, testCovariate, adjustCovariate = NULL, cutoff = 0.1, minNumRegion = 5, smooth = TRUE, bpSpan = 1000, minInSpan = 30, maxGapSmooth = 2500, maxGap = 1000, verbose = TRUE, maxPerms = 10, matchCovariate = NULL, BPPARAM = bpparam(), stat = "stat", block = FALSE, blockSize = 5000, chrsPerChunk = 1 )
bs |
bsseq object containing the methylation values as well as the phenotype matrix that contains sample level covariates |
testCovariate |
Character value indicating which variable
(column name) in |
adjustCovariate |
an (optional) character value or vector
indicating which variables (column names) in |
cutoff |
scalar value that represents the absolute value (or a vector of two numbers representing a lower and upper bound) for the cutoff of the single CpG coefficient that is used to discover candidate regions. Default value is 0.10. |
minNumRegion |
positive integer that represents the minimum number of CpGs to consider for a candidate region. Default value is 5. Minimum value is 3. |
smooth |
logical value that indicates whether or not to smooth the CpG level signal when discovering candidate regions. Defaults to TRUE. |
bpSpan |
a positive integer that represents the length in basepairs
of the smoothing span window if |
minInSpan |
positive integer that represents the minimum number of
CpGs in a smoothing span window if |
maxGapSmooth |
integer value representing maximum number of basepairs
in between neighboring CpGs to be included in the same
cluster when performing smoothing (should generally be larger than
|
maxGap |
integer value representing maximum number of basepairs in between neighboring CpGs to be included in the same DMR. |
verbose |
logical value that indicates whether progress messages should be printed to stdout. Defaults value is TRUE. |
maxPerms |
a positive integer that represents the maximum number of permutations that will be used to generate the global null distribution of test statistics. Default value is 10. |
matchCovariate |
An (optional) character value
indicating which variable (column name) of |
BPPARAM |
a |
stat |
a character vector indicating the name of the column of the output to use as the region-level test statistic. Default value is 'stat' which is the region level-statistic designed to be comparable across the genome. It is not recommended to change this argument, but it can be done for experimental purposes. Possible values are: 'L' - the number of loci in the region, 'area' - the sum of the smoothed loci statistics, 'beta' - the effect size of the region, 'stat' - the test statistic for the region, or 'avg' - the average smoothed loci statistic. |
block |
logical indicating whether to search for large-scale (low
resolution) blocks of differential methylation (default is FALSE, which
means that local DMRs are desired). If TRUE, the parameters for
|
blockSize |
numeric value indicating the minimum number of basepairs
to be considered a block (only used if |
chrsPerChunk |
a positive integer value indicating the number of chromosomes per chunk. The default is 1, meaning that the data will be looped through one chromosome at a time. When pairing up multiple chromosomes per chunk, sizes (in terms of numbers of CpGs) will be taken into consideration to balance the sizes of each chunk. |
a GRanges
object that contains the results of the inference.
The object contains one row for each candidate region, sorted by q-value
and then chromosome. The standard
GRanges
chr, start, and end are included, along with at least
7 metadata
columns, in the following order:
1. L = the number of CpGs contained in the region,
2. area = the sum of the smoothed beta values
3. beta = the coefficient value for the condition difference (there
will be more than one column here if a multi-group comparison
was performed),
4. stat = the test statistic for the condition difference,
5. pval = the permutation p-value for the significance of the test
statistic, and
6. qval = the q-value for the test statistic (adjustment
for multiple comparisons to control false discovery rate).
7. index = an IRanges
containing the indices of the region's
first CpG to last CpG.
# load example data data(BS.chr21) # the covariate of interest is the 'CellType' column of pData(BS.chr21) testCovariate <- 'CellType' # run dmrseq on a subset of the chromosome (10K CpGs) regions <- dmrseq(bs=BS.chr21[240001:250000,], cutoff = 0.05, testCovariate=testCovariate)
# load example data data(BS.chr21) # the covariate of interest is the 'CellType' column of pData(BS.chr21) testCovariate <- 'CellType' # run dmrseq on a subset of the chromosome (10K CpGs) regions <- dmrseq(bs=BS.chr21[240001:250000,], cutoff = 0.05, testCovariate=testCovariate)
Uses the annotatr
package to retrieve annotation information (
CpG category and gene coding sequences) for the annoTrack
argument
of plotDMRs
. Allows for 5
re-tries if download fails (to allow for a spotty internet connection).
getAnnot(genomeName)
getAnnot(genomeName)
genomeName |
a character object that indicates which organism is
under study. Use the function |
Note that this package needs to attach the annotatr
package,
and will
return NULL if this cannot be done. You can still use the
plotDMRs
function without this optional annotation step,
just by leaving the annoTrack
argument as NULL.
a SimpleGRangesList
object with two elements returned
by getAnnot
. The first
contains CpG category information in the first element (optional)
coding gene sequence information in the second element (optional).
At least one of these elements needs to be non-null in order for
any annotation to be plotted, but it is not necessary to contain
both.
# get annotation information for hg19 annoTrack <- getAnnot('hg19')
# get annotation information for hg19 annoTrack <- getAnnot('hg19')
This function calculates raw mean methylation differences for the covariate of interest over a set of DMRs (or regions of interest), assuming a simple two-group comparison.
meanDiff(bs, dmrs, testCovariate)
meanDiff(bs, dmrs, testCovariate)
bs |
a |
dmrs |
a data.frame with one row per DMR. This can be in the format
of |
testCovariate |
a character indicating the covariate of interest in
the |
numeric vector of raw mean methylation differences.
data(BS.chr21) data(dmrs.ex) rawDiff <- meanDiff(BS.chr21, dmrs=dmrs.ex, testCovariate="CellType")
data(BS.chr21) data(dmrs.ex) rawDiff <- meanDiff(BS.chr21, dmrs=dmrs.ex, testCovariate="CellType")
Generates trace plots of methylation proportions by genomic position.
plotDMRs( BSseq, regions = NULL, testCovariate = NULL, extend = (end(regions) - start(regions) + 1)/2, main = "", addRegions = regions, annoTrack = NULL, col = NULL, lty = NULL, lwd = NULL, label = NULL, mainWithWidth = TRUE, regionCol = .alpha("#C77CFF", 0.2), addTicks = TRUE, addPoints = TRUE, pointsMinCov = 1, highlightMain = FALSE, qval = TRUE, stat = TRUE, verbose = TRUE, includeYlab = TRUE, compareTrack = NULL, labelCols = NULL, horizLegend = FALSE, addLines = TRUE, linesMinCov = 1 )
plotDMRs( BSseq, regions = NULL, testCovariate = NULL, extend = (end(regions) - start(regions) + 1)/2, main = "", addRegions = regions, annoTrack = NULL, col = NULL, lty = NULL, lwd = NULL, label = NULL, mainWithWidth = TRUE, regionCol = .alpha("#C77CFF", 0.2), addTicks = TRUE, addPoints = TRUE, pointsMinCov = 1, highlightMain = FALSE, qval = TRUE, stat = TRUE, verbose = TRUE, includeYlab = TRUE, compareTrack = NULL, labelCols = NULL, horizLegend = FALSE, addLines = TRUE, linesMinCov = 1 )
BSseq |
An object of class BSseq. |
regions |
A data.frame containing the DMRs (output from the main
|
testCovariate |
integer value or vector indicating which of columns of
|
extend |
Describes how much the plotting region should be extended in either direction. The total width of the plot is equal to the width of the region plus twice extend. |
main |
The plot title. The default is to construct a title with information about which genomic region is being plotted. |
addRegions |
A set of additional regions to be highlighted on the
plots. Same format as the |
annoTrack |
a |
col |
The color of the methylation estimates. It is recommended to
leave this value as default (NULL), and specify a value of
|
lty |
The line type of the methylation estimates. It is recommended to
pass this information through the |
lwd |
The line width of the methylation estimates. It is recommended to
pass this information through the |
label |
The condition/population labels for the plot legend. If NULL
(default) this is taken from the |
mainWithWidth |
logical value indicating whether the default title should include information about width of the plot region. |
regionCol |
The color used for highlighting the region. |
addTicks |
logical value indicating whether tick marks showing the location of methylation loci should be added. Default is TRUE. |
addPoints |
logical value indicating whether the individual methylation estimates be plotted as points. |
pointsMinCov |
The minimum coverage a methylation loci need in order for the raw methylation estimates to be plotted. Useful for filtering out low coverage loci. Only used if addPoints = TRUE. Default value is 1 (no filtering). |
highlightMain |
logical value indicating whether the plot region should be highlighted. |
qval |
logical value indicating whether the region FDR estimate
(q-value) should be displayed in the plot title. The value is extracted
from the |
stat |
logical value indicating whether the region statistic
should be displayed in the plot title. The value is extracted from the
|
verbose |
logical value indicating whether progress messages should be printed to the screen. |
includeYlab |
a logical indicating whether to include the Y axis label 'Methylation' (useful to turn off if combining multiple region figures and you do not want to include redundant y axis label information) |
compareTrack |
a named GRangesList object that contains up to four custom tracks (GRanges objects) which will be plotted below the region. Only one of 'compareTrack' or 'annoTrack' can be specified since there is only for plotting either the built in GpG category and exon tracks, *or* a custom set of tracks. |
labelCols |
a character vector with names of the mcols slot of the GRanges items in ‘compareTrack’. Only used if plotting custom tracks using the ‘compareTrack’ argument. If specified, the (first) value in that column is printed along with a label that includes the name of the list item. If NULL (default), just the name of the track is printed. |
horizLegend |
logical indicating whether the legend should be horizontal instead of vertical (default FALSE). This is useful if you need to plot many labels and want to preserve whitespace. |
addLines |
logical indicating whether to plot smooth lines between points. Default is true. Can be useful to turn this off for very small regions. |
linesMinCov |
The minimum coverage a methylation loci need in order to be used for plotting of smoothed lines. Useful for filtering out low coverage loci. Only used if addLines = TRUE. Default value is 1 (no filtering). |
Creates aesthetially pleasing DMR plots. By default will plot individual
points with size proportional to coverage, along with a smoothed line
for each sample. Elements will be colored by biological condition
(label
). Also has functionality to add annotations below the main
plot (CpG category, genes) if annoTrack
is specified.
None (generates a plot)
# load the example data data(BS.chr21) # load example results (computed with dmrseq function) data(dmrs.ex) # get annotation information (using getAnnot function) # here we'll load the example annotation from chr21 data(annot.chr21) # plot the 1st DMR plotDMRs(BS.chr21, regions=dmrs.ex[1,], testCovariate=1, annoTrack=annot.chr21)
# load the example data data(BS.chr21) # load example results (computed with dmrseq function) data(dmrs.ex) # get annotation information (using getAnnot function) # here we'll load the example annotation from chr21 data(annot.chr21) # plot the 1st DMR plotDMRs(BS.chr21, regions=dmrs.ex[1,], testCovariate=1, annoTrack=annot.chr21)
Uses ggplot2 to plot smoothed density histograms of methylation
proportions (beta values), or coverage. Methylation proportion densities
are weighted by coverage.
The number of curves plotted
will be equal to the number of different values of testCovariate
,
unless bySample
is TRUE. This can take quite some time to
execute for a large object, so it is recommended to first take a random
sample of loci (say one million) before plotting.
plotEmpiricalDistribution( bs, testCovariate = NULL, bySample = FALSE, type = "M", adj = 2.5 )
plotEmpiricalDistribution( bs, testCovariate = NULL, bySample = FALSE, type = "M", adj = 2.5 )
bs |
a BSseq object |
testCovariate |
character specifying the column name of the
|
bySample |
logical whether to plot a separate line for each sample,
even if the grouping |
type |
a character indicating which type of density to plot - the methylation (beta) values ("M") or the coverage ("Cov"). Default is "M". |
adj |
a numeric value for the |
a ggplot object
data(BS.chr21) # plot beta values by sample group plotEmpiricalDistribution(BS.chr21, testCovariate="CellType")
data(BS.chr21) # plot beta values by sample group plotEmpiricalDistribution(BS.chr21, testCovariate="CellType")
Add simulated DMRs to observed control data. Control data will be split into two (artificial) populations.
simDMRs(bs, num.dmrs = 3000, delta.max0 = 0.3)
simDMRs(bs, num.dmrs = 3000, delta.max0 = 0.3)
bs |
a BSseq object containing only control samples (from the same population) for which simulated DMRs will be added after dividing the population into two artificial groups. |
num.dmrs |
an integer specifying how many DMRs to add. |
delta.max0 |
a proportion value indicating the mode value for the difference in proportion of methylated CpGs in the simulated DMRs (the actual value will be drawn from a scaled Beta distribution centered at this value). Default value is 0.3. |
A named list object with 5 elements: (1)
gr.dmrs
is a GRanges
object with num.dmrs
ranges that represent the random DMRs added. (2) dmr.mncov
is a
numeric vector that contains the mean coverage in each simulated DMR. (3)
dmr.L
is a numeric vector that contains the number of CpGs in each
simulated DMR. (4) bs
is the BSseq object that contains the
simulated DMRs. (5) deltas
is a numeric vector that contains the
effect size used for each DMR.
# Add simulated DMRs to a BSseq dataset # This is just for illustrative purposes - ideally you would # add DMRs to a set of samples from the same condition (in our # example data, we have data from two different cell types) # In this case, we shuffle the samples by cell type to create # a null comparison. data(BS.chr21) BS.chr21.sim <- simDMRs(bs=BS.chr21[1:10000,c(1,3,2,4)], num.dmrs=50) # show the simulated DMRs GRanges object show(BS.chr21.sim$gr.dmrs) # show the updated BSseq object that includes the simulated DMRs show(BS.chr21.sim$bs) # examine effect sizes of the DMRs head(BS.chr21.sim$delta)
# Add simulated DMRs to a BSseq dataset # This is just for illustrative purposes - ideally you would # add DMRs to a set of samples from the same condition (in our # example data, we have data from two different cell types) # In this case, we shuffle the samples by cell type to create # a null comparison. data(BS.chr21) BS.chr21.sim <- simDMRs(bs=BS.chr21[1:10000,c(1,3,2,4)], num.dmrs=50) # show the simulated DMRs GRanges object show(BS.chr21.sim$gr.dmrs) # show the updated BSseq object that includes the simulated DMRs show(BS.chr21.sim$bs) # examine effect sizes of the DMRs head(BS.chr21.sim$delta)