This vignette introduces analysis methods for data from
high-throughput sequencing of bisulphite treated DNA to detect cytosine
methylation. The segmentSeq
package was originally designed
to detect siRNA loci Hardcastle2011 and
many of the methods developed for this can be used to detect loci of
cytosine methylation from replicated (or unreplicated) sequencing
data.
Preparation of the segmentSeq package proceeds as in siRNA analysis.
We begin by loading the segmentSeq
package.
library(segmentSeq)
#> Loading required package: baySeq
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: parallel
#> Loading required package: GenomicRanges
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: ShortRead
#> Loading required package: BiocParallel
#> Loading required package: Biostrings
#> Loading required package: XVector
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Loading required package: Rsamtools
#> Loading required package: GenomicAlignments
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
Note that because the experiments that segmentSeq
is
designed to analyse are usually massive, we should use (if possible)
parallel processing as implemented by the parallel
package.
If using this approach, we need to begin by define a cluster.
The following command will use eight processors on a single machine; see
the help page for ‘makeCluster’ for more information. If we don’t want
to parallelise, we can proceed anyway with a NULL
cluster.
Results may be slightly different depending on whether or not a cluster
is used owing to the non-deterministic elements of the method.
if(FALSE) { # set to FALSE if you don't want parallelisation
numCores <- min(8, detectCores())
cl <- makeCluster(numCores)
} else {
cl <- NULL
}
The segmentSeq
package is designed to read in output
from the YAMA (Yet Another Methylome Aligner) program. This is a
perl-based package using either bowtie or bowtie2 to align bisulphite
treated reads (in an unbiased manner) to a reference and identify the
number of times each cytosine is identified as methylated or
unmethylated. Unlike most other aligners, YAMA does not require that
reads that map to more than one location are discarded, instead it
reports the number of alternate matches to the reference for each
cytosine. This is then used by segmentSeq
to weight the
observed number of methylated/un-methylated cytosines at a location. The
files used here have been compressed to save space.
datadir <- system.file("extdata", package = "segmentSeq")
files <- c(
"short_18B_C24_C24_trim.fastq_CG_methCalls.gz",
"short_Sample_17A_trimmed.fastq_CG_methCalls.gz",
"short_13_C24_col_trim.fastq_CG_methCalls.gz",
"short_Sample_28_trimmed.fastq_CG_methCalls.gz")
mD <- readMeths(files = files, dir = datadir,
libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"),
nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134))
#> Reading files.......done!
#> Finding unique cytosines......done!
#> Processing samples.......done!
We can begin by plotting the distribution of methylation for these samples. The distribution can be plotted for each sample individually, or as an average across multiple samples. We can also subtract one distribution from another to visualise patterns of differential methylation on the genome.
par(mfrow = c(2,1))
dists <- plotMethDistribution(mD, main = "Distributions of methylation", chr = "Chr1")
plotMethDistribution(mD,
subtract = rowMeans(sapply(dists, function(x) x[,2])),
main = "Differences between distributions", chr = "Chr1")
Next, we process this alignmentData
object to produce a
segData
object. This segData
object contains a
set of potential segments on the genome defined by the start and end
points of regions of overlapping alignments in the
alignmentData
object. It then evaluates the number of tags
that hit in each of these segments.
sD <- processAD(mD, cl = cl)
#> Chromosome: Chr1
#> Finding start-stop co-ordinates...done!
#> 249271 candidate loci found.
We can now construct a segment map from these potential segments.
A fast method of segmentation can be achieved by assuming a binomial distribution on the data with an uninformative beta prior, and identifying those potential segments which have a sufficiently large posterior likelihood that the proportion of methylation exceeds some critical value. This value can be determined by examining the data using the ‘thresholdFinder’ function, but expert knowledge is likely to provide a better guess as to where methylation becomes biologically significant.
thresh = 0.2
hS <- heuristicSeg(sD = sD, aD = mD, prop = thresh, cl = cl, gap = 100, getLikes = FALSE)
#> Number of candidate loci: 249271
#> Number of candidate nulls: 163635......done!
#> Strand *
#> Checking overlaps.....done.
#> Selecting loci...done!
hS
#> GRanges object with 2955 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] Chr1 108-948 *
#> [2] Chr1 5449-5465 *
#> [3] Chr1 6452 *
#> [4] Chr1 6520 *
#> [5] Chr1 7020-7034 *
#> ... ... ... ...
#> [2951] Chr1 1993118-1993128 *
#> [2952] Chr1 1993149 *
#> [2953] Chr1 1993335 *
#> [2954] Chr1 1994611 *
#> [2955] Chr1 1994857-1994886 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> An object of class "lociData"
#> 2955 rows and 4 columns
#>
#> Slot "replicates"
#> A A B B
#> Slot "groups":
#> list()
#>
#> Slot "data":
#> [,1] [,2] [,3] [,4]
#> [1,] 247:169 175:133 165:149 28:31
#> [2,] 0:49 1:86 23:36 2:2
#> [3,] 2:0 5:3 0:3 0:1
#> [4,] 0:3 0:9 2:1 0:0
#> [5,] 0:119 2:78 20:37 0:2
#> 2950 more rows...
#>
#> Slot "annotation":
#> data frame with 0 columns and 2955 rows
#>
#> Slot "locLikelihoods" (stored on log scale):
#> Matrix with 2955 rows.
#> A B
#> 1 1 1
#> 2 0 1
#> 3 1 0
#> 4 0 1
#> 5 0 1
#> ... ... ...
#> 2951 1 1
#> 2952 1 1
#> 2953 1 0
#> 2954 1 0
#> 2955 0 1
#>
#> Expected number of loci in each replicate group
#> A B
#> 1968 2168
Within a methylation locus, it is not uncommon to find completely
unmethylated cytosines. If the coverage of these cytosines is too high,
it is possible that these will cause the locus to be split into two or
more fragments. The mergeMethSegs
function can be used to
overcome this splitting by merging loci with identical patterns of
expression that are not separated by too great a gap. Merging in this
manner is optional, but recommended.
We can then estimate posterior likelihoods on the defined loci by applying empirical Bayesian methods. These will not change the locus definition, but will assign likelihoods that the identified loci represent a true methylation locus in each replicate group.
Classification of the potential segments can also be carried out using empirical Bayesian methods. These are extremely computationally intensive, but allow biological variation within replicates to be more accurately modelled, thus providing an improved identification of methylation loci.
By one of these methods, we finally acquire an annotated
methData
object, with the annotations describing the
co-ordinates of each segment.
We can use this methData
object, in combination with the
alignmentMeth
object, to plot the segmented genome.
We can also examine the methData
object for
differentially methylated regions using the beta-binomial methods Hardcastle:2013 implemented in
baySeq
. We first define a group structure on the data.
The methObservables function pre-calculates a set of data to improve the speed of prior and posterior estimation (at some minor memory cost).
The density function used here is a composite of the beta-binomial and a binomial distribution that accounts for the reported non-conversion rates.
We can then determine a prior distribution on the parameters of the model for the data.
We can then find the posterior likelihoods of the models defined in the groups structure.
We can then retrieve the data for the top differentially methylated regions.
Finally, to be a good citizen, we stop the cluster we started earlier:
Thomas J. Hardcastle and Krystyna A. Kelly and David C. Baulcombe. Identifying small RNA loci from high-throughput sequencing data. Bioinformatics (2012).
Thomas J. Hardcastle and Krystyna A. Kelly. Empirical Bayesian analysis of paired high-throughput sequencing data with a beta-binomial distribution. BMC Bioinformatics (2013).
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] segmentSeq_2.41.1 ShortRead_1.65.0
#> [3] GenomicAlignments_1.43.0 SummarizedExperiment_1.37.0
#> [5] Biobase_2.67.0 MatrixGenerics_1.19.0
#> [7] matrixStats_1.4.1 Rsamtools_2.23.1
#> [9] Biostrings_2.75.1 XVector_0.47.0
#> [11] BiocParallel_1.41.0 GenomicRanges_1.59.1
#> [13] GenomeInfoDb_1.43.2 IRanges_2.41.1
#> [15] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [17] generics_0.1.3 baySeq_2.41.0
#> [19] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] xfun_0.49 bslib_0.8.0 hwriter_1.3.2.1
#> [4] latticeExtra_0.6-30 lattice_0.22-6 tools_4.4.2
#> [7] bitops_1.0-9 Matrix_1.7-1 RColorBrewer_1.1-3
#> [10] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.2
#> [13] deldir_2.0-4 statmod_1.5.0 codetools_0.2-20
#> [16] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
#> [19] sass_0.4.9 yaml_2.3.10 crayon_1.5.3
#> [22] jquerylib_0.1.4 cachem_1.1.0 DelayedArray_0.33.2
#> [25] limma_3.63.2 abind_1.4-8 locfit_1.5-9.10
#> [28] digest_0.6.37 maketools_1.3.1 fastmap_1.2.0
#> [31] grid_4.4.2 cli_3.6.3 SparseArray_1.7.2
#> [34] S4Arrays_1.7.1 edgeR_4.5.0 UCSC.utils_1.3.0
#> [37] rmarkdown_2.29 pwalign_1.3.0 httr_1.4.7
#> [40] jpeg_0.1-10 interp_1.1-6 png_0.1-8
#> [43] evaluate_1.0.1 knitr_1.49 rlang_1.1.4
#> [46] Rcpp_1.0.13-1 BiocManager_1.30.25 jsonlite_1.8.9
#> [49] R6_2.5.1 zlibbioc_1.52.0