segmentSeq: methods for detecting methylation loci and differential methylation

Introduction

This vignette introduces analysis methods for data from high-throughput sequencing of bisulphite treated DNA to detect cytosine methylation. The segmentSeq package was originally designed to detect siRNA loci Hardcastle2011 and many of the methods developed for this can be used to detect loci of cytosine methylation from replicated (or unreplicated) sequencing data.

Preparation

Preparation of the segmentSeq package proceeds as in siRNA analysis. We begin by loading the segmentSeq package.

library(segmentSeq)
#> Loading required package: baySeq
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#> 
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#>     unsplit, which.max, which.min
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: parallel
#> Loading required package: GenomicRanges
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: ShortRead
#> Loading required package: BiocParallel
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: Rsamtools
#> Loading required package: GenomicAlignments
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians

Note that because the experiments that segmentSeq is designed to analyse are usually massive, we should use (if possible) parallel processing as implemented by the parallel package. If using this approach, we need to begin by define a cluster. The following command will use eight processors on a single machine; see the help page for ‘makeCluster’ for more information. If we don’t want to parallelise, we can proceed anyway with a NULL cluster. Results may be slightly different depending on whether or not a cluster is used owing to the non-deterministic elements of the method.

if(FALSE) { # set to FALSE if you don't want parallelisation
    numCores <- min(8, detectCores())
    cl <- makeCluster(numCores)
} else {
    cl <- NULL
}

The segmentSeq package is designed to read in output from the YAMA (Yet Another Methylome Aligner) program. This is a perl-based package using either bowtie or bowtie2 to align bisulphite treated reads (in an unbiased manner) to a reference and identify the number of times each cytosine is identified as methylated or unmethylated. Unlike most other aligners, YAMA does not require that reads that map to more than one location are discarded, instead it reports the number of alternate matches to the reference for each cytosine. This is then used by segmentSeq to weight the observed number of methylated/un-methylated cytosines at a location. The files used here have been compressed to save space.

datadir <- system.file("extdata", package = "segmentSeq")
files <- c(
  "short_18B_C24_C24_trim.fastq_CG_methCalls.gz",
  "short_Sample_17A_trimmed.fastq_CG_methCalls.gz",
  "short_13_C24_col_trim.fastq_CG_methCalls.gz",
  "short_Sample_28_trimmed.fastq_CG_methCalls.gz")

mD <- readMeths(files = files, dir = datadir,
libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"),
nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134))
#> Reading files.......done!
#> Finding unique cytosines......done!
#> Processing samples.......done!

We can begin by plotting the distribution of methylation for these samples. The distribution can be plotted for each sample individually, or as an average across multiple samples. We can also subtract one distribution from another to visualise patterns of differential methylation on the genome.

par(mfrow = c(2,1))
dists <- plotMethDistribution(mD, main = "Distributions of methylation", chr = "Chr1")
plotMethDistribution(mD,
                     subtract = rowMeans(sapply(dists, function(x) x[,2])),
                     main = "Differences between distributions", chr = "Chr1")
Distributions of methylation on the genome (first two million bases of chromosome 1).

Distributions of methylation on the genome (first two million bases of chromosome 1).

Next, we process this alignmentData object to produce a segData object. This segData object contains a set of potential segments on the genome defined by the start and end points of regions of overlapping alignments in the alignmentData object. It then evaluates the number of tags that hit in each of these segments.

sD <- processAD(mD, cl = cl)
#> Chromosome: Chr1
#> Finding start-stop co-ordinates...done!
#> 249271 candidate loci found.

We can now construct a segment map from these potential segments.

Segmentation by heuristic Bayesian methods

A fast method of segmentation can be achieved by assuming a binomial distribution on the data with an uninformative beta prior, and identifying those potential segments which have a sufficiently large posterior likelihood that the proportion of methylation exceeds some critical value. This value can be determined by examining the data using the ‘thresholdFinder’ function, but expert knowledge is likely to provide a better guess as to where methylation becomes biologically significant.

thresh = 0.2
hS <- heuristicSeg(sD = sD, aD = mD, prop = thresh, cl = cl, gap = 100, getLikes = FALSE)
#> Number of candidate loci: 249271
#> Number of candidate nulls: 163635......done!
#> Strand *
#> Checking overlaps.....done.
#> Selecting loci...done!
hS
#> GRanges object with 2955 ranges and 0 metadata columns:
#>          seqnames          ranges strand
#>             <Rle>       <IRanges>  <Rle>
#>      [1]     Chr1         108-948      *
#>      [2]     Chr1       5449-5465      *
#>      [3]     Chr1            6452      *
#>      [4]     Chr1            6520      *
#>      [5]     Chr1       7020-7034      *
#>      ...      ...             ...    ...
#>   [2951]     Chr1 1993118-1993128      *
#>   [2952]     Chr1         1993149      *
#>   [2953]     Chr1         1993335      *
#>   [2954]     Chr1         1994611      *
#>   [2955]     Chr1 1994857-1994886      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> An object of class "lociData"
#> 2955 rows and 4 columns
#> 
#> Slot "replicates"
#> A A B B
#> Slot "groups":
#> list()
#> 
#> Slot "data":
#>      [,1]    [,2]    [,3]    [,4] 
#> [1,] 247:169 175:133 165:149 28:31
#> [2,] 0:49    1:86    23:36   2:2  
#> [3,] 2:0     5:3     0:3     0:1  
#> [4,] 0:3     0:9     2:1     0:0  
#> [5,] 0:119   2:78    20:37   0:2  
#> 2950 more rows...
#> 
#> Slot "annotation":
#> data frame with 0 columns and 2955 rows
#> 
#> Slot "locLikelihoods" (stored on log scale):
#> Matrix with  2955  rows.
#>        A   B
#> 1      1   1
#> 2      0   1
#> 3      1   0
#> 4      0   1
#> 5      0   1
#> ...  ... ...
#> 2951   1   1
#> 2952   1   1
#> 2953   1   0
#> 2954   1   0
#> 2955   0   1
#> 
#> Expected number of loci in each replicate group
#>    A    B 
#> 1968 2168

Within a methylation locus, it is not uncommon to find completely unmethylated cytosines. If the coverage of these cytosines is too high, it is possible that these will cause the locus to be split into two or more fragments. The mergeMethSegs function can be used to overcome this splitting by merging loci with identical patterns of expression that are not separated by too great a gap. Merging in this manner is optional, but recommended.

hS <- mergeMethSegs(hS, mD, gap = 5000, cl = cl)

We can then estimate posterior likelihoods on the defined loci by applying empirical Bayesian methods. These will not change the locus definition, but will assign likelihoods that the identified loci represent a true methylation locus in each replicate group.

hSL <- lociLikelihoods(hS, mD, cl = cl)

Segmentation by empirical Bayesian Methods

Classification of the potential segments can also be carried out using empirical Bayesian methods. These are extremely computationally intensive, but allow biological variation within replicates to be more accurately modelled, thus providing an improved identification of methylation loci.

%eBS <- classifySeg(sD, hS, mD, cl = cl)

Visualising loci

By one of these methods, we finally acquire an annotated methData object, with the annotations describing the co-ordinates of each segment.

We can use this methData object, in combination with the alignmentMeth object, to plot the segmented genome.

plotMeth(mD, hSL, chr = "Chr1", limits = c(1, 50000), cap = 10)
#> Plotting sample: 1
#> Plotting sample: 2
#> Plotting sample: 3
#> Plotting sample: 4

Differential Methylation analysis

We can also examine the methData object for differentially methylated regions using the beta-binomial methods Hardcastle:2013 implemented in baySeq. We first define a group structure on the data.

groups(hSL) <- list(NDE = c(1,1,1,1), DE = c("A", "A", "B", "B"))

The methObservables function pre-calculates a set of data to improve the speed of prior and posterior estimation (at some minor memory cost).

hSL <- methObservables(hSL)

The density function used here is a composite of the beta-binomial and a binomial distribution that accounts for the reported non-conversion rates.

densityFunction(hSL) <- bbNCDist

We can then determine a prior distribution on the parameters of the model for the data.

hSL <- getPriors(hSL, cl = cl)

We can then find the posterior likelihoods of the models defined in the groups structure.

hSL <- getLikelihoods(hSL, cl = cl)

We can then retrieve the data for the top differentially methylated regions.

topCounts(hSL, "DE")

Finally, to be a good citizen, we stop the cluster we started earlier:

if(!is.null(cl))
    stopCluster(cl)

Bibliography

Thomas J. Hardcastle and Krystyna A. Kelly and David C. Baulcombe. Identifying small RNA loci from high-throughput sequencing data. Bioinformatics (2012).

Thomas J. Hardcastle and Krystyna A. Kelly. Empirical Bayesian analysis of paired high-throughput sequencing data with a beta-binomial distribution. BMC Bioinformatics (2013).

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] segmentSeq_2.41.1           ShortRead_1.65.0           
#>  [3] GenomicAlignments_1.43.0    SummarizedExperiment_1.37.0
#>  [5] Biobase_2.67.0              MatrixGenerics_1.19.0      
#>  [7] matrixStats_1.4.1           Rsamtools_2.23.1           
#>  [9] Biostrings_2.75.1           XVector_0.47.0             
#> [11] BiocParallel_1.41.0         GenomicRanges_1.59.1       
#> [13] GenomeInfoDb_1.43.2         IRanges_2.41.1             
#> [15] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [17] generics_0.1.3              baySeq_2.41.0              
#> [19] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] xfun_0.49               bslib_0.8.0             hwriter_1.3.2.1        
#>  [4] latticeExtra_0.6-30     lattice_0.22-6          tools_4.4.2            
#>  [7] bitops_1.0-9            Matrix_1.7-1            RColorBrewer_1.1-3     
#> [10] lifecycle_1.0.4         GenomeInfoDbData_1.2.13 compiler_4.4.2         
#> [13] deldir_2.0-4            statmod_1.5.0           codetools_0.2-20       
#> [16] htmltools_0.5.8.1       sys_3.4.3               buildtools_1.0.0       
#> [19] sass_0.4.9              yaml_2.3.10             crayon_1.5.3           
#> [22] jquerylib_0.1.4         cachem_1.1.0            DelayedArray_0.33.2    
#> [25] limma_3.63.2            abind_1.4-8             locfit_1.5-9.10        
#> [28] digest_0.6.37           maketools_1.3.1         fastmap_1.2.0          
#> [31] grid_4.4.2              cli_3.6.3               SparseArray_1.7.2      
#> [34] S4Arrays_1.7.1          edgeR_4.5.0             UCSC.utils_1.3.0       
#> [37] rmarkdown_2.29          pwalign_1.3.0           httr_1.4.7             
#> [40] jpeg_0.1-10             interp_1.1-6            png_0.1-8              
#> [43] evaluate_1.0.1          knitr_1.49              rlang_1.1.4            
#> [46] Rcpp_1.0.13-1           BiocManager_1.30.25     jsonlite_1.8.9         
#> [49] R6_2.5.1                zlibbioc_1.52.0