Simulating Whole-Genome Inherited Bisulphite Sequencing Data


Package: methInheritSim
Authors: Pascal Belleau, Astrid Deschênes and Arnaud Droit
Version: 1.29.0
Compiled date: 2024-12-19
License: Artistic-2.0

Licensing

The methInheritSim package and the underlying methInheritSim code are distributed under the Artistic license 2.0. You are free to use and redistribute this software.

Citing

If you use this package for a publication, we would ask you to cite the following:

Pascal Belleau, Astrid Deschênes, Marie-Pier Scott-Boyer, Romain Lambrot, Mathieu Dalvai, Sarah Kimmins, Janice Bailey, Arnaud Droit; Inferring and modeling inheritance of differentially methylated changes across multiple generations, Nucleic Acids Research, Volume 46, Issue 14, 21 August 2018, Pages e85. DOI: https://doi.org/10.1093/nar/gky362

Introduction

DNA methylation plays an important role in the biology of tissue development and diseases. High-throughput sequencing techniques enable genome-wide detection of differentially methylated elements (DME), commonly sites (DMS) or regions (DMR). The analysis of treatment effects on DNA methylation, from one generation to the next (inter-generational) and across generations that were not exposed to the initial environment (trans-generational) represent complex designs. There are two main approaches to study the methylation inheritance, the first is based on segregation in pedigree while the second uses the intersection between the DME of each generation (useful when pedigree is unknown). The power and the false positve rate of those types of design are relatively hard to evaluate.

We present a package that simulates the methylation inheritance. Using real datasets, the package generates a synthetic chromosome by sampling regions. Two different distributions are used to simulate the methylation level at each CpG site: one for the DMS and one for all the other sites. The second distribution takes advantage of parameters estimated using the control datasets. The package also offers the option to select the proportion of sites randomly fixed as DMS, as well as, the fraction of the cases that inherited the DMS in the subsequent generations.

The methInheritSim package generates simulated multigenerational DMS datasets that are useful to evaluate the power and the false discovery rate of experiment design analysis, such as the methylInheritance package does.The multigenerational DMS datasets can also be used to compare the efficiency of different inheritance detection software.

Loading methInheritSim package

As with any R package, the methInheritSim package should first be loaded with the following command:

library(methInheritSim)

Description of the simulation process

The first step of the simulation process is to create a synthetic chromosome made up of methylated sites. The synthetic methylated sites (or CpG sites) are generated using a real dataset (methData parameter). The read dataset only needs to contain methylation for controls on one generation; a real multigenerational dataset is not needed.

Two parameters are critical during this process:

  • nbBlock: The number of blocks randomly selected in the real dataset genome.
  • nbCpG: The number of consecutive methylated sites that must contain each selected block.

Those two parameters unable to reproduce CpG islands of customizable size. It also reproduces the relation between the methylation level and the distance associated to adjacent methylated sites.

Figure 1. Creation of a synthetic chromosome
Figure 1. Creation of a synthetic chromosome

For each methylated site of the synthetic chromosome, the alpha and beta parameters of a Beta distribution are estimated from the mean and variance of the proportion of C/T at the site of the real control dataset.

Simulated control dataset

A Beta distribution is used to simulate the proportion of C/T in the methylated sites of the simulated control dataset.

Using the synthetic chromosome, DMS are randomly selected from the methylated sites. The rateDiff parameter fixes the mean of the proportion sites that are differentially methylated (DMS).

To recreate differentially methylated regions (DMR), the successors site of a DMS, located within 1000 base pairs, has a higher probability to be selected as a DMS.

The inheritance is done through the DMR. This means that when the following generation inherits of a DMR region, it inherits all of the DMS present in the region. The propInherite parameter fixes the proportion of DMR that are inherited.

Simulated case dataset

For the methylated sites in the F1 generation of the simulated case dataset, a Beta distribution is used to simulate the proportion of C/T. This is the exact same distribution as for the control dataset.

A proportion of cases, fixed by the vpDiff parameter, are selected to be have DMS. Those DMS are assigned an updated proportion of C/T that follows a shifted Beta distribution with parameters estimated using the mean of control ± vDiff. The vpDiff parameter is similar to penetrance. Not all sites of the selected cases will have DMS, only a proportion of those sites, as fixed by rateDiff than represent the mean proportion of sites selected as DMS.

In the subsequent generation, only a proportion of the DMS present in the initial simulated case dataset are selected to be inherited. The proposition of inherited DMS is calculated as:

vpDiff × vInheritancenumber of generations after F2

The proportion of C/T of those selected inherited sites follows a shifted Beta distribution with parameters estimated using mean of control ± (vDiff xpropHetero). The propHetero is 0.5 if one of the parent is a control.

Case study

The simulated dataset

A dataset containing methylation data (6 cases and 6 controls) has been generated using the methInheritSim package using a real dataset from Rat experiment (the real dataset is not public yet, so we used a simulation based on it). The data have been formated, using
the methylkit package, into a methylBase object (using the methylkit functions: filterByCoverage, normalizeCoverage and unite).

## Load  read DMS dataset (not in this case but normaly)
data(samplesForChrSynthetic)

## Print the first three rows of the object
head(samplesForChrSynthetic, n = 3)
##   chr start  end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2
## 1   S  1000 1000      +        89      0     89        70      0     70
## 2   S  7090 7090      +        82      1     81        94      1     93
## 3   S  7204 7204      +        72      2     70        91      0     91
##   coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 coverage5 numCs5 numTs5
## 1        84      3     81        85      0     85        69      0     69
## 2        87      0     87        87      2     85        93      1     92
## 3        78      0     78        82      0     82        80      1     79
##   coverage6 numCs6 numTs6 coverage7 numCs7 numTs7 coverage8 numCs8 numTs8
## 1        84      0     84        80      0     80        81      0     81
## 2        78      1     77        71      3     68        77      1     76
## 3        65      1     64        73      1     72        76      0     76
##   coverage9 numCs9 numTs9 coverage10 numCs10 numTs10 coverage11 numCs11 numTs11
## 1        92      0     92         71       1      70         92       2      90
## 2        72      1     71         83       1      82         84       1      83
## 3        67      0     67         78       1      77         86       3      83
##   coverage12 numCs12 numTs12
## 1         95       0      95
## 2         92       1      91
## 3         71       1      70

The simulation

The simulation is run using the runSim function. The outputDir parameter fixes the directory where the results are stored.

## Directory where the files related to the simulation will be saved
temp_dir <- "test_runSim"

## Run the simulation
runSim(methData = samplesForChrSynthetic,  # The dataset use for generate 
                                           # the synthetic chr.
        nbSynCHR = 1,       # The number of synthetic chromosome
        nbSimulation = 2,   # The number of simulation for each parameter
        nbBlock = 10, nbCpG = 20, # The number of site in the 
                                  # synthetic chr is nbBLock * nbCpG
        nbGeneration = 3,    # At least 2 generations must be present
        vNbSample = c(3, 6), # The number of controls (= number of cases) in
                             # each simulation
        vpDiff = c(0.9),   # Mean proportion of samples with  
                           # differentially methylated values
        vpDiffsd = c(0.1), # Standard deviation associated to vpDiff
        vDiff = c(0.8),    # The shift of the mean of the C/T ratio in 
                           # the differentially methylated sites
        vInheritance = c(0.5),  # The proportion of cases that inherit 
                                # differentially methylated sites
        propInherite = 0.3,     # The proportion of diffementially methylated
                                # regions that are inherited
        rateDiff = 0.3,    # The mean frequency of the differentially 
                           # methylated regions
        minRate = 0.2,     # The minimum rate for differentially
                           # methylated sites
        propHetero = 0.5,  # The reduction of vDiff for the following
                           # generations
        keepDiff = FALSE,  # When FALSE, the differentially methylated
                           # sites are the same in all simulations
        outputDir = temp_dir, # Directory where files are saved
        fileID = "S1",
        runAnalysis = TRUE, 
        nbCores = 1, 
        vSeed = 32)        # Fix seed to unable reproductive results
## [1] 0
        
        # The files generated
        dir(temp_dir)
##  [1] "methDiff_S1_1_3_0.9_0.8_0.5_1.rds"  "methDiff_S1_1_3_0.9_0.8_0.5_2.rds" 
##  [3] "methDiff_S1_1_6_0.9_0.8_0.5_1.rds"  "methDiff_S1_1_6_0.9_0.8_0.5_2.rds" 
##  [5] "meth_S1_1_3_0.9_0.8_0.5_1.rds"      "meth_S1_1_3_0.9_0.8_0.5_2.rds"     
##  [7] "meth_S1_1_6_0.9_0.8_0.5_1.rds"      "meth_S1_1_6_0.9_0.8_0.5_2.rds"     
##  [9] "methylGR_S1_1_3_0.9_0.8_0.5_1.rds"  "methylGR_S1_1_3_0.9_0.8_0.5_2.rds" 
## [11] "methylGR_S1_1_6_0.9_0.8_0.5_1.rds"  "methylGR_S1_1_6_0.9_0.8_0.5_2.rds" 
## [13] "methylObj_S1_1_3_0.9_0.8_0.5_1.rds" "methylObj_S1_1_3_0.9_0.8_0.5_2.rds"
## [15] "methylObj_S1_1_6_0.9_0.8_0.5_1.rds" "methylObj_S1_1_6_0.9_0.8_0.5_2.rds"
## [17] "simData_S1_1_3_0.9_0.8_0.5_1.rds"   "simData_S1_1_3_0.9_0.8_0.5_2.rds"  
## [19] "simData_S1_1_6_0.9_0.8_0.5_1.rds"   "simData_S1_1_6_0.9_0.8_0.5_2.rds"  
## [21] "stateDiff_S1_1_3_0.9_0.8_0.5_1.rds" "stateDiff_S1_1_3_0.9_0.8_0.5_2.rds"
## [23] "stateDiff_S1_1_6_0.9_0.8_0.5_1.rds" "stateDiff_S1_1_6_0.9_0.8_0.5_2.rds"
## [25] "syntheticChr_S1_1.rds"              "treatment_S1_1_3.rds"              
## [27] "treatment_S1_1_6.rds"

Files generated by the simulation

Three types of files are generated by default:

  1. Synthetic chromosome in GRanges format (“syntheticChr” prefix) 
  2. Simulation information in GRanges format (“simData” prefix) 
  3. Information about DMS state (“stateDiff” prefix)

Synthetic chromosome in GRanges format (“syntheticChr” prefix)

The first type of files contains information about the synthetic chromosome. This information is stored as a GRanges that contains the CpG (or methylated sites).The GRanges has four metadata inherited from the real dataset:

  • chrOri, the chromosome from the real dataset
  • startOri, the position of the site in the real dataset
  • meanCTRL, the mean of the C/T proportion of the control in the real dataset
  • varCTRL, the variance of the C/T proportion of the control in the real dataset

The file name is composed of those elements, separated by “_“:

  1. The string “syntheticChr”
  2. The code of the simulation (the fileID parameter, ex: “S1”“)
  3. The chromosome number
  4. The file extension “.rds”

An example of a valid file name: syntheticChr_S1_1.rds

## The synthetic chromosome
syntheticChr <- readRDS("demo_runSim/syntheticChr_S1_1.rds")

## In GRanges format, only Cpg present
head(syntheticChr, n=3)
## GRanges object with 3 ranges and 4 metadata columns:
##       seqnames    ranges strand |    chrOri  startOri   meanCTRL     varCTRL
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>  <numeric>   <numeric>
##   [1]        S      1000      + |         1  13520847 0.07650477 4.28722e-03
##   [2]        S      1029      + |         1  13520876 0.04003131 7.69157e-04
##   [3]        S      1042      + |         1  13520889 0.00185185 2.05761e-05
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Simulation information in GRanges format (“simData” prefix)

The second type of files contains information about the simulation stored in a GRanges format. The GRanges object has four metadata related to real dataset:

  • meanDiff, the mean of the C/T proportion for the shifted distribution
  • meanCTRL.meanCTRL, the mean of the C/T proportion for the control distribution
  • partitionCase, the number of cases simulated with the shifted distribution
  • partitionCtrl, the number of cases simulated with the control distribution

Plus a metadata for each sample (case or control):

  • case.V[number] or ctrl.V[number] the simulated proportion of C/T

The file name is composed of those elements, separated by “_“:

  1. The string “simData”
  2. The code of the simulation (the fileID parameter, ex: “S1”)
  3. The chromosome number (between 1 and the value of the nbSynCHR)
  4. The number of controls as specified by the vNBSample parameter
  5. The value of the vpDiff parameter
  6. The value of the vDiff parameter
  7. The value of the vInheritance parameter
  8. The ID of the simulation (between 1 and the value of the nbSimulation)
  9. The file extension “.rds”

An example of a valid file name: simData_S1_1_3_0.9_0.8_0.5_1.rds

#### The simulation dataset
simData <- readRDS("demo_runSim/simData_S1_1_3_0.9_0.8_0.5_1.rds")

#### Information for the first generation F1
head(simData[[1]], n=3)
## GRanges object with 3 ranges and 10 metadata columns:
##       seqnames    ranges strand |  meanDiff meanCTRL.meanCTRL partitionCase
##          <Rle> <IRanges>  <Rle> | <numeric>         <numeric>     <numeric>
##   [1]        S      1000      + | 0.0765048        0.07650477             0
##   [2]        S      1029      + | 0.8400313        0.04003131             3
##   [3]        S      1042      + | 0.8018519        0.00185185             3
##       partitionCtrl     ctrl.V1    ctrl.V2     ctrl.V3   case.V1   case.V2
##           <numeric>   <numeric>  <numeric>   <numeric> <numeric> <numeric>
##   [1]             3 0.054890382 0.04634528 0.055577894 0.0242984 0.0209004
##   [2]             0 0.041316815 0.10223372 0.046938232 0.8598469 0.8496649
##   [3]             0 0.000186761 0.00234001 0.000133774 0.8049543 0.7933025
##         case.V3
##       <numeric>
##   [1] 0.0342341
##   [2] 0.8339943
##   [3] 0.8032047
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

#### Information for the second generation F2
head(simData[[2]], n=3)
## GRanges object with 3 ranges and 10 metadata columns:
##       seqnames    ranges strand |   meanDiff meanCTRL.meanCTRL partitionCase
##          <Rle> <IRanges>  <Rle> |  <numeric>         <numeric>     <numeric>
##   [1]        S      1000      + | 0.07650477        0.07650477             0
##   [2]        S      1029      + | 0.04003131        0.04003131             0
##   [3]        S      1042      + | 0.00185185        0.00185185             0
##       partitionCtrl    ctrl.V1   ctrl.V2     ctrl.V3     case.V1     case.V2
##           <numeric>  <numeric> <numeric>   <numeric>   <numeric>   <numeric>
##   [1]             3 0.04712026 0.1291856 1.28747e-01 0.113037704 8.89853e-02
##   [2]             3 0.00829558 0.0248276 1.99802e-02 0.063738300 8.58190e-03
##   [3]             3 0.00176697 0.0175513 2.60458e-10 0.000874404 6.02450e-11
##           case.V3
##         <numeric>
##   [1] 6.03347e-02
##   [2] 6.75514e-02
##   [3] 4.15936e-07
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Information about DMS state (“stateDiff” prefix)

The third type of files contains a list with 2 entries. The first entry is called stateDiff and contains a vector of integer (0 and 1) with a length corresponding the length of stateInfo object. The statDiff object indicates, using a 1, the positions where the CpG sites are differentially methylated. The second entry is called statInherite and contains a vector of integer (0 and 1) with a length corresponding the length of stateInfo. The statInherite indicates, using a 1, the positions where the CpG values are inherited.

The file name is composed of those elements, separated by “_“:

  1. The string “stateDiff”
  2. The code of the simulation (the fileID parameter, ex: “S1”)
  3. The chromosome number (between 1 and the value of the nbSynCHR)
  4. The number of controls as specified by the vNBSample parameter
  5. The value of the vpDiff parameter
  6. The value of the vDiff parameter
  7. The value of the vInheritance parameter
  8. The ID of the simulation (between 1 and the value of the nbSimulation)
  9. The file extension “.rds”

An example of a valid file name: stateDiff_S1_1_3_0.9_0.8_0.5_1.rds

#### The DMS state information
stateDiff <- readRDS("demo_runSim/stateDiff_S1_1_3_0.9_0.8_0.5_1.rds")

#### In stateDiff, the position of DMS is indicated by 1
#### in stateInherite, the position of inherited DMS is indicated by 1
head(stateDiff)
## $stateDiff
##   [1] 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 0
##  [38] 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
##  [75] 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
## [112] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 0
## [149] 1 0 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0
## [186] 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0
## 
## $stateInherite
##   [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [186] 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0

Conclusion

The methInheritSim package generates simulated multigenerational DMS datasets. Several simulator parameters can be derived from real dataset provided by the user in order to replicate realistic case-control scenarios.

The results of a simulation could be analysed, using the methylInheritance package, to evaluate the power and the false discovery rate of an experiment design.