Getting Started with Methylation-based Inference of Regulatory Activity

The MIRA Bioconductor package

This vignette gives a broad overview of MIRA. You can find a realistic example workflow, in “Applying MIRA to a Biological Question.”

MIRA (Methylation-based Inference of Regulatory Activity) is an R package that infers regulatory activity from DNA methylation data. It does this by aggregating DNA methylation data from a set of regions across the genome, producing a single summary profile of DNA methylation for those regions. MIRA then uses this profile to produce a score (the MIRA score) that infers the level of regulatory activity on the basis of the shape of the DNA methylation profile. The region set is provided by the user and should contain regions that correspond to a shared feature, such as binding of a particular transcription factor or DNase hypersensitivity sites. The concept of MIRA relies on the observation that DNA methylation tends to be lower in regions where transcription factors are bound. Since DNA methylation will generally be lower in active regions, the shape of the MIRA profile and the associated score can be used as a metric to compare regulatory activity in different samples and conditions. MIRA thus allows one to predict transcription factor activity data given only DNA methylation data as input. MIRA works with genome-scale, single-nucleotide-resolution methylation data, such as reduced representation bisulfite sequencing (RRBS) or whole genome bisulfite sequencing (WGBS) data. MIRA overcomes sparsity in DNA methylation data by aggregating across many regions. Below are examples of methylation profiles and associated MIRA scores for two (contrived) samples. This vignette will demonstrate how to obtain a methylation profile and MIRA score from the starting inputs of DNA methylation data and a region set.

##     featureID sampleName     score sampleType
##        <char>     <char>     <num>     <char>
## 1: RegionSet1    Sample1 0.9808293 Condition1
## 2: RegionSet1    Sample2 0.2876821 Condition2

Required inputs

You need 2 things to run MIRA:

  1. Nucleotide-resolution DNA methylation data
  2. A set of genomic regions

Let’s describe each one in more detail:

DNA methylation data

MIRA requires DNA methylation data after methylation calling. For a given genomic coordinate (the location of the C in a CpG), MIRA needs the methylation level (0 to 1) or data that can be used to calculate this: counts of methylated reads and total reads. The total number of reads can be used for screening purposes. This data should be represented as a data.table for each sample, which we call a BSDT (Bisulfite data.table). The BSDT will have these columns: chr, start (the coordinate of the C of the CpG) and the methylProp column with methylation level at each cytosine. Alternatively, the methylProp column may be generated from methylCount (number of methylated reads) and coverage (total number of reads covering this site) columns where methylProp = methylCount/coverage and these columns may be included in the data.table in addition to the methylProp column. When running MIRA on multiple samples, it is recommended to make each sample an item of a named list, with sample names as the names for the list. Since some existing R packages for DNA methylation use different formats, we include a format conversion function that can be used to convert SummarizedExperiment-based objects like you would obtain from the bsseq, methylPipe, and BiSeq packages to the necessary format for MIRA (SummarizedExperimentToDataTable function). A BSseq object is an acceptable input to MIRA and will be converted to the right format internally but for the sake of parallelization it is recommended that BSseq objects be converted to the right format outside the MIRA function so that MIRA can be run on each sample in parallel. Here is an example of a data.table in the right format for input to MIRA:

data("exampleBSDT", package="MIRA")
head(exampleBSDT)
## Key: <chr, start>
##       chr   start methylCount coverage
##    <char>   <int>       <int>    <int>
## 1:   chr1 1062326           0        3
## 2:   chr1 1062330           0        3
## 3:   chr1 1062339           0        4
## 4:   chr1 1062448           2       27
## 5:   chr1 1062478           0       23
## 6:   chr1 1062480           0       19

Region sets

A region set is a GRanges object containing genomic regions that share a biological annotation. For example, it could be ChIP peaks for a transcription factor. Many types of region sets may be used with MIRA, including ChIP regions for transcription factors or histone modifications, promoters for a set of related genes, sites of motif matches, or DNase hypersensitivity sites. Many such region sets may be found in online repositories and we have pulled together some major sources at http://databio.org/regiondb. At the end of this vignette, we show how to load a database of regions with a few lines of code using the LOLA package. You may also want to check out the AnnotationHub Bioconductor package which gives access to many region sets in an R-friendly format. For use in MIRA, each region set should be a GRanges object and multiple region sets may be passed to MIRA as a GRangesList with each list element being a region set. Here is an example of a region set, which we will use in this vignette:

data("exampleRegionSet", package="MIRA")
head(exampleRegionSet)
## GRanges object with 6 ranges and 0 metadata columns:
##        seqnames          ranges strand
##           <Rle>       <IRanges>  <Rle>
##   2763     chr1 1064067-1064582      *
##   2433     chr1 1307973-1308488      *
##   2948     chr1 1374984-1375499      *
##   3845     chr1 1778779-1779294      *
##    430     chr1 1780451-1780653      *
##   2249     chr1 2578251-2578766      *
##   -------
##   seqinfo: 36 sequences from an unspecified genome; no seqlengths

Analysis workflow

The general workflow is as follows:
1. Data inputs: start with single-nucleotide resolution methylation data and one or more sets of genomic regions, as described above. 2. Expand the regions sizes so that MIRA will be able to get a broad methylation profile surrounding your feature of interest. All regions should be expanded to the same final size. 3. Aggregate methylation data across regions to get a MIRA profile. 4. Calculate MIRA score based on shape of MIRA profile.

The input

First let’s load the necessary libraries and example data. exampleBSDT is a data.table containing genomic location, number of reads methylated (methylCount), and total read number (coverage) from bisulfite sequencing of a lymphoblastoid cell line. exampleRegionSet is a GRanges object with regions that share a biological annotation, in this case Nrf1 binding in a different lymphoblastoid cell line.

library(MIRA)
library(GenomicRanges) # for the `resize`, `width` functions
data("exampleRegionSet", package="MIRA")
data("exampleBSDT", package="MIRA")

While we only have one sample (Gm06990_1) in this example, we would normally want to have our samples in a named list so we will use that format here as well:

BSDTList <- list(exampleBSDT)
names(BSDTList) <- "Gm06990_1"

Since our input methylation data did not have a column for proportion of methylation at each site (methylProp), we need to add this based on the methylation count and total coverage data.

BSDTList <- addMethPropCol(BSDTList)

Expand your regions

A short but important step. MIRA scores are based on the difference between methylation surrounding the regions versus the methylation at the region itself, so we need to expand our regions to include surrounding bases. Let’s use the resize function from the GenomicRanges package to increase the size of our regions and make them all the same size. It is recommended to make all regions the same size so the final methylation profile will be easier to interpret. The average size of our regions is currently about 500:

mean(width(exampleRegionSet))
## [1] 498.788

For this vignette, 4000 bp regions should be wide enough to capture the dip:

exampleRegionSet <- resize(exampleRegionSet, 4000, fix="center")
mean(width(exampleRegionSet))
## [1] 4000

Normally, we want to have each region set (only one in our case) be an item in a list, with corresponding names given to the list:

exampleRegionSetGRL <- GRangesList(exampleRegionSet)
names(exampleRegionSetGRL) <- "lymphoblastoid_NRF1"

Aggregating methylation across regions

Next we aggregate methylation across regions to get a summary methylation profile with the aggregateMethyl function. MIRA divides each region into bins, aggregates methylation levels for cytosines within a given bin for each region individually, and then aggregates matching bins over all the regions (all 1st bins together, 2nd bins together, etc.). A couple important parameters: The binNum argument determines how many approximately equally sized bins each region is split into. This could affect the resolution or noisiness of the MIRA profile because using more bins will result in smaller bin sizes (potentially increasing resolution) but also less reads per bin (potentially increasing noise). The minBaseCovPerBin argument in aggregateMethyl is used to screen out region sets that have any bins with less than a minimum coverage. Here we use the default (500). Let’s aggregate the methylation and then view the MIRA profile.

bigBin <- lapply(X=BSDTList, FUN=aggregateMethyl, GRList=exampleRegionSetGRL, 
                binNum=11)
bigBinDT <- bigBin[[1]]

We add the sample name to the data.table then plot the MIRA profile:

sampleName = rep(names(bigBin), nrow(bigBinDT))
bigBinDT = cbind(bigBinDT, sampleName)
plotMIRAProfiles(binnedRegDT=bigBinDT)

Calculating the MIRA score

To calculate MIRA scores based on the MIRA profiles, we will apply the scoring function, calcMIRAScore, to the data.table that contains the methylation aggregated in bins. calcMIRAScore calculates a score for each group of bins corresponding to a sample and region set, based on the degree of the dip in the profile. With MIRA’s default scoring method (logratio), calcMIRAScore will take the log of the ratio of the outside edges of the dip (identified by calcMIRAScore) to the center of the dip. Higher MIRA scores are associated with deeper dips. A flat MIRA profile would have a score of 0.

sampleScores <- calcMIRAScore(bigBinDT,
                        regionSetIDColName="featureID",
                        sampleIDColName="sampleName")
head(sampleScores)
##              featureID sampleName    score
##                 <char>     <char>    <num>
## 1: lymphoblastoid_NRF1  Gm06990_1 1.393593

A note on annotation. For real uses of MIRA, samples and region sets should be annotated. In order to save memory, it is not recommended that sample name or sample type be included as columns in the BSDT before the aggregation step. An annotation file that matches sample name with sample type (sampleType column) can be used to add sample type information to the data.table after aggregating methylation. A sampleType column is not required for the main functions but is used for the plotting functions. See “Applying MIRA to a Biological Question” vignette for a workflow including annotation.

Interpreting the results

Regulatory information can be inferred from the MIRA scores and profiles but interpretation depends on the type of region set and samples used. The basic assumption is that deeper dips and higher MIRA scores would be associated with generally higher activity at the regions that were tested, which might also inform you about the activity of the regulatory feature that defines the region set (eg the activity of a transcription factor if using ChIP-seq regions). However, MIRA does not directly tell you the activity of the regulatory feature and, in some cases, higher MIRA scores will not be associated with higher activity of the regulatory feature. For example, samples with higher MIRA scores for a transcription factor binding region set could have more activity of that transcription factor but there could be other causes like increased activity of a different transcription factor that binds to many of the same regions. Additionally, it is possible that the samples with higher scores generally had more similar chromatin states to the samples from which the region set was derived than samples with lower scores did (eg when using MIRA on samples from multiple tissue/cell types and the region set was derived from the same tissue/cell type as some but not all of the samples). The general interpretation rule is that a higher MIRA score means more activity but we encourage the user to think carefully about what biological question they are actually addressing based on the region sets and samples used. For more on interpreting the results, see the “Applying MIRA to a Biological Question” vignette.

Bonus: Loading region sets with LOLA

Here is one simple way to get region sets to use with MIRA. First, download the LOLA Core Database (actual files, not the cached version) from here (this might take a while). Next, let’s load the LOLA package and use it to load some of the region sets into R!

The following code will not be evaluated because of the large files involved and long loading time but shows the general process (loading the almost 3,000 region sets could take 10-30 minutes). Alternatively to avoid the long loading process, you can just download the database and load region sets into R one by one with data.table::fread or another such function.

library(LOLA)
pathToDB <- "path/to/LOLACore/hg38"
regionDB <- loadRegionDB(pathToDB)

The regionDB is a list that includes some annotation and a GRangesList of region sets that can be accessed with regionDB$regionGRL. Check out the LOLA docs here if you want some more information. Now you have plenty of region sets to use with MIRA!