Title: | Methylation-Based Inference of Regulatory Activity |
---|---|
Description: | DNA methylation contains information about the regulatory state of the cell. MIRA aggregates genome-scale DNA methylation data into a DNA methylation profile for a given region set with shared biological annotation. Using this profile, MIRA infers and scores the collective regulatory activity for the region set. MIRA facilitates regulatory analysis in situations where classical regulatory assays would be difficult and allows public sources of region sets to be leveraged for novel insight into the regulatory state of DNA methylation datasets. |
Authors: | Nathan Sheffield <http://www.databio.org> [aut], Christoph Bock [ctb], John Lawson [aut, cre] |
Maintainer: | John Lawson <[email protected]> |
License: | GPL-3 |
Version: | 1.29.0 |
Built: | 2024-10-30 08:21:36 UTC |
Source: | https://github.com/bioc/MIRA |
Adding methylProp column that has proportion of reads that were methylated for each site based on number of methylated reads divided by total number of reads. Note: Assigns methylProp column by reference with ":="
addMethPropCol(BSDTList)
addMethPropCol(BSDTList)
BSDTList |
A bisulfite datatable or list of datatables with a column for number of methylated reads (methylCount) and a column for number of total reads (coverage) for each cytosine that was measured. |
The BSDTList but with extra 'methylProp' column on each data.table in list.
data("exampleBSDT", package = "MIRA") exampleBSDT[, methylProp := NULL] # removing methylProp column addMethPropCol(list(exampleBSDT))
data("exampleBSDT", package = "MIRA") exampleBSDT[, methylProp := NULL] # removing methylProp column addMethPropCol(list(exampleBSDT))
The main function for aggregating methylation data in MIRA analysis. Aggregates methylation across all regions in a given region set to give a summary methylation profile for each region set.
aggregateMethyl(BSDT, GRList, binNum = 11, minBaseCovPerBin = 500)
aggregateMethyl(BSDT, GRList, binNum = 11, minBaseCovPerBin = 500)
BSDT |
A single data.table that has DNA methylation data on individual sites. Alternatively a BSseq object is allowed which will be converted internally to data.tables. The data.table input should have columns: "chr" for chromosome, "start" for cytosine coordinate, "methylProp" for proportion of methylation (0 to 1), optionally "methylCount" for number of methylated reads, and optionally "coverage" for total number of reads. |
GRList |
A GRangesList object containing region sets, each set corresponding to a type of regulatory element. Each region set in the list should be named. A named list of data.tables also works. |
binNum |
How many bins each region should be split into for aggregation of the DNA methylation data. |
minBaseCovPerBin |
Screen out region sets that have any bins in the final methylation profile with 'sumCoverage' below the 'minBaseCovPerBin' threshold. 'sumCoverage' is an output column: during aggregation, the 'coverage' values for each base in a bin are added, then these sums are added for corresponding bins from all regions, producing a 'sumCoverage' value for each bin. 'minBaseCovPerBin' is only used if there is a "coverage" column in the input methylation data.table. 'sumCoverage' is greater than or equal to the number of separate reads that contributed to a given bin. |
Each region is split into bins. For a given set of regions, methylation is first aggregated (averaged) within each bin in each region. Then methylation from corresponding bins from each region are aggregated (averaged) across all regions (all first bins together, all second bins together, etc.), giving an aggregate methylation profile. This process is done for each region set.
a data.table with binNum rows for each region set containing aggregated methylation data. If the input was a BSseq object with multiple samples, a list of data.tables will be returned with one data.table for each sample. Each region was split into bins; methylation was put in these bins; Output contains sum of the all corresponding bins for the regions of each region set, ie for all regions in each region set: first bins summed, second bins summed, etc. Columns of the output should be "bin", "methylProp", "sumCoverage" (only if coverage was an input column, described below), "featureID" (ID for the region set). For information on symmetry of bins and output when a region set has strand info, see ?BSBinAggregate. 'sumCoverage' is calculated as follows: during aggregation, the 'coverage' values for each base in a bin are added, then these sums are added for corresponding bins from all regions, producing a 'sumCoverage' value for each bin.
data("exampleBSDT", package = "MIRA") data("exampleRegionSet", package = "MIRA") exBinDT <- aggregateMethyl(exampleBSDT, exampleRegionSet)
data("exampleBSDT", package = "MIRA") data("exampleRegionSet", package = "MIRA") exBinDT <- aggregateMethyl(exampleBSDT, exampleRegionSet)
A data.table with binned methylation data for the first part of the "Applying MIRA to a Biological Question" vignette. Has data for 3 Ewing samples and 3 muscle-related samples (human skeletal myoblasts and other related samples). It is the product of 'aggregateMethyl'.
data(bigBinDT1)
data(bigBinDT1)
A data.table, data.frame object
A data.table with binned methylation data for the second part of the "Applying MIRA to a Biological Question" vignette. Has data for 12 Ewing samples and 12 muscle-related samples (human skeletal myoblasts and other related samples). It is the product of 'aggregateMethyl'.
data(bigBinDT2)
data(bigBinDT2)
A data.table, data.frame object
Given a start, end, and number of bins, to divide, this function will split the regions into bins. Bins will be only approximately the same size, due to rounding. (they should not be more than 1 different).
binRegion(start, end, bins, idDF = NULL, strand = "*")
binRegion(start, end, bins, idDF = NULL, strand = "*")
start |
Coordinate for beginning of range/range. |
end |
Coordinate for end of range/region. |
bins |
How many bins to divide this range/region. |
idDF |
A string/vector of strings that has chromosome (e.g. "chr1") for given start and end values |
strand |
"strand" column of the data.table (or single strand value if binRegion is only used on one region). Default is "*". |
Use case: take a set of regions, like CG islands, and bin them; now you can aggregate signal scores across the bins, giving you an aggregate signal in bins across many regions of the same type.
In theory, this just runs on 3 values, but you can run it inside a data.table j expression to divide a bunch of regions in the same way.
A data.table, expanded to nrow = number of bins, with these id columns: id: region ID binID: repeating ID (this is the value to aggregate across) ubinID: unique bin IDs
library(data.table) start <- c(100, 1000, 3000) end <- c(500, 1400, 3400) chr <- c("chr1", "chr1", "chr2") strand <- c("*", "*", "*") # strand not included in object # since MIRA assumes "*" already unless given something else regionsToBinDT <- data.table(chr, start, end) numberOfBins <- 15 # data.table "j command" using column names and numberOfBins variable binnedRegionDT <- regionsToBinDT[, binRegion(start, end, numberOfBins, chr)]
library(data.table) start <- c(100, 1000, 3000) end <- c(500, 1400, 3400) chr <- c("chr1", "chr1", "chr2") strand <- c("*", "*", "*") # strand not included in object # since MIRA assumes "*" already unless given something else regionsToBinDT <- data.table(chr, start, end) numberOfBins <- 15 # data.table "j command" using column names and numberOfBins variable binnedRegionDT <- regionsToBinDT[, binRegion(start, end, numberOfBins, chr)]
First bins regions and averages the proportion of methylation for all methylation sites within each bin (ie the methylation of all sites within region 1, bin 1 are averaged, then all sites within region 1, bin 2 are averaged, etc.) Then aggregates methylation across all regions by bin by averaging the proportion of methylation in each corresponding bin (ie all bin1's together, all bin2's together, etc.).
BSBinAggregate(BSDT, rangeDT, binNum, minBaseCovPerBin = 500, byRegionGroup = TRUE, splitFactor = NULL, hasCoverage = TRUE)
BSBinAggregate(BSDT, rangeDT, binNum, minBaseCovPerBin = 500, byRegionGroup = TRUE, splitFactor = NULL, hasCoverage = TRUE)
BSDT |
A single data table that has DNA methylation data on individual sites including a "chr" column with chromosome, a "start" column with the coordinate number for the cytosine, a "methylProp" column with proportion of methylation (0 to 1), optionally a "methylCount" column with number of methylated reads for each site, and optionally a "coverage" column with total number of reads for each site (hasCoverage param). |
rangeDT |
A data table with the sets of regions to be binned, with columns named "start", "end". Strand may also be given and will affect the output. See "Value" section. |
binNum |
Number of bins across the region. |
minBaseCovPerBin |
Filter out bins where the sum of coverage values is less than X before returning. |
byRegionGroup |
Default TRUE will aggregate methylation over corresponding bins for each region (all bin1's aggregated, all bin2's, etc). byRegionGroup = FALSE is deprecated. |
splitFactor |
With default NULL, aggregation will be done separately/individually for each sample. |
hasCoverage |
Default TRUE. Whether there is a coverage column |
With splitFactor = NULL, it will return a data.table with binNum rows, containing aggregated methylation data over regions in region set "rangeDT". Each region was split into bins; methylation was put in these bins; Output contains sum of the all corresponding bins for the regions of each region set ie for all regions in each region set: first bins summed, second bins summed, etc. Columns of the output should be "bin", "methylProp", and, if coverage was included as input col, "coverage"
Info about how strand of rangeDT affects output: The MIRA profile will be symmetrical if no strand information is given for the regions (produced by averaging the profile with the reverse of the profile), because the orientation of the regions is arbitrary with respect to biological features (like a promoter for instance) that could be oriented directionally (e.g. 5' to 3'). If strand information is given, regions on the minus strand will be flipped before being aggregated with plus strand regions so the MIRA profile will be in 5' to 3' orientation.
data("exampleBSDT") # exampleBSDT data("exampleRegionSet") # exampleRegionSet exampleBSDT <- addMethPropCol(exampleBSDT) aggregateBins <- BSBinAggregate(BSDT = exampleBSDT, rangeDT = exampleRegionSet, binNum = 11, splitFactor = NULL)
data("exampleBSDT") # exampleBSDT data("exampleRegionSet") # exampleRegionSet exampleBSDT <- addMethPropCol(exampleBSDT) aggregateBins <- BSBinAggregate(BSDT = exampleBSDT, rangeDT = exampleRegionSet, binNum = 11, splitFactor = NULL)
Parses the x/y format of methylation calls, splitting them into individual columns: "methylCount" column for number of methylated reads for site and "coverage" column for total number of reads covering that site. Input files should have the following columns: "chr", "start", "end", "meth", "rate", "strand".
BSreadBiSeq(files, contrastList = NULL, cores = 4, returnAsList = FALSE)
BSreadBiSeq(files, contrastList = NULL, cores = 4, returnAsList = FALSE)
files |
a vector of file paths |
contrastList |
Generally not needed for MIRA. A list of named character vectors, each with length equal to the number of items in files. These will translate into column names in the final table. |
cores |
number of processors. |
returnAsList |
Whether to return the output as a list or as one big data.table. |
This can run into memory problems if there are too many files... because of the way parallel lacks long vector support. The solution is to just use a single core; or to pass mc.preschedule = FALSE; This makes it so that each file is processed as a separate job. Much better.
Data from each input file joined together into one big data.table. If returnAsList = TRUE, then input from each file will be in its own data.table in a list.
shortBSDTFile <- system.file("extdata", "shortRRBS.bed", package = "MIRA") shortBSDT <- BSreadBiSeq(shortBSDTFile)
shortBSDTFile <- system.file("extdata", "shortRRBS.bed", package = "MIRA") shortBSDT <- BSreadBiSeq(shortBSDTFile)
Converts a BSseq object to a list of data.tables with one data.table per sample. A wrapper of the SummarizedExperimentToDataTable function.
bsseqToDataTable(BSseqObj)
bsseqToDataTable(BSseqObj)
BSseqObj |
An object of class BSseq, can have smoothed or raw methylation data. |
MIRAFormatBSDTList A list of data.tables containing the methylation data. One data.table per sample with the column names: 'chr', 'start' (methylation coordinate), 'methylCount' (number of methylated reads), 'coverage' (total number of reads), and 'methylProp' (proportion of methylated reads). The order of the list is the order of samples in the columns of the BSseq object. If sample names are in the BSseq object, then a named list will be returned.
data("exampleBSseqObj") MIRAFormatBSDTList <- bsseqToDataTable(exampleBSseqObj)
data("exampleBSseqObj") MIRAFormatBSDTList <- bsseqToDataTable(exampleBSseqObj)
This will take a data.table that has the methylation level in each bin in the MIRA profile and return a single score. For the "logRatio" method, this score summarizes how large the 'dip' in methylation is at the center of that methylation profile. A column for sample ID/name and a column for region set ID/name should be included in the data.table because a separate score will be given for each sample/region set combination.
See 'method' parameter for details on scoring calculations.
calcMIRAScore(binnedDT, shoulderShift = "auto", method = "logRatio", usedStrand = FALSE, regionSetIDColName = "featureID", sampleIDColName = "sampleName")
calcMIRAScore(binnedDT, shoulderShift = "auto", method = "logRatio", usedStrand = FALSE, regionSetIDColName = "featureID", sampleIDColName = "sampleName")
binnedDT |
A data.table with columns for: bin ("bin"), methylation level ("methylProp"), region set ID/name (default expected column name is "featureID" but this is configurable via a parameter), sample name (default expected column name is "sampleName" but this is configurable via a parameter). The bin column is not used for calculations since it is assumed by the function that the rows will be in the order of the bins (so the function will work without a bin column although the bin column assists in human readability of the input data.table) |
shoulderShift |
Used to determine the number of bins away from the center to use as the shoulders. Default value "auto" optimizes the shoulderShift variable for each sample/region set combination to try find the outside edges of the dip. shoulderShift may be manually set as an integer that will be used for all sample/region set combinations. "auto" does not currently work with region sets that include strand info. Brief description/example of the algorithm for "auto": for concave up MIRA profiles with an odd number of bins, the first potential shoulder will be the 2nd bin from the center (if the center was 0). This is the first bin that could not be included in center methylation value for scoring. From there, the shoulder will be changed to the next bin towards the outside of the profile if the next bin over has a higher methylation value or if the average methylation value of the next two bins are higher than the methylation value of the current shoulder bin. The potential shoulder bin keeps moving outward toward the edges of the MIRA profile until neither of these conditions are met and whatever bin it stops on is used for the shoulder in the MIRA score calculation. For symmetrical MIRA profiles (the general use case), the shoulders picked on both sides of the center will be the same number of bins away from the center bin (symmetrical shoulders). |
method |
The scoring method. "logRatio" is the log of the ratio of outer edges to the middle. This ratio is the average of outside values of the dip (shoulders) divided by either the center value if it is lower than the two surrounding values (lower for concave up profiles or higher for concave down profiles) or if it is not lower (higher for concave down profiles), an average of the three middle values. For an even binNum, the middle four values would be averaged with the 1st and 4th being weighted by half (as if there were 3 values). A higher score with "logRatio" corresponds to a deeper dip. "logRatio" is the only scoring method currently but more methods may be added in the future. |
usedStrand |
If strand information is included as part of an input region set when aggregating methylation, the MIRA profile will probably not be symmetrical. In this case, the automatic shoulderShift sensing (done when shoulderShift="auto") needs to be done for both sides of the dip instead of just one side so set usedStrand=TRUE if strand was included for a region set. usedStrand=TRUE only has an effect on the function when shoulderShift="auto". |
regionSetIDColName |
A character object. The name of the column that has region set names/identifiers. |
sampleIDColName |
A character object. The name of the column that has sample names/identifiers. |
A data.table with a column for region set ID (default name is featureID), sample ID (default name is sampleName), and MIRA score (with name "score"). There will be one row and MIRA score for each sample/region set combination. The MIRA score quantifies the "dip" of the MIRA profile which is an aggregation of methylation over all regions in a region set.
data("exampleBins") calcMIRAScore(exampleBins)
data("exampleBins") calcMIRAScore(exampleBins)
Has columns for name and sample type.
data(exampleAnnoDT1)
data(exampleAnnoDT1)
A data.table, data.frame object
A data.table with annotation data (sampleName and sampleType columns) for 24 Ewing samples and 24 muscle-related samples (human skeletal myoblasts and other related samples).
data(exampleAnnoDT2)
data(exampleAnnoDT2)
A data.table, data.frame object
Artificial data for two samples/conditions.
data(exampleBins)
data(exampleBins)
A data.table, data.frame object
This object contains reduced representation bisulfite sequencing data from the GM06990 B-lymphoblastoid cell line that was generated by the ENCODE project (GEO accession GSE27584, sample GSM980583). Has columns for position (chr, start), number methylated reads, and number of reads total. The sample name is Gm06990_1. Only a subset of the full data is used in this object and the data was switched from hg19 to hg38.
data(exampleBSDT)
data(exampleBSDT)
A data.table, data.frame object
A BSseq object which has the top 20 rows of the BSseq object "BS.chr22" which was included as built-in data in the bsseq package
data(exampleBSseqObj)
data(exampleBSseqObj)
A BSseq object
A GRanges object with a ranges corresponding to Nrf1 Chip-seq peaks in the GM12878 B-lymphoblastoid cell line (GEO accession GSE31477, sample GSM935309). Only a subset of the ranges are used in this object and the regions were switched from hg19 to hg38.
data(exampleRegionSet)
data(exampleRegionSet)
A GRanges object
MIRA is a score that infers regulatory activity of genomic elements based on DNA methylation data. It assesses the degree of dip in methylation level surrounding a regulatory site of interest, such as transcription factor binding sites. This package provides functions for aggregating methylation data across region sets, in bins.
Nathan Sheffield
John Lawson
Plot one or multiple methylation profiles. Displays each region set in a different subplot. If you only want to plot certain region sets, subset with the 'featID' parameter.
plotMIRAProfiles(binnedRegDT, featID = unique(binnedRegDT[, featureID]), plotType = "line", colBlindOption = FALSE, sampleTypeColName = "sampleType")
plotMIRAProfiles(binnedRegDT, featID = unique(binnedRegDT[, featureID]), plotType = "line", colBlindOption = FALSE, sampleTypeColName = "sampleType")
binnedRegDT |
A datatable with specific column names containing: bin numbers(binnedRegionDT column), aggregated methylation values (methylProp column), name of the region set (featureID column), case/control column (sampleType column), sample name (sampleName column). |
featID |
Region set names in a single string or vector of strings. |
plotType |
Line or jitter (ggplot2). |
colBlindOption |
If TRUE, function will plot with a color blind friendly palette which could be helpful when plotting multiple colors. |
sampleTypeColName |
character object. The name of the column that contains sample type or condition information (eg case vs control). Line color will be assigned based on this if it is present and there are more than two unique sample types. |
A plot of class "gg"/ "ggplot" that shows MIRA profiles
data("exampleBins", package = "MIRA") MIRAplot <- plotMIRAProfiles(binnedRegDT = exampleBins)
data("exampleBins", package = "MIRA") MIRAplot <- plotMIRAProfiles(binnedRegDT = exampleBins)
Splits up samples by sample type. Displays each region set in a different subplot. If you only want to plot certain region sets, subset with the 'featID' parameter.
plotMIRAScores(scoreDT, featID = unique(scoreDT[, featureID]), colBlindOption = FALSE)
plotMIRAScores(scoreDT, featID = unique(scoreDT[, featureID]), colBlindOption = FALSE)
scoreDT |
A datatable with the following columns: score, featureID (names of region sets), ideally include 'sampleType'. |
featID |
Region set name/names in a single string or vector of strings. |
colBlindOption |
If TRUE, function will plot with a color blind friendly palette which could be helpful when plotting multiple colors. |
Due to the limited number of colors in the palette, a warning will be issued if there are too many (more than 9) region sets ('featureID's).
a plot of class "gg"/"ggplot" that shows MIRA scores with geom_boxplot and geom_jitter (or geom_violin instead of boxplot if no sampleType column is given).
data(bigBinDT2) exScores <- calcMIRAScore(bigBinDT2) # adding annotation sampleType <- rep(c("Ewing", "Muscle-related"), each = 24) exScores <- cbind(exScores, sampleType) exScorePlot <- plotMIRAScores(exScores)
data(bigBinDT2) exScores <- calcMIRAScore(bigBinDT2) # adding annotation sampleType <- rep(c("Ewing", "Muscle-related"), each = 24) exScores <- cbind(exScores, sampleType) exScorePlot <- plotMIRAScores(exScores)
Combines data.tables from a list as data.table::rbindlist except an extra column is added which can mark which rows came from which original data.table. The name of this new column can be specified (newColName param) and the values are the names of the input list. This function was originally intended for a list where each data.table corresponds to a separate sample and the list names correspond to the names of the samples. Now all samples can be combined into one data.table but which sample the information came from can be kept.
rbindNamedList(namedList, newColName = "sampleName")
rbindNamedList(namedList, newColName = "sampleName")
namedList |
A list of data.tables. One sample per data.table/list item. Names of the list are the sample names. |
newColName |
The name of the column that will be added to the data.tables |
mergedDT A single data.table that combines those in the input list (data.table::rbindlist) but also added a column (default, newColName="sampleName") to each one to keep track of which rows were from which sample
# NOTE: generally sample names should be added after running aggregateMethyl() # rather than before in order to save memory (since data.table has less rows # after aggregation) data(exampleBSseqObj) multiSampleList <- bsseqToDataTable(exampleBSseqObj) combinedDT <- rbindNamedList(multiSampleList)
# NOTE: generally sample names should be added after running aggregateMethyl() # rather than before in order to save memory (since data.table has less rows # after aggregation) data(exampleBSseqObj) multiSampleList <- bsseqToDataTable(exampleBSseqObj) combinedDT <- rbindNamedList(multiSampleList)
Packages that use SummarizedExperiment-based classes for DNA methylation data which could be converted with this function include "bsseq", "methylPipe", and "BiSeq" packages. Of the methylCountDF, coverageDF, and methylPropDF arguments, either methylPropDF must be given or both methylCountDF and coverageDF must be given. After that, whichever is not given will be calculated from the others. If coverageDF and methylCountDF are both not given, coverage will be assumed to be 1 at each site with a methylProp value. Acceptable formats for the three "DF" parameters include: data.frame, data.table, matrix, and DelayedMatrix classes. If converting a BSseq object, see ?bsseqToDataTable for a convenient wrapper of this function.
SummarizedExperimentToDataTable(coordinates, methylCountDF = NULL, coverageDF = NULL, methylPropDF = NULL, sample_names = NULL)
SummarizedExperimentToDataTable(coordinates, methylCountDF = NULL, coverageDF = NULL, methylPropDF = NULL, sample_names = NULL)
coordinates |
Coordinates for the methylation loci as a GRanges (or GPos) object (in same order as methylCountDF/coverageDF/methylPropDF, whichever is given). The start coordinate should be the coordinate of the cytosine. |
methylCountDF |
An object of matrix/data.frame/similar format that contains the number of reads with methylated C's for the loci in 'coordinates' argument. It should have one column per sample with rows that correspond to 'coordinates'. |
coverageDF |
An object of matrix/data.frame/similar format that contains the total number of reads for the loci in 'coordinates' argument. It should have one column per sample with rows that correspond to 'coordinates'. |
methylPropDF |
An object of matrix/data.frame/similar format that contains the total number of reads for the loci in 'coordinates' argument. It should have one column per sample with rows that correspond to 'coordinates'. |
sample_names |
A character vector with sample names in the same order as the columns of methylCountDF/coverageDF/methylPropDF samples in the columns |
MIRAFormatBSDTList A list of data.tables containing the methylation data. One data.table per sample with the column names: 'chr', 'start' (methylation coordinate), 'methylCount' (number of methylated reads), 'coverage' (total number of reads), and 'methylProp' (proportion of methylated reads). The order of the list is the order of samples in the columns of methylCountDF/coverageDF/methylPropDF. If sample names are explicitly given as input ('sample_names' argument) or can be derived from other arguments to the function then a named list will be returned.
data("exampleBSseqObj") MIRAFormatBSDTList <- SummarizedExperimentToDataTable(coordinates = bsseq::granges(exampleBSseqObj), methylCountDF = bsseq::getCoverage(BSseq = exampleBSseqObj, type = "M"), coverageDF = bsseq::getCoverage(BSseq = exampleBSseqObj, type = "Cov"), methylPropDF = bsseq::getMeth(BSseq = exampleBSseqObj, type = "raw"), sample_names = bsseq::sampleNames(exampleBSseqObj))
data("exampleBSseqObj") MIRAFormatBSDTList <- SummarizedExperimentToDataTable(coordinates = bsseq::granges(exampleBSseqObj), methylCountDF = bsseq::getCoverage(BSseq = exampleBSseqObj, type = "M"), coverageDF = bsseq::getCoverage(BSseq = exampleBSseqObj, type = "Cov"), methylPropDF = bsseq::getMeth(BSseq = exampleBSseqObj, type = "raw"), sample_names = bsseq::sampleNames(exampleBSseqObj))