Title: | Epigenomic R-based analysis with hidden Markov models |
---|---|
Description: | epigraHMM provides a set of tools for the analysis of epigenomic data based on hidden Markov Models. It contains two separate peak callers, one for consensus peaks from biological or technical replicates, and one for differential peaks from multi-replicate multi-condition experiments. In differential peak calling, epigraHMM provides window-specific posterior probabilities associated with every possible combinatorial pattern of read enrichment across conditions. |
Authors: | Pedro Baldoni [aut, cre] |
Maintainer: | Pedro Baldoni <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.15.0 |
Built: | 2024-10-30 07:48:57 UTC |
Source: | https://github.com/bioc/epigraHMM |
This function adds model offsets to epigraHMMDataSet
addOffsets(object, offsets)
addOffsets(object, offsets)
object |
an epigraHMMDataSet |
offsets |
a matrix with model offsets |
To be added
An epigraHMMDataSet with an 'offsets' assay filled in.
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- list('counts' = matrix(rpois(4e5,10),ncol = 4), 'controls' = matrix(rpois(4e5,5),ncol = 4)) colData <- data.frame(condition = c('A','A','B','B'), replicate = c(1,2,1,2)) object <- epigraHMMDataSetFromMatrix(countData,colData) # Adding pre-computed offsets object <- addOffsets(object = object, offsets = matrix(rnorm(4e5),ncol = 4))
# Creating dummy object countData <- list('counts' = matrix(rpois(4e5,10),ncol = 4), 'controls' = matrix(rpois(4e5,5),ncol = 4)) colData <- data.frame(condition = c('A','A','B','B'), replicate = c(1,2,1,2)) object <- epigraHMMDataSetFromMatrix(countData,colData) # Adding pre-computed offsets object <- addOffsets(object = object, offsets = matrix(rnorm(4e5),ncol = 4))
Given results from epigraHMM's differential peak caller, this function will output either posterior probabilities or combinatorial patterns associated with the mixture components of the embedded mixture model.
callPatterns( object, peaks, hdf5 = metadata(object)$output, type = "all", fdr = NULL, pattern = NULL, ranges = NULL )
callPatterns( object, peaks, hdf5 = metadata(object)$output, type = "all", fdr = NULL, pattern = NULL, ranges = NULL )
object |
an epigraHMMDataSet |
peaks |
a GRanges object with differential peaks from 'callPeaks' |
hdf5 |
a character with the location of the epigraHMM HDF5 output file |
type |
a character string that defines which output will be givem (see details; default is 'all') |
fdr |
the desired fdr thresholding level to define combinatorial patterns |
pattern |
a string that explicitly specifies the combinatorial pattern to be output |
ranges |
a GRanges object with the genomic ranges to subset the output |
The output of 'callPatterns' is always restricted to genomic windows intersecting peaks.
If ‘type = ’all'‘, all windows’ posterior probabilities associated with all differential combinatorial patterns are returned. If ‘type = ’fdr'', users must also specify the input argument 'pattern' and this function will output windows wich are associated with the given 'pattern' that pass a particular fdr threshold level. If ‘type = ’max'', this function will output the combinatorial pattern which has the maximal posterior probability for each window. If ‘type = ’ranges'', the windows that are output are restricted to those that intersect the 'ranges' input argument.
A GRanges object with metadata
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- cbind(rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1)), rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1))) colData <- data.frame(condition = c('A','B'), replicate = c(1,1)) rowRanges <- GenomicRanges::GRanges('chrA', IRanges::IRanges(start = seq(1,by = 500, length.out = nrow(countData)),width = 500)) object <- epigraHMMDataSetFromMatrix(countData,colData,rowRanges = rowRanges) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'differential',dist = 'nb') # Calling peaks peaks <- callPeaks(object = object, hdf5 = S4Vectors::metadata(object)$output, method = 'viterbi') # Extracting posterior probabilities patterns <- callPatterns(object = object,peaks = peaks,type = 'max')
# Creating dummy object countData <- cbind(rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1)), rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1))) colData <- data.frame(condition = c('A','B'), replicate = c(1,1)) rowRanges <- GenomicRanges::GRanges('chrA', IRanges::IRanges(start = seq(1,by = 500, length.out = nrow(countData)),width = 500)) object <- epigraHMMDataSetFromMatrix(countData,colData,rowRanges = rowRanges) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'differential',dist = 'nb') # Calling peaks peaks <- callPeaks(object = object, hdf5 = S4Vectors::metadata(object)$output, method = 'viterbi') # Extracting posterior probabilities patterns <- callPatterns(object = object,peaks = peaks,type = 'max')
This function imports the output from 'epigraHMM' and outputs a set of peaks (consensus or differential) for a given FDR control threshold or Viterbi sequence.
callPeaks( object, hdf5 = metadata(object)$output, method = "viterbi", saveToFile = FALSE, control = NULL )
callPeaks( object, hdf5 = metadata(object)$output, method = "viterbi", saveToFile = FALSE, control = NULL )
object |
an epigraHMMDataSet |
hdf5 |
a character with the location of the epigraHMM HDF5 output file |
method |
either 'viterbi' or a numeric FDR control threshold (e.g. 0.05). Default is 'viterbi'. |
saveToFile |
a logical indicating whether or not to save the results to file. Output files are always saved with peaks of interest defined on the region level. Default is FALSE. |
control |
list of control arguments from controlEM(). This is an optional parameter and it is only required when 'saveToFile = TRUE' so that the output directory can be obtained. Default is NULL. |
A GRanges object with differential peak calls in BED 6+3 format
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) rowRanges <- GenomicRanges::GRanges('chrA', IRanges::IRanges(start = seq(from = 1, length.out = 4e3,by = 250),width = 250)) object <- epigraHMMDataSetFromMatrix(countData,colData,rowRanges) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') # Calling peaks peaks <- callPeaks(object = object, hdf5 = S4Vectors::metadata(object)$output, method = 'viterbi')
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) rowRanges <- GenomicRanges::GRanges('chrA', IRanges::IRanges(start = seq(from = 1, length.out = 4e3,by = 250),width = 250)) object <- epigraHMMDataSetFromMatrix(countData,colData,rowRanges) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') # Calling peaks peaks <- callPeaks(object = object, hdf5 = S4Vectors::metadata(object)$output, method = 'viterbi')
This function removes the effect from covariates of interest (such as GC content) from experimental counts
cleanCounts(object, effectNames, byNames = NULL, log = TRUE)
cleanCounts(object, effectNames, byNames = NULL, log = TRUE)
object |
an epigraHMMDataSet |
effectNames |
a character vector with the names of assays for which the effect will be removed from the experimental counts. Names in 'effectNames' must be assays stored in the epigraHMMDataSet 'object'. |
byNames |
a character vector with the name of an assay containing stratification variables which will be used to define stratum-specific effects. Examples of byNames assays include the 'peaks' assay from 'initializer()'. In this case, models will be fit separately for peaks and non-peaks regions. This can be useful for effects such as GC content, which are known to have a differential effect between peaks and non-peak regions. Default is NULL, i.e., effects will be removed without stratification. |
log |
a logical indicating if the effect from 'effectNames' should be log-transformed in the regression model (default is TRUE) |
An epigraHMMDataSet with an 'offset' assay filled in.
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object gc <- rbeta(3e3,50,50) countData <- list('counts' = rbind(matrix(rnbinom(2e3,mu = 7.5,size = 10),ncol = 1), matrix(rnbinom(3e3,mu = exp(0.5 + 8*gc),size = 5),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 10),ncol = 1)), 'gc' = matrix(c(rbeta(2e3,50,50),gc,rbeta(2e3,50,50)),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object = object,controlEM()) # Cleaning counts object <- cleanCounts(object = object,effectNames = 'gc',byNames = 'peaks') # Plotting the cleaned data #par(mfrow = c(2,1)) #smoothScatter(log1p(assay(object))~assay(object,'gc'),xlab = 'gc',ylab = 'log counts') #smoothScatter(as.numeric(log(assay(object)+1) - assay(object,'offsets'))~assay(object,'gc'), # xlab = 'gc',ylab = 'log cleaned counts')
# Creating dummy object gc <- rbeta(3e3,50,50) countData <- list('counts' = rbind(matrix(rnbinom(2e3,mu = 7.5,size = 10),ncol = 1), matrix(rnbinom(3e3,mu = exp(0.5 + 8*gc),size = 5),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 10),ncol = 1)), 'gc' = matrix(c(rbeta(2e3,50,50),gc,rbeta(2e3,50,50)),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object = object,controlEM()) # Cleaning counts object <- cleanCounts(object = object,effectNames = 'gc',byNames = 'peaks') # Plotting the cleaned data #par(mfrow = c(2,1)) #smoothScatter(log1p(assay(object))~assay(object,'gc'),xlab = 'gc',ylab = 'log counts') #smoothScatter(as.numeric(log(assay(object)+1) - assay(object,'offsets'))~assay(object,'gc'), # xlab = 'gc',ylab = 'log cleaned counts')
This function passes controlling parameters for the EM algorithm implemented in the epigraHMM package.
controlEM( epsilonEM = c(MRCPE = 0.001, MACPE = 0.001, ARCEL = 0.001), maxIterEM = 500, minIterEM = 3, gapIterEM = 3, maxCountEM = 3, maxDisp = 1000, criterion = "all", minZero = .Machine$double.xmin, probCut = 0.05, quiet = TRUE, maxIterInnerEM = 5, epsilonInnerEM = 0.001, trimOffset = 3, pattern = NULL, tempDir = tempdir(), fileName = "epigraHMM", pruningThreshold = NULL, quietPruning = TRUE )
controlEM( epsilonEM = c(MRCPE = 0.001, MACPE = 0.001, ARCEL = 0.001), maxIterEM = 500, minIterEM = 3, gapIterEM = 3, maxCountEM = 3, maxDisp = 1000, criterion = "all", minZero = .Machine$double.xmin, probCut = 0.05, quiet = TRUE, maxIterInnerEM = 5, epsilonInnerEM = 0.001, trimOffset = 3, pattern = NULL, tempDir = tempdir(), fileName = "epigraHMM", pruningThreshold = NULL, quietPruning = TRUE )
epsilonEM |
a named vector of positive values specifying up to four possible convergence criterion tolerances for the EM algorithm (see 'criterion' below). Default is c('MRCPE' = 1e-3, 'MACPE' = 1e-3,'ARCEL' = 1e-3). |
maxIterEM |
a positive integer giving the maximum number of EM iterations. Default is 500. |
minIterEM |
a positive integer giving the minimum number of EM iterations to start evaluating the convergence. Default is 3. |
gapIterEM |
a positive integer giving the number of EM iterations apart to compute the convergence criterion. Default is 3. |
maxCountEM |
a positive integer giving the number of consecutive EM iterations satisfying the convergence criterion in order to stop the algorithm. Default is 3. |
maxDisp |
a positive value for the upper limit constraint of the dispersion parameters. Default is 1000. |
criterion |
a character specifying the convergence criterion. Either "MRCPE" (maximum absolute relative change in parameter estimates), "MACPE" (maximum absolute change of parameter estimates), "ARCEL" (absolute relative change of the Q-function), or "all" (simultaneously check for MRCPE, MACPE, and ARCEL). Default is "all". |
minZero |
a positive value for the minimum positive value allowed in computations to avoid having zeros. Default is .Machine$double.xmin. |
probCut |
a number between 0 and 1 for the cutoff of the rejection controlled EM algorithm. Default 0.05. |
quiet |
a logical indicating whether to print messages. Default is TRUE. |
maxIterInnerEM |
a positive integer giving the maximum number of inner EM iterations. Default is 5. |
epsilonInnerEM |
a positive value with the convergence tolerance value for the inner EM algorithm. The criterion for the inner EM is "MRCPE". Default is 1e-3. |
trimOffset |
either NULL or a positive integer indicating the number of decimal places to be used in the offset. Default is 3. |
pattern |
either NULL (the default) or a list with length equal to the number of differential patterns to be modeled by the differential HMM state. See Details section below. |
tempDir |
a string where results will be saved. Default is 'tempdir()'. |
fileName |
a string with the name of the result files. Default is 'epigraHMM'. |
pruningThreshold |
a numeric value between 0 and 1 to consider when pruning rare combinatorial patterns. Default is NULL (see Details). |
quietPruning |
a logical indicating whether to print messages during the pruning step. Default is TRUE. |
If pattern
is NULL, every possible combinatorial pattern will be considered. If pattern
is a list, elements of it should specify the differential patterns to be modeled by each mixture component.
For instance, if pattern = list(2,c(1,3)) the mixture model will have two components that will represent the enrichment of condition 2 alone and the enrichment of conditions 1 and 3 together.
If pruningThreshold
is a value between 0 and 1, say 0.05, epigraHMM
will sequentially remove differential combinatorial patterns of enrichment
from any mixture model component with associated posterior mixture proportion
less than 0.05.
A list with components equal to the arguments
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# No more than 100 EM iterations control <- controlEM(maxIterEM = 100)
# No more than 100 EM iterations control <- controlEM(maxIterEM = 100)
This function runs either consensus (one condition, multiple samples) or differential (multiple conditions and samples) peak callers for epigenomic data.
epigraHMM(object, control, type, dist = "nb")
epigraHMM(object, control, type, dist = "nb")
object |
an epigraHMMDataSet |
control |
list of control arguments from |
type |
character, either |
dist |
character, either |
An epigraHMMDataSet object with the results from epigraHMM
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb')
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb')
This function creates a RangedSummarizedExperiment
object from of a set of BAM files.
It is used to store the input data, the model offsets, and the results from the peak calling algorithms.
epigraHMMDataSetFromBam( bamFiles, colData, genome, windowSize, gapTrack = TRUE, blackList = TRUE )
epigraHMMDataSetFromBam( bamFiles, colData, genome, windowSize, gapTrack = TRUE, blackList = TRUE )
bamFiles |
a string vector (or a list of string vectors) with the path for BAM files. If bamFiles is a list of string vectors, vectors must be named, have the same dimension, and, at least, a vector with name 'counts' must exist (see details). |
colData |
a |
genome |
either a single string with the name of the reference genome (e.g. 'hg19') or a GRanges object with ranges to be tilled into a set of non-overlapping windows. |
windowSize |
an integer specifying the size of genomic windows where read counts will be computed. |
gapTrack |
either a logical ( |
blackList |
either a logical ( |
The index ".bai" files must be stored in the same directory of their respective BAM files. The index files must be named after their respective BAM files with the additional ".bai" suffix.
‘epigraHMMDataSetFromBam' will store experimental data (e.g. ChIP-seq counts) from bamFiles (or bamFiles[[’counts']], if a list is provided). Additional data (e.g. input control counts) will be stored similarly with their respective list names.
By default, the function computes read counts using csaw's estimated fragment length via cross correlation analysis. For experimental counts (e.g. ChIP-seq), sequencing reads are shifted downstream half of the estimated fragment length. For additional counts (e.g. input control), sequencing reads are not shifted prior to counting.
Additional columns included in the colData input will be passed to the resulting epigraHMMDataSet assay and can be acessed via colData()
function.
The genome
argument will call GenomeInfoDb::Seqinfo() to fetch the chromosome lengths of the specified genome.
See ?GenomeInfoDb::Seqinfo for the list of UCSC genomes that are currently supported.
If gapTrack = TRUE
and the name of a reference genome is passed as input through genome
(e.g. 'hg19'),
the function will discard any genomic coordinate overlapping regions specified by the respective UCSC gap table.
If gapTrack
is a GRanges object, the function will discard any genomic coordinate overlaping regions from gapTrack
.
If blackList = TRUE
and the name of a reference genome is passed as input through genome
(e.g. 'hg19'),
The function will fetch the manually curated blacklist tracks (Version 2) from https://github.com/Boyle-Lab/Blacklist/tree/master/lists.
Current available genomes are ce10, dm3, hg19, hg38, and mm10.
If blackList
is a GRanges object, the function will discard any genomic coordinate overlaping regions from blackList
.
An epigraHMMDataSet object with sorted colData regarding conditions and replicates. Experimental counts will be stored in the 'counts' assay in the resulting epigraHMMDataSet object. Additional experimental data will be stored with their respective names from the list bamFiles.
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM DOI: 10.1093/nar/gkv1191 DOI: 10.1038/s41598-019-45839-z DOI: 10.1038/nature11247
bamFiles <- system.file("extdata","euratrans", "lv-H3K27me3-SHR-male-bio2-tech1.bam", package="chromstaRData") colData <- data.frame(condition = 'SHR', replicate = 1) object <- epigraHMMDataSetFromBam(bamFiles = bamFiles, colData = colData, genome = 'rn4', windowSize = 25000, gapTrack = TRUE, blackList = TRUE)
bamFiles <- system.file("extdata","euratrans", "lv-H3K27me3-SHR-male-bio2-tech1.bam", package="chromstaRData") colData <- data.frame(condition = 'SHR', replicate = 1) object <- epigraHMMDataSetFromBam(bamFiles = bamFiles, colData = colData, genome = 'rn4', windowSize = 25000, gapTrack = TRUE, blackList = TRUE)
This function creates a RangedSummarizedExperiment
object from matrices of counts.
It is used to store the input data, the model offsets, and the results from the peak calling algorithms.
epigraHMMDataSetFromMatrix(countData, colData, rowRanges = NULL)
epigraHMMDataSetFromMatrix(countData, colData, rowRanges = NULL)
countData |
a matrix (or a list of matrices). If countData is a list of matrices, matrices must be named, have the same dimensions, and, at least, a matrix with name 'counts' must exist (see details). |
colData |
a |
rowRanges |
an optional GRanges object with the genomic coordinates of the |
Additional columns included in the colData input will be passed to the resulting epigraHMMDataSet assay and can be acessed via colData()
function.
An epigraHMMDataSet object with sorted colData regarding conditions and replicates. Experimental counts will be stored in the 'counts' assay in the resulting epigraHMMDataSet object. If ‘countData' is a list of matrices, the resulting ’counts' assay will be equal to ‘countData[[’counts']]'.
Additional matrices can be included in the epigraHMMDataSet. For example, if one wants to include counts from an input control experiment from ‘countData[[’controls']]‘, an assay ’control' will be added to the resulting epigraHMMDataSet..
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
countData <- list('counts' = matrix(rpois(4e5,10),ncol = 4), 'controls' = matrix(rpois(4e5,5),ncol = 4)) colData <- data.frame(condition = c('A','A','B','B'), replicate = c(1,2,1,2)) object <- epigraHMMDataSetFromMatrix(countData,colData)
countData <- list('counts' = matrix(rpois(4e5,10),ncol = 4), 'controls' = matrix(rpois(4e5,5),ncol = 4)) colData <- data.frame(condition = c('A','A','B','B'), replicate = c(1,2,1,2)) object <- epigraHMMDataSetFromMatrix(countData,colData)
This function estimates the transition probabilities for a k-state Markov chain based on a sequence of integers that represent states of the chain
estimateTransitionProb(chain, numStates)
estimateTransitionProb(chain, numStates)
chain |
a vector of integers |
numStates |
an integer, the number of states in the Markov chain |
A k-by-k matrix of transition probabilities, such that k is the number of states of the chain
https://github.com/plbaldoni/epigraHMM
trueMat <- matrix(c(0.9,0.1,0.1,0.9),2,2) simChain <- simulateMarkovChain(trueMat,1e3) estMat <- estimateTransitionProb(simChain,2) # estMat should be close to trueMat estMat
trueMat <- matrix(c(0.9,0.1,0.1,0.9),2,2) simChain <- simulateMarkovChain(trueMat,1e3) estMat <- estimateTransitionProb(simChain,2) # estMat should be close to trueMat estMat
E-step of HMM (forward-backward probability + posterior probability calculation)
expStep(pi, gamma, logf, hdf5)
expStep(pi, gamma, logf, hdf5)
pi |
a vector of probabilities (sum of probabilities should sum to one) |
gamma |
a matrix of transition probabilities (row sums should be one) |
logf |
a matrix of observed log-likelihood values. Columns represent hidden states, rows represent genomic regions |
hdf5 |
path to where the hdf5 is saved |
#Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) #Initializing object <- initializer(object,controlEM()) #Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') #Example expStep(pi = c(0.99,0.02), gamma = matrix(c(0.99,0.01,0.01,0.99),nrow = 2), logf = cbind(dnbinom(rnbinom(100,mu = 2,size = 10),mu = 2,size = 10,log = TRUE), dnbinom(rnbinom(100,mu = 7.5,size = 5),mu = 7.5,size = 5,log = TRUE)), hdf5 = file.path(tempdir(),'tmp.h5'))
#Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) #Initializing object <- initializer(object,controlEM()) #Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') #Example expStep(pi = c(0.99,0.02), gamma = matrix(c(0.99,0.01,0.01,0.99),nrow = 2), logf = cbind(dnbinom(rnbinom(100,mu = 2,size = 10),mu = 2,size = 10,log = TRUE), dnbinom(rnbinom(100,mu = 7.5,size = 5),mu = 7.5,size = 5,log = TRUE)), hdf5 = file.path(tempdir(),'tmp.h5'))
Data from EZH2, H3K27me3, and H3K36me3 ChIP-seq data from Helas3 cell line. For illustrative purposes, the data has been subset to chromosome 19. The dataset contains two replicates from each mark.
data(helas3)
data(helas3)
An object of class "epigraHMMDataSet"
.
Davis et al. (2018) NAR 46(D1):D794-D801. (PubMed)
# # # The data 'helas3' was created as follows. # options(timeout=9999999) # # url <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/' # samples <- c('wgEncodeBroadHistoneHelas3H3k36me3StdAlnRep1.bam', # 'wgEncodeBroadHistoneHelas3H3k36me3StdAlnRep2.bam', # 'wgEncodeBroadHistoneHelas3H3k27me3StdAlnRep1.bam', # 'wgEncodeBroadHistoneHelas3H3k27me3StdAlnRep2.bam', # 'wgEncodeBroadHistoneHelas3Ezh239875AlnRep1.bam', # 'wgEncodeBroadHistoneHelas3Ezh239875AlnRep2.bam') # # input <- paste0(url,samples) # output <- paste0(tempdir(),samples) # # for(idx in seq_len(length(input))){ # download.file(url = input[idx],destfile = output[idx]) # download.file(url = paste0(input[idx],'.bai'), # destfile = paste0(output[idx],'.bai')) # } # # gr <- segmentGenome(genome = 'hg19', # window = 1000,rm.gap = TRUE,rm.blacklist = TRUE) # # cData <- data.frame(condition = rep(c('H3K36me3','H3K27me3','EZH2'),each = 2), # replicate = rep(c(1,2),times = 3)) # # subGr <- gr[seqnames(gr) == 'chr19' & start(gr) >= 40e6 & end(gr) <= 50e6] # # helas3 <- # epigraHMMDataSetFromBam(bamFiles = output,colData = cData, # genome = subGr,windowSize = 1000) data(helas3) helas3
# # # The data 'helas3' was created as follows. # options(timeout=9999999) # # url <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/' # samples <- c('wgEncodeBroadHistoneHelas3H3k36me3StdAlnRep1.bam', # 'wgEncodeBroadHistoneHelas3H3k36me3StdAlnRep2.bam', # 'wgEncodeBroadHistoneHelas3H3k27me3StdAlnRep1.bam', # 'wgEncodeBroadHistoneHelas3H3k27me3StdAlnRep2.bam', # 'wgEncodeBroadHistoneHelas3Ezh239875AlnRep1.bam', # 'wgEncodeBroadHistoneHelas3Ezh239875AlnRep2.bam') # # input <- paste0(url,samples) # output <- paste0(tempdir(),samples) # # for(idx in seq_len(length(input))){ # download.file(url = input[idx],destfile = output[idx]) # download.file(url = paste0(input[idx],'.bai'), # destfile = paste0(output[idx],'.bai')) # } # # gr <- segmentGenome(genome = 'hg19', # window = 1000,rm.gap = TRUE,rm.blacklist = TRUE) # # cData <- data.frame(condition = rep(c('H3K36me3','H3K27me3','EZH2'),each = 2), # replicate = rep(c(1,2),times = 3)) # # subGr <- gr[seqnames(gr) == 'chr19' & start(gr) >= 40e6 & end(gr) <= 50e6] # # helas3 <- # epigraHMMDataSetFromBam(bamFiles = output,colData = cData, # genome = subGr,windowSize = 1000) data(helas3) helas3
This function returns the BIC and expected log-likelihood function of the model, with respect to the last conditional distribution of unknown enrichment peaks given the data. The latter is also known as 'Q-function' in the EM context.
info(object)
info(object)
object |
an epigraHMMDataSet |
A list with BIC, and expected log-likelihood function of the model. If the input object contains results from a differential analysis, 'info' will also output the enrichment patterns associated with each mixture component used in the mixture model.
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') # Get info info(object)
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') # Get info info(object)
This function call enriched windows individually for each sample in an epigraHMMDataSet. These are then used for initializing purposes in epigraHMM. By default, the Viterbi algorithm is used to determine enriched windows. Input controls and normalizing offsets are not utilized in this initialization step.
initializer(object, control)
initializer(object, control)
object |
an epigraHMMDataSet |
control |
list of control arguments from controlEM() |
To be added
An epigraHMMDataSet with a 'peaks' assay filled in.
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object,controlEM()) # Visualizing initialization peaks #plot(assay(object),type = 'l') #lines(7.5*assay(object,'peaks'),col = 'red')
# Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) # Initializing object <- initializer(object,controlEM()) # Visualizing initialization peaks #plot(assay(object),type = 'l') #lines(7.5*assay(object,'peaks'),col = 'red')
M-step (maximization w.r.t. initial and transition probabilities)
maxStepProb(hdf5)
maxStepProb(hdf5)
hdf5 |
path to where the hdf5 is saved |
#Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) #Initializing object <- initializer(object,controlEM()) #Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') #Example maxStepProb(hdf5 = S4Vectors::metadata(object)$output)
#Creating dummy object countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(2e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) #Initializing object <- initializer(object,controlEM()) #Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'consensus',dist = 'nb') #Example maxStepProb(hdf5 = S4Vectors::metadata(object)$output)
This function performs a non-linear normalization of counts with respect to a reference sample (geometric mean)
normalizeCounts(object, control, span = 1, ...)
normalizeCounts(object, control, span = 1, ...)
object |
an epigraHMMDataSet |
control |
list of control arguments from controlEM() |
span |
the span parameter of |
... |
arguments to be passed to |
This function 'limma::loessFit', which simply a wrapper for the 'stats::lowess' smoother.
An epigraHMMDataSet with an 'offsets' assay filled in.
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- list('counts' = matrix(rpois(1e5,10),ncol = 2), 'controls' = matrix(rpois(1e5,5),ncol = 2)) colData <- data.frame(condition = c('A','A'), replicate = c(1,2)) object <- epigraHMMDataSetFromMatrix(countData,colData) # Normalizing counts object <- normalizeCounts(object = object,control = controlEM(), span = 1)
# Creating dummy object countData <- list('counts' = matrix(rpois(1e5,10),ncol = 2), 'controls' = matrix(rpois(1e5,5),ncol = 2)) colData <- data.frame(condition = c('A','A'), replicate = c(1,2)) object <- epigraHMMDataSetFromMatrix(countData,colData) # Normalizing counts object <- normalizeCounts(object = object,control = controlEM(), span = 1)
'plotCounts()' plots read counts and peak regions from 'epigraHMM()'
plotCounts( object, ranges, hdf5 = metadata(object)$output, peaks = NULL, annotation = NULL )
plotCounts( object, ranges, hdf5 = metadata(object)$output, peaks = NULL, annotation = NULL )
object |
an epigraHMMDataSet |
ranges |
a GRanges object or a pair of integers with the genomic corrdinates/windows to be plotted |
hdf5 |
an optional character string with the hdf5 file path from 'epigraHMM' |
peaks |
an optional parameter with a GRanges object or a vector of logicals (with length equal to the number of rows in 'object') specifying the genomic corrdinates/windows with peaks |
annotation |
an optional parameter with a GRanges object or a vector of logicals (with length equal to the number of rows in 'object') specifying the genomic corrdinates/windows of an annotation track |
If the input object contains the assay 'offset', reads will be normalized prior to plotting (e.g. counts/exp(offset)). Reads from replicates pertaining to the same condition are aggregated prior to plotting.
A ggplot
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(1e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) plotCounts(object,ranges = c(500,3500))
countData <- rbind(matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1), matrix(rnbinom(1e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 7.5,size = 5),ncol = 1), matrix(rnbinom(1e3,mu = 2,size = 10),ncol = 1)) colData <- data.frame(condition = 'A', replicate = 1) object <- epigraHMMDataSetFromMatrix(countData,colData) plotCounts(object,ranges = c(500,3500))
'plotPatterns()' plots the posterior probabilities associated with differential patterns from a differential analysis of 'epigraHMM()'
plotPatterns( object, ranges, peaks, hdf5 = metadata(object)$output, colors = NULL )
plotPatterns( object, ranges, peaks, hdf5 = metadata(object)$output, colors = NULL )
object |
an epigraHMMDataSet |
ranges |
a GRanges object or a pair of integers with the genomic corrdinates/windows to be plotted |
peaks |
either a GRanges object or a vector of logicals (with length equal to the number of rows in 'object') specifying the genomic corrdinates/windows with peaks |
hdf5 |
a character string with the hdf5 file path from 'epigraHMM' |
colors |
an optional argument that specifies the colors for each differential combinatorial pattern |
A pheatmat
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
# Creating dummy object countData <- cbind(rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1)), rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1))) colData <- data.frame(condition = c('A','B'), replicate = c(1,1)) rowRanges <- GenomicRanges::GRanges('chrA', IRanges::IRanges(start = seq(1,by = 500, length.out = nrow(countData)),width = 500)) object <- epigraHMMDataSetFromMatrix(countData,colData,rowRanges = rowRanges) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'differential',dist = 'nb') # Calling peaks peaks <- callPeaks(object = object, hdf5 = S4Vectors::metadata(object)$output, method = 'viterbi') # Plotting patterns plotPatterns(object, ranges = peaks[1], peaks = peaks)
# Creating dummy object countData <- cbind(rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1)), rbind(matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1), matrix(rnbinom(1e2, mu = 10, size = 5), ncol = 1), matrix(rnbinom(1e2, mu = 1, size = 10), ncol = 1))) colData <- data.frame(condition = c('A','B'), replicate = c(1,1)) rowRanges <- GenomicRanges::GRanges('chrA', IRanges::IRanges(start = seq(1,by = 500, length.out = nrow(countData)),width = 500)) object <- epigraHMMDataSetFromMatrix(countData,colData,rowRanges = rowRanges) # Initializing object <- initializer(object,controlEM()) # Running epigraHMM object <- epigraHMM(object,controlEM(),type = 'differential',dist = 'nb') # Calling peaks peaks <- callPeaks(object = object, hdf5 = S4Vectors::metadata(object)$output, method = 'viterbi') # Plotting patterns plotPatterns(object, ranges = peaks[1], peaks = peaks)
This function segments a genome into non-overlapping windows.
segmentGenome(genome, window, rm.gap = TRUE, rm.blacklist = TRUE)
segmentGenome(genome, window, rm.gap = TRUE, rm.blacklist = TRUE)
genome |
a string with the name of the genome (e.g. 'hg19') |
window |
an integer with the window size |
rm.gap |
a logical indicating gap regions should be removed |
rm.blacklist |
a logical indicating blacklisted regions should be removed |
a GRanges object with the binned genome
Pedro L. Baldoni, [email protected]
https://github.com/plbaldoni/epigraHMM
gr <- segmentGenome(genome = 'mm10', window = 500)
gr <- segmentGenome(genome = 'mm10', window = 500)
Simulates a Markov Chain of length 'n' given a matrix of transition probabilities P
simulateMarkovChain(P, n)
simulateMarkovChain(P, n)
P |
a matrix of transition probabilities (row sums should be 1) |
n |
an integer specifying thhe length of the simulated sequence |
#Example simulateMarkovChain(matrix(c(0.99,0.01,0.01,0.99),2,2),100)
#Example simulateMarkovChain(matrix(c(0.99,0.01,0.01,0.99),2,2),100)