| Title: | HiCPotts: Hierarchical Modeling to Identify and Correct Genomic Biases in Hi-C |
|---|---|
| Description: | The HiCPotts package provides a comprehensive Bayesian framework for analyzing Hi-C interaction data, integrating both spatial and genomic biases within a probabilistic modeling framework. At its core, HiCPotts leverages the Potts model (Wu, 1982)—a well-established graphical model—to capture and quantify spatial dependencies across interaction loci arranged on a genomic lattice. By treating each interaction as a spatially correlated random variable, the Potts model enables robust segmentation of the genomic landscape into meaningful components, such as noise, true signals, and false signals. To model the influence of various genomic biases, HiCPotts employs a regression-based approach incorporating multiple covariates: Genomic distance (D): The distance between interacting loci, recognized as a fundamental driver of contact frequency. GC-content (GC): The local GC composition around the interacting loci, which can influence chromatin structure and interaction patterns. Transposable elements (TEs): The presence and abundance of repetitive elements that may shape contact probability through chromatin organization. Accessibility score (Acc): A measure of chromatin openness, informing how accessible certain genomic regions are to interaction. By embedding these covariates into a hierarchical mixture model, HiCPotts characterizes each interaction’s probability of belonging to one of several latent components. The model parameters, including regression coefficients, zero-inflation parameters (for ZIP/ZINB distributions), and dispersion terms (for NB/ZINB distributions), are inferred via a MCMC sampler. This algorithm draws samples from the joint posterior distribution, allowing for flexible posterior inference on model parameters and hidden states. From these posterior samples, HiCPotts computes posterior means of regression parameters and other quantities of interest. These posterior estimates are then used to calculate the posterior probabilities that assign each interaction to a specific component. The resulting classification sheds light on the underlying structure: distinguishing genuine high-confidence interactions (signal) from background noise and potential false signals, while simultaneously quantifying the impact of genomic biases on observed interaction frequencies. In summary, HiCPotts seamlessly integrates spatial modeling, bias correction, and probabilistic classification into a unified Bayesian inference framework. It provides rich posterior summaries and interpretable, model-based assignments of interaction states, enabling researchers to better understand the interplay between genomic organization, biases, and spatial correlation in Hi-C data. |
| Authors: | Itunu. Godwin Osuntoki [aut, cre] (ORCID: <https://orcid.org/0009-0005-1037-9346>), Nicolae. Radu Zabet [aut] |
| Maintainer: | Itunu. Godwin Osuntoki <[email protected]> |
| License: | GPL-3 | file LICENSE |
| Version: | 1.3.0 |
| Built: | 2026-06-02 08:30:39 UTC |
| Source: | https://github.com/bioc/HiCPotts |
The HiCPotts package provides a comprehensive Bayesian framework for analyzing Hi-C interaction data, integrating both spatial and genomic biases within a probabilistic modeling framework. At its core, HiCPotts leverages the Potts model (Wu, 1982)—a well-established graphical model—to capture and quantify spatial dependencies across interaction loci arranged on a genomic lattice. By treating each interaction as a spatially correlated random variable, the Potts model enables robust segmentation of the genomic landscape into meaningful components, such as noise, true signals, and false signals. To model the influence of various genomic biases, HiCPotts employs a regression-based approach incorporating multiple covariates: Genomic distance (D): The distance between interacting loci, recognized as a fundamental driver of contact frequency. GC-content (GC): The local GC composition around the interacting loci, which can influence chromatin structure and interaction patterns. Transposable elements (TEs): The presence and abundance of repetitive elements that may shape contact probability through chromatin organization. Accessibility score (Acc): A measure of chromatin openness, informing how accessible certain genomic regions are to interaction. By embedding these covariates into a hierarchical mixture model, HiCPotts characterizes each interaction’s probability of belonging to one of several latent components. The model parameters, including regression coefficients, zero-inflation parameters (for ZIP/ZINB distributions), and dispersion terms (for NB/ZINB distributions), are inferred via a MCMC sampler. This algorithm draws samples from the joint posterior distribution, allowing for flexible posterior inference on model parameters and hidden states. From these posterior samples, HiCPotts computes posterior means of regression parameters and other quantities of interest. These posterior estimates are then used to calculate the posterior probabilities that assign each interaction to a specific component. The resulting classification sheds light on the underlying structure: distinguishing genuine high-confidence interactions (signal) from background noise and potential false signals, while simultaneously quantifying the impact of genomic biases on observed interaction frequencies. In summary, HiCPotts seamlessly integrates spatial modeling, bias correction, and probabilistic classification into a unified Bayesian inference framework. It provides rich posterior summaries and interpretable, model-based assignments of interaction states, enabling researchers to better understand the interplay between genomic organization, biases, and spatial correlation in Hi-C data.
Maintainer: Itunu. Godwin Osuntoki [email protected] (ORCID)
Authors:
Nicolae. Radu Zabet [email protected]
Useful links:
Report bugs at https://github.com/igosungithub/HiCPotts/issues
This function computes the posterior probabilities of assigning genomic interactions to each of three mixture components (or states) in a Hidden Markov Random Field (HMRF) model. It uses the posterior means of regression parameters obtained from MCMC simulations and combines these with user-specified distributions (zero-inflated or standard) to produce probabilities for each observed interaction.
compute_HMRFHiC_probabilities(data, chain_betas, iterations, dist = "ZINB", max_interactions = NA_integer_, consistent_dist = TRUE)compute_HMRFHiC_probabilities(data, chain_betas, iterations, dist = "ZINB", max_interactions = NA_integer_, consistent_dist = TRUE)
data |
A
The |
chain_betas |
A list of MCMC chain results generated by the HiCPotts model-fitting procedure. Each element should correspond to one chain and must include:
|
iterations |
An integer specifying the number of total MCMC iterations. The function uses half of these as burn-in (i.e., |
dist |
A character string indicating the chosen distribution for modeling interaction counts. One of:
The default is |
max_interactions |
Optional cap on interactions before density eval. Defaults to NA. |
consistent_dist |
Logical; if TRUE, all three components use the same distribution. |
This function is part of the HiCPotts pipeline that models genomic interactions (e.g., Hi-C interaction counts) using a mixture model approach. The model typically considers three components (or latent states), each characterized by a distinct mean-structure and potentially different zero-inflation or overdispersion parameters, depending on the chosen distribution.
The function:
Extracts posterior means of regression parameters from MCMC chains, discarding the initial half of the samples as burn-in.
Estimates mean interaction intensities () for each component using log-linear models with covariates: distance (log of |end-start|),GC, TES, and ACC (each transformed by a log() operation).
Given the specified distribution (dist), calculates the probability (on the natural scale) of observing the recorded interaction count for each component.
Normalizes these probabilities so that each interaction is assigned a set of three probabilities summing to 1.
For zero-inflated distributions ("ZIP", "ZINB"), a parameter captures the probability of an excess zero.
For negative binomial distributions ("NB", "ZINB"), an overdispersion parameter is included.
The computed probabilities can be used for downstream analysis, such as segmenting interactions into classes or modeling spatial dependence in a hidden Markov field.
A data.frame containing the original input columns plus three new columns:
prob1: The posterior probability that the interaction belongs to component 1.
prob2: The posterior probability that the interaction belongs to component 2.
prob3: The posterior probability that the interaction belongs to component 3.
The returned data frame thus provides a probabilistic classification of each observed interaction into one of the three modeled components.
dpois, dnbinom, for probability calculations.
# # # Synthetic data set.seed(123) large_data <- data.frame( start = c(1, 10, 20), end = c(5, 15, 30), interactions = c(10, 20, 30), GC = c(0.5, 0.8, 0.3), TES = c(0.2, 0.5, 0.7), ACC = c(0.9, 0.4, 0.6) ) chain_betas <- list( list( chains = list( matrix(runif(30, 0.1, 1), ncol = 5), matrix(runif(30, 0.1, 1), ncol = 5), matrix(runif(30, 0.1, 1), ncol = 5) ), theta = runif(6, 0.1, 0.9), size = matrix(runif(18, 1, 10), nrow = 3) ) ) result <- compute_HMRFHiC_probabilities( data = large_data, chain_betas = chain_betas, iterations = 5, dist = "Poisson", max_interactions = NA_integer_, consistent_dist = TRUE ) print(result) # See vignette("HiCPotts_vignette") for detailed examples with real Hi-C data. # ## # # Synthetic data set.seed(123) large_data <- data.frame( start = c(1, 10, 20), end = c(5, 15, 30), interactions = c(10, 20, 30), GC = c(0.5, 0.8, 0.3), TES = c(0.2, 0.5, 0.7), ACC = c(0.9, 0.4, 0.6) ) chain_betas <- list( list( chains = list( matrix(runif(30, 0.1, 1), ncol = 5), matrix(runif(30, 0.1, 1), ncol = 5), matrix(runif(30, 0.1, 1), ncol = 5) ), theta = runif(6, 0.1, 0.9), size = matrix(runif(18, 1, 10), nrow = 3) ) ) result <- compute_HMRFHiC_probabilities( data = large_data, chain_betas = chain_betas, iterations = 5, dist = "Poisson", max_interactions = NA_integer_, consistent_dist = TRUE ) print(result) # See vignette("HiCPotts_vignette") for detailed examples with real Hi-C data. # #
This function generates a prior value for the interaction parameter in a Potts model from a Beta distribution. The Potts model is a spatial statistical model where the interaction parameter influences the tendency of neighboring sites on a lattice to take on similar states. By drawing the interaction parameter from a Beta distribution, we impose a prior belief on the range and likely values of this parameter.
gamma_prior_value()gamma_prior_value()
The function samples a single random value from a Beta(2, 2) distribution. This distribution places
more mass towards the higher end of the \[0, 1\] interval (since Beta(2, 2) is skewed towards 1), indicating
a prior belief that the interaction parameter is likely to encourage some degree of spatial clustering.
A numeric value between \[0,1\] representing the prior draw for the interaction parameter. This is a random draw from the specified Beta distribution.
# # Generate a single prior value for the Potts model interaction parameter prior_val <- gamma_prior_value() #prior_val# # Generate a single prior value for the Potts model interaction parameter prior_val <- gamma_prior_value() #prior_val
Reads a Hi-C contact matrix from a .hic, .cool, or .h5 file,
subsets to a specified genomic region, rebins to a desired resolution (merging or
slicing as needed), and returns a complete grid of bin-pair interactions. For each
bin-pair it computes:
GC content from a BSgenome
DNase-I accessibility (if provided, lifted over)
Transposable element (TE) overlap counts
Interaction counts
get_data( file_path, chr, start, end, resolution, genome_package, acc_wig = NULL, chain_file = NULL, te_granges = NULL )get_data( file_path, chr, start, end, resolution, genome_package, acc_wig = NULL, chain_file = NULL, te_granges = NULL )
file_path |
Path to the Hi-C file ( |
chr |
Chromosome name (e.g. |
start |
1-based start coordinate of the region. |
end |
End coordinate of the region. |
resolution |
Target bin size (bp); smaller native bins are merged, larger ones are sliced. |
genome_package |
Name of a BSgenome package (e.g.
|
acc_wig |
(Optional) Path to a DNase-I wig (bedGraph) file. Requires
|
chain_file |
(Optional) LiftOver chain file for mapping |
te_granges |
(Optional) Path to a BED/GTF of TE annotations or a
|
A data.frame with one row per bin-pair across [start,end] at
resolution, containing:
startBin1 start.
end.i.Bin1 end.
start.j.Bin2 start.
endBin2 end.
chromChromosome name.
GCCombined GC fraction of the two bins.
ACCMean DNase-I score per bin or NA.
TESSum of TE overlaps in both bins or NA.
interactionsObserved contact count (0 if absent).
wig <- system.file("extdata", "DNaseI_BG3_gr_chr4.bedGraph", package = "HiCPotts") chain <- system.file("extdata", "dm3ToDm6_chr4_only.chain", package = "HiCPotts") te <- system.file("extdata", "dm6_TEs_chr4.gtf", package = "HiCPotts") hic <- system.file("extdata", "BG3_WT_merged_hic_matrix_chr4_100Kb.cool", package = "HiCPotts") bb <- get_data( file_path = hic, chr = "chr4", start = 1, end = 400000, resolution = 200000, genome_package = "BSgenome.Dmelanogaster.UCSC.dm6", acc_wig = wig, chain_file = chain, te_granges = te )wig <- system.file("extdata", "DNaseI_BG3_gr_chr4.bedGraph", package = "HiCPotts") chain <- system.file("extdata", "dm3ToDm6_chr4_only.chain", package = "HiCPotts") te <- system.file("extdata", "dm6_TEs_chr4.gtf", package = "HiCPotts") hic <- system.file("extdata", "BG3_WT_merged_hic_matrix_chr4_100Kb.cool", package = "HiCPotts") bb <- get_data( file_path = hic, chr = "chr4", start = 1, end = 400000, resolution = 200000, genome_package = "BSgenome.Dmelanogaster.UCSC.dm6", acc_wig = wig, chain_file = chain, te_granges = te )
This function computes the likelihood contribution of a specific model component using a chosen distribution. It is designed to work within a mixture-modeling or hierarchical modeling framework, where each interaction (such as genomic interactions in a Hi-C experiment) can be modeled by a combination of regression parameters and potentially zero-inflation or overdispersion parameters.
likelihood_combined(pred_combined, params, z, y, x_vars, component, theta, size, N, dist)likelihood_combined(pred_combined, params, z, y, x_vars, component, theta, size, N, dist)
pred_combined |
A numeric vector of predictor values obtained from a prediction function. This
typically represents the linear predictor |
params |
A numeric vector of parameters associated with the model’s linear predictors.
Typically includes intercepts and regression coefficients related to the covariates in |
z |
A matrix or array representing the latent state indicators or other structural variables
that influence the model. |
y |
A numeric vector of observed interaction counts (the response variable). Each element corresponds to one observation (e.g., interaction count between a pair of genomic loci). |
x_vars |
A list of covariates used as predictors in the linear model. Each element in the list
corresponds to a covariate vector, and all must be of length |
component |
An integer or factor specifying which component of the mixture model is currently
being evaluated. In the 3-component mixture, this is |
theta |
(Optional) A numeric value for the zero-inflation parameter. Required for ZIP and ZINB models. This parameter controls the probability of excess zeros not explained by the Poisson or NB component. |
size |
(Optional) A numeric value for the size (overdispersion) parameter. Required for NB and ZINB models. This parameter captures variance that exceeds that of a Poisson distribution. |
N |
An integer specifying the number of observations. This
should match the length of the response variable |
dist |
A character string specifying the distribution to be used for modeling the interaction counts. Options may include:
|
This function calculates the likelihood for a single component given a set of parameters and covariates. The steps typically involved are:
Compute the linear predictor from the supplied parameters and covariates.
The pred_combined represents .
Depending on dist, convert the linear predictor into a mean parameter (e.g., for Poisson or NB).
Compute the likelihood of each observed count y[i] under the chosen distribution with
the given parameters (, , , etc.).
For zero-inflated models (ZIP, ZINB), the likelihood incorporates the probability of an extra zero.
Return the computed likelihood values. This is used internally to evaluate and update model parameters during the MCMC steps.
The function is a building block in a larger modeling framework (MCMC inference for mixture models). While end-users might not call it directly, it enables flexible specification of distributions and model components for complex hierarchical models.
The function returns the computed likelihood under the specified model setup. It returns a numeric vector of likelihood contributions for each observation.
dpois, dnbinom for related probability mass functions.
# Example setup N <- 3 z <- matrix(c(1, 1, 2, 3, 1, 2, 3, 1, 2), nrow = 3, byrow = TRUE) y <- matrix(c(0, 3, 5, 1, 0, 4, 6, 2, 0), nrow = 3, byrow = TRUE) params <- c(0.1, 0.2, 0.3, 0.4, 0.5) x_vars <- x_vars <- list( list(matrix(runif(9, 1, 10), nrow = N)), # first covariate list(matrix(runif(9, 1, 10), nrow = N)), # second covariate list(matrix(runif(9, 1, 10), nrow = N)), # third covariate list(matrix(runif(9, 1, 10), nrow = N)) # fourth covariate ) theta <- 0.2 size <- 10 #' # Compute likelihood under a Poisson model, component 1 ll_values <- likelihood_combined( pred_combined = pred_combined, params = params, z = z, y = y, x_vars = x_vars, component = 1, theta, size, N = N, dist = "Poisson" ) #print(sum(ll_values)) # sum of likelihood contributions# Example setup N <- 3 z <- matrix(c(1, 1, 2, 3, 1, 2, 3, 1, 2), nrow = 3, byrow = TRUE) y <- matrix(c(0, 3, 5, 1, 0, 4, 6, 2, 0), nrow = 3, byrow = TRUE) params <- c(0.1, 0.2, 0.3, 0.4, 0.5) x_vars <- x_vars <- list( list(matrix(runif(9, 1, 10), nrow = N)), # first covariate list(matrix(runif(9, 1, 10), nrow = N)), # second covariate list(matrix(runif(9, 1, 10), nrow = N)), # third covariate list(matrix(runif(9, 1, 10), nrow = N)) # fourth covariate ) theta <- 0.2 size <- 10 #' # Compute likelihood under a Poisson model, component 1 ll_values <- likelihood_combined( pred_combined = pred_combined, params = params, z = z, y = y, x_vars = x_vars, component = 1, theta, size, N = N, dist = "Poisson" ) #print(sum(ll_values)) # sum of likelihood contributions
This function computes the matrix of probabilities (or likelihood values) associated with the interaction parameter in the Potts model. The Potts model is a generalization of the Ising model and is commonly used in spatial statistics, image analysis, and other fields where data can be represented on a lattice. The interaction parameter controls how neighboring sites on the lattice influence each other.
likelihood_gamma(x, pair_neighbours_DA_x1, N)likelihood_gamma(x, pair_neighbours_DA_x1, N)
x |
A numeric vector representing the interaction parameter(s) for the Potts model. This could be a single parameter repeated for each pair of neighbors. |
pair_neighbours_DA_x1 |
A numeric vector of the same length as |
N |
An integer specifying the dimension of the Potts lattice. The returned matrix will be |
The Potts model describes configurations of states on a lattice, where the interaction parameter x
influences how often neighboring sites are in the same state. The likelihood or probability of a
configuration is determined by exponentials of the interaction terms between neighboring sites.
This function:
Multiplies the interaction parameter vector x by the vector pair_neighbours_DA_x1, which encodes
neighbor relationships or their contributions.
Applies a stability adjustment by subtracting the maximum value (max_val) from each term
to avoid numerical overflow.
Exponentiates these adjusted values and normalizes them to obtain a set of values that sum to 1, interpreting them as probabilities or likelihood contributions.
Reshapes these values into an N x N matrix, potts_DA, which represents
the spatial layout of likelihoods (or probabilities) under the Potts model.
A numeric N x N matrix of probabilities corresponding to the Potts model likelihood
for the given interaction parameter. Each element of the matrix represents the likelihood contribution of
that site/pair under the model parameterized by x.
# Suppose we have an N=4 lattice and a vector x and pair_neighbours_DA_x1 of length 16 N <- 4 x <- rep(0.5, N * N) # interaction parameter repeated for each pair pair_neighbours_DA_x1 <- rnorm(N * N) # random neighbor influences # Compute the Potts DA matrix potts_matrix <- likelihood_gamma(x, pair_neighbours_DA_x1, N) #print(potts_matrix)# Suppose we have an N=4 lattice and a vector x and pair_neighbours_DA_x1 of length 16 N <- 4 x <- rep(0.5, N * N) # interaction parameter repeated for each pair pair_neighbours_DA_x1 <- rnorm(N * N) # random neighbor influences # Compute the Potts DA matrix potts_matrix <- likelihood_gamma(x, pair_neighbours_DA_x1, N) #print(potts_matrix)
This function computes the combined posterior value for a specified component in the model utilizing a variety of distributions (Poisson, Negative Binomial, Zero-Inflated Poisson, Zero-Inflated Negative Binomial). It incorporates both the likelihood of the observed data under the given model parameters and the prior distributions for those parameters. When applicable, it also includes a prior on the dispersion parameter (size) used in Negative Binomial or Zero-Inflated Negative Binomial models.
posterior_combined(pred_combined, params, z, y, x_vars, component, theta, N, use_data_priors, user_fixed_priors, dist, size)posterior_combined(pred_combined, params, z, y, x_vars, component, theta, N, use_data_priors, user_fixed_priors, dist, size)
pred_combined |
A numeric vector of predictor values generated from a prediction function. Typically represents the linear predictors on a log scale. |
params |
A numeric vector of parameters for the model. This includes the regression coefficients. |
z |
A matrix or data structure representing latent variables, components, or states in the model.
For example, in the mixture model, |
y |
A numeric vector of observed counts or interaction values. Each element corresponds to an observation (e.g., interaction counts between genomic loci). |
x_vars |
A list of covariates used in the model. Each element corresponds to a predictor variable,
and all should be of length |
component |
An integer or factor indicating the component (e.g., mixture component) for which the posterior is being computed. Different components might have different parameter sets or priors. |
theta |
(Optional) A numeric value representing the zero-inflation parameter for Zero-Inflated models.
|
N |
An integer specifying the number of observations. This should match the length of |
use_data_priors |
A logical value indicating whether to use data-driven priors for each component.
If |
user_fixed_priors |
A logical value indicating whether to use user-specified priors for each component.
If |
dist |
A character string specifying the distribution family. One of:
|
size |
(Optional) A numeric value for the dispersion parameter in the NB or ZINB distribution. This parameter models overdispersion, allowing the variance to exceed the mean. |
The posterior is computed as:
Steps involved:
Compute the likelihood of the data (y) given the parameters, latent structure (z),
covariates (x_vars), and chosen distribution (dist) using the provided pred_combined
values. This is done via the likelihood_combined function.
Compute the prior distribution over the parameters (params). Depending on use_data_priors
and user_fixed_priors, these priors might be estimated from data or supplied by the user.
This step uses the prior_combined function.
If the distribution is "NB" or "ZINB", incorporate a separate prior on the dispersion
parameter (size). This step uses the size_prior function.
By combining the likelihood and priors, the function returns the posterior value, which is used in the Bayesian inference and MCMC steps to update parameters and perform parameters estimations.
A numeric value representing the posterior log-probability for the given component under the specified distribution and modeling assumptions.
# Example usage of posterior: N <- 5 y <- rpois(N, lambda = 5) x_vars <- list( list(matrix(runif(25), nrow = 5)), list(matrix(runif(25), nrow = 5)), list(matrix(runif(25), nrow = 5)), list(matrix(runif(25), nrow = 5)) ) params <- c(1, 2, 3, 4, 5) z <- matrix(sample(1:3, N * 5, replace = TRUE), nrow = N, ncol = 5) preds_comp1 <- pred_combined(params, z, x_vars, component = 1, N = N) # Compute posterior for component 1 using a Poisson distribution post_val <- posterior_combined( pred_combined = preds_comp1, params = params, z = z, y = y, x_vars = x_vars, component = 1, theta = NULL, N = N, use_data_priors = TRUE, user_fixed_priors = NULL, dist = "Poisson", size = NULL ) #print(post_val) # Display result# Example usage of posterior: N <- 5 y <- rpois(N, lambda = 5) x_vars <- list( list(matrix(runif(25), nrow = 5)), list(matrix(runif(25), nrow = 5)), list(matrix(runif(25), nrow = 5)), list(matrix(runif(25), nrow = 5)) ) params <- c(1, 2, 3, 4, 5) z <- matrix(sample(1:3, N * 5, replace = TRUE), nrow = N, ncol = 5) preds_comp1 <- pred_combined(params, z, x_vars, component = 1, N = N) # Compute posterior for component 1 using a Poisson distribution post_val <- posterior_combined( pred_combined = preds_comp1, params = params, z = z, y = y, x_vars = x_vars, component = 1, theta = NULL, N = N, use_data_priors = TRUE, user_fixed_priors = NULL, dist = "Poisson", size = NULL ) #print(post_val) # Display result
This function computes the predicted linear predictor () values for a specified model component.
It is designed to correct for various sources of bias by incorporating multiple covariates through a
log-transformed relationship. The resulting predictions serves as inputs to downstream likelihood
or posterior calculations in the Bayesian framework.
pred_combined(params, z, x_vars, component, N)pred_combined(params, z, x_vars, component, N)
params |
A numeric vector of parameters for the model. Typically, |
z |
A matrix used to indicate component membership of each observation.
The function uses |
x_vars |
A list of covariates, each element itself a list or vector of length
Each covariate vector must be of length |
component |
An integer specifying which component of the mixture or hierarchical model to use
when subsetting the data. For the three components, |
N |
An integer specifying the total number of observations. This should match the length of the
|
The predicted values () are computed as:
for each observation in the specified component. This log-transform often stabilizes variance and
normalizes skewed distributions of covariates.
The resulting values serves as linear predictors (in a Poisson, NB, ZIP, or ZINB model)
and as inputs into the likelihood and posterior computation steps.
A numeric vector containing the predicted values () for observations belonging to the specified component.
#\donttest{ # Assume we have: # N = 3 observations # z indicates component membership for each observation # x_vars is a list of four covariates, each nested in a list: x_vars[[i]][[1]] N <- 3 params <- c(1, 2, 3, 4, 5) # Example parameter values z <- matrix(c(1, 1, 2, 3, 1, 3, 2, 3, 1), nrow = 3) # Example component matrix x_vars <- list( list(matrix(runif(9, 1, 10), nrow = N)), # first covariate list(matrix(runif(9, 1, 10), nrow = N)), # second covariate list(matrix(runif(9, 1, 10), nrow = N)), # third covariate list(matrix(runif(9, 1, 10), nrow = N)) # fourth covariate ) # Get predicted values for component 1: preds_comp1 <- pred_combined(params, z, x_vars, component = 1, N = N) print(preds_comp1) #}#\donttest{ # Assume we have: # N = 3 observations # z indicates component membership for each observation # x_vars is a list of four covariates, each nested in a list: x_vars[[i]][[1]] N <- 3 params <- c(1, 2, 3, 4, 5) # Example parameter values z <- matrix(c(1, 1, 2, 3, 1, 3, 2, 3, 1), nrow = 3) # Example component matrix x_vars <- list( list(matrix(runif(9, 1, 10), nrow = N)), # first covariate list(matrix(runif(9, 1, 10), nrow = N)), # second covariate list(matrix(runif(9, 1, 10), nrow = N)), # third covariate list(matrix(runif(9, 1, 10), nrow = N)) # fourth covariate ) # Get predicted values for component 1: preds_comp1 <- pred_combined(params, z, x_vars, component = 1, N = N) print(preds_comp1) #}
This function computes the prior contribution to the log-posterior for the parameters associated with a specified model component. It can use either data-driven priors (estimated from the subset of data corresponding to that component) or user-specified fixed priors. By integrating these priors into the Bayesian inference framework, the function helps ensure that parameter estimates remain grounded in reasonable values and incorporate domain knowledge or data-derived distributions.
prior_combined(params, component, y, x_vars, z, use_data_priors = TRUE, user_fixed_priors)prior_combined(params, component, y, x_vars, z, use_data_priors = TRUE, user_fixed_priors)
params |
A numeric vector of parameters for the model. Typically, |
component |
An integer specifying which component of the model to consider. The model is a mixture with multiple components, this argument indicates which component’s parameters and priors are being evaluated. |
y |
A numeric vector of observed responses (e.g., interaction counts). This is used when estimating data-driven
priors. The vector should be of length |
x_vars |
A list of covariate information. Each element of |
z |
A matrix or similar structure used to assign observations to components. If |
use_data_priors |
A logical value indicating whether to use data-driven priors ( |
user_fixed_priors |
A list of user-defined priors for each component. This argument is only used when
|
This function supports two modes:
Data-Driven Priors: When use_data_priors = TRUE, the function subsets the data for the specified component,
calculates summary statistics, and then draws from those statistics to form priors for each parameter.
This approach allows the priors to adapt to the characteristics of the data associated with that component.
User-Specified Fixed Priors: When use_data_priors = FALSE, the function uses priors provided by the user
through user_fixed_priors, bypassing any data-driven estimation.
In both modes, the priors are assumed to be normal distributions for each parameter. The function returns the sum of the log of these priors.
A single numeric value representing the sum of the log-priors for all parameters in params.
This value can be integrated into a Bayesian inference framework to update parameter estimates based on the chosen priors.
# Example: N <- 5 y <- rpois(N*N, lambda = 5) x_vars <- list( list(runif(N*N)), # x1 list(rnorm(N*N)), # x2 list(rnorm(N*N)), # x3 list(rnorm(N*N)) # x4 ) z <- sample(1:3, N*N, replace = TRUE) params <- c(a = 1, b = 0.5, c = -0.2, d = 0.1, e = 0.3) # Using data-driven priors for component 1 #' prior_combined(params, 1, y, x_vars, z, TRUE, NULL) # Using user-specified fixed priors for component 2 #user_fixed_priors <- list( # component1 = list( # meany = 5, meanx1 = 0, meanx2 = 0, meanx3 = 0, meanx4 = 0, # sdy = 1, sdx1 = 1, sdx2 = 1, sdx3 = 1, sdx4 = 1 # ), # component2 = list( # meany = 10, meanx1 = 1, meanx2 = 1, meanx3 = 1, meanx4 = 1, # sdy = 2, sdx1 = 2, sdx2 = 2, sdx3 = 2, sdx4 = 2 # ), # component3 = list( # meany = 2, meanx1 = 2, meanx2 = 2, meanx3 = 2, meanx4 = 2, # sdy = 1, sdx1 = 1, sdx2 = 1, sdx3 = 1, sdx4 = 1 # ) #) # #prior_val_fixed <- prior_combined( # params = params, # component = 2, # y = y, # x_vars = x_vars, # z = z, # use_data_priors = FALSE, # user_fixed_priors = user_fixed_priors #)# Example: N <- 5 y <- rpois(N*N, lambda = 5) x_vars <- list( list(runif(N*N)), # x1 list(rnorm(N*N)), # x2 list(rnorm(N*N)), # x3 list(rnorm(N*N)) # x4 ) z <- sample(1:3, N*N, replace = TRUE) params <- c(a = 1, b = 0.5, c = -0.2, d = 0.1, e = 0.3) # Using data-driven priors for component 1 #' prior_combined(params, 1, y, x_vars, z, TRUE, NULL) # Using user-specified fixed priors for component 2 #user_fixed_priors <- list( # component1 = list( # meany = 5, meanx1 = 0, meanx2 = 0, meanx3 = 0, meanx4 = 0, # sdy = 1, sdx1 = 1, sdx2 = 1, sdx3 = 1, sdx4 = 1 # ), # component2 = list( # meany = 10, meanx1 = 1, meanx2 = 1, meanx3 = 1, meanx4 = 1, # sdy = 2, sdx1 = 2, sdx2 = 2, sdx3 = 2, sdx4 = 2 # ), # component3 = list( # meany = 2, meanx1 = 2, meanx2 = 2, meanx3 = 2, meanx4 = 2, # sdy = 1, sdx1 = 1, sdx2 = 1, sdx3 = 1, sdx4 = 1 # ) #) # #prior_val_fixed <- prior_combined( # params = params, # component = 2, # y = y, # x_vars = x_vars, # z = z, # use_data_priors = FALSE, # user_fixed_priors = user_fixed_priors #)
The process_data function takes a data frame with interaction information and associated covariates
(genomic bins, GC content, transposable elements(TES), and accessibility) and converts it into
a structured list of matrices suitable for modeling. It also provides options for scaling interaction counts and checking for
required columns and missing values.
process_data(data, N, scale_max = NA_real_, standardization_y = FALSE, pad_with_zero = FALSE)process_data(data, N, scale_max = NA_real_, standardization_y = FALSE, pad_with_zero = FALSE)
data |
A
|
N |
An integer specifying the dimension of the resulting |
scale_max |
A numeric value indicating the maximum scaling factor for interaction counts
when |
standardization_y |
A logical value. If |
pad_with_zero |
Logical; if TRUE, zero-pads the last block when nrow(data) is not a multiple of N^2. |
This function processes a long-format data frame where each row represents an interaction between
two loci (denoted by start and end) and associated covariate values.
Key steps include:
Validating that required columns are present and checking for missing values.
Optionally scaling the interaction counts to a fixed range to reduce skew or prepare data for models sensitive to the magnitude of counts.
Computing genomic distances as .
Reshaping the vectorized interaction and covariate data into matrices.
If the number of rows in data is greater than , it splits them into multiple
matrices (one for each segment of length ).
Storing the resulting data in a list structure suitable for downstream analyses.
The returned x_vars list contains multiple matrices for each covariate:
x_vars$distance: A list of one or more matrices containing genomic distances.
x_vars$GC: A list of matrices for GC content.
x_vars$TES: A list of matrices for TES values.
x_vars$ACC: A list of matrices for accessibility scores.
The y element in the returned list contains the scaled (or raw) interaction counts structured
as a list of matrices.
A list with two elements:
x_varsA named list containing lists of matrices for each covariate:
distance, GC, TES, ACC.
yA list of matrices representing the (scaled) interaction counts
corresponding to the covariates in x_vars.
run_chain_betas for downstream MCMC inference
and modeling steps once data is processed into matrix form.
set.seed(123) df <- data.frame( start = rep(1:10, each = 10), end = rep(11:20, times = 10), interactions = rpois(100, 5), GC = runif(100, 0, 1), TES = runif(100, 0, 1), ACC = runif(100, 0, 1) ) processed <- process_data(df, N = 10, scale_max = NA_real_, standardization_y = FALSE, pad_with_zero = FALSE) #x_vars <- processed$x_vars #y_matrices <- processed$y #str(x_vars) # Show structure of covariates #str(y_matrices) # Show structure of interaction matrices # Extended example with larger dataset # Suppose we have a data frame 'large_df' corresponding to a 20x20 interaction matrix #large_df <- data.frame( # start = rep(1:20, each = 20), # end = rep(21:40, times = 20), # interactions = rpois(400, 5), # GC = runif(400, 0, 1), # TES = runif(400, 0, 1), # ACC = runif(400, 0, 1) #) #processed <- process_data(large_df, N = 20, scale_max = NA_real_, # standardization_y = FALSE, pad_with_zero = FALSE) #x_vars <- processed[[1]] #y_matrices <- processed[[2]] #str(x_vars) #str(y_matrices) # See vignette("HiCPotts_vignette") for detailed examples with real Hi-C data. #set.seed(123) df <- data.frame( start = rep(1:10, each = 10), end = rep(11:20, times = 10), interactions = rpois(100, 5), GC = runif(100, 0, 1), TES = runif(100, 0, 1), ACC = runif(100, 0, 1) ) processed <- process_data(df, N = 10, scale_max = NA_real_, standardization_y = FALSE, pad_with_zero = FALSE) #x_vars <- processed$x_vars #y_matrices <- processed$y #str(x_vars) # Show structure of covariates #str(y_matrices) # Show structure of interaction matrices # Extended example with larger dataset # Suppose we have a data frame 'large_df' corresponding to a 20x20 interaction matrix #large_df <- data.frame( # start = rep(1:20, each = 20), # end = rep(21:40, times = 20), # interactions = rpois(400, 5), # GC = runif(400, 0, 1), # TES = runif(400, 0, 1), # ACC = runif(400, 0, 1) #) #processed <- process_data(large_df, N = 20, scale_max = NA_real_, # standardization_y = FALSE, pad_with_zero = FALSE) #x_vars <- processed[[1]] #y_matrices <- processed[[2]] #str(x_vars) #str(y_matrices) # See vignette("HiCPotts_vignette") for detailed examples with real Hi-C data. #
This function computes the proposal density under a given proposal distribution for a specified model component. In a Bayesian MCMC framework, a proposal distribution is used to generate candidate parameter values in each iteration. By evaluating the log-density of the proposed parameters under this distribution, the MCMC algorithm can determine the acceptance or rejection of the new draw.
proposaldensity_combined(params, component)proposaldensity_combined(params, component)
params |
A numeric vector of parameters for which the proposal density is evaluated. The length and meaning of these parameters are determined by the model and the selected component. |
component |
An integer or identifier specifying which component’s proposal distribution should be used. Each component has its own set of means and standard deviations defining the proposal distribution. Valid components should match those defined within the function (e.g., 1, 2, or 3). |
The function defines three components, each with its own proposal distribution parameters (means and standard deviations)
for a set of parameters. It then evaluates the log-density of the params vector under a multivariate normal
distribution (assuming independence between parameters), using the component-specific means and standard deviations.
Steps performed by the function:
Verifies that the specified component is valid and corresponds to one of the defined distributions.
Extracts the mean and standard deviation vectors associated with that component.
Checks that the length of params matches the expected length of the means and standard deviations.
Evaluates the log of the normal probability density function (PDF) for each parameter using the corresponding mean and standard deviation.
Sums these log-densities to produce the overall log-density of the proposal distribution for the parameter vector.
This function is used internally in our MCMC algorithms where proposals are drawn from component-specific Normal distributions. By returning a log-probability, it simplifies calculations involved in Metropolis-Hastings updates and related MCMC procedures.
A single numeric value representing the sum of the log proposal densities of all parameters for the specified component.
# Example usage: # Suppose we have a parameter vector and want to compute its proposal density under component 2: params <- c(0, 3, 5, 6, 1) # parameter vector component <- 2 log_proposal <- proposaldensity_combined(params, component) #print(log_proposal)# Example usage: # Suppose we have a parameter vector and want to compute its proposal density under component 2: params <- c(0, 3, 5, 6, 1) # parameter vector component <- 2 log_proposal <- proposaldensity_combined(params, component) #print(log_proposal)
This function generates a proposal value for the interaction parameter in the Potts model, drawn from a specified Beta distribution. It is typically used within the Markov Chain Monte Carlo (MCMC) framework to propose new candidate values for the interaction parameter at each iteration.
proposalfunction()proposalfunction()
The Potts model describes configurations of states on a lattice, and the interaction parameter
influences how neighboring sites align. An MCMC algorithm typically requires a proposal distribution
to generate new candidate values for parameters. Here, the function draws from a Beta(2,2)
distribution, which, similar to the prior example, draws the proposals towards the higher end of the
\[0,1\] interval.
A numeric value between 0 and 1, drawn from a Beta(2, 2) distribution, serving as a proposal value for the interaction parameter in the Potts model.
# Generate a proposal value for the Potts model interaction parameter proposed_val <- proposalfunction() #proposed_val# Generate a proposal value for the Potts model interaction parameter proposed_val <- proposalfunction() #proposed_val
The run_metropolis_MCMC_betas function serves as a high-level wrapper to execute the MCMC sampling process on Hi-C interaction data. It leverages the run_metropolis_MCMC_betas function to perform Bayesian inference, including updating model parameters (betas, gamma, size, and theta) and mixture component assignments. It can also run in parallel. By running multiple simulations simultaneously, it streamlines and expedites large-scale analyses.
run_chain_betas( N, gamma_prior=0.3, iterations, x_vars, y, theta_start=NULL, size_start=NULL, use_data_priors=TRUE, user_fixed_priors=NULL, dist="ZIP", epsilon=NULL, distance_metric="manhattan", mc_cores = 1 )run_chain_betas( N, gamma_prior=0.3, iterations, x_vars, y, theta_start=NULL, size_start=NULL, use_data_priors=TRUE, user_fixed_priors=NULL, dist="ZIP", epsilon=NULL, distance_metric="manhattan", mc_cores = 1 )
N |
An integer specifying the dimension of each interaction matrix ( |
gamma_prior |
A numeric value specifying the initial (prior) value for the Potts model interaction parameter |
iterations |
An integer indicating the number of MCMC iterations to run for each dataset. During these iterations, parameters (betas, gamma, theta, size) and component (z) are updated repeatedly. |
x_vars |
A list of covariates (e.g., distance, GC, TES, ACC) used in the regression model for the Hi-C interactions.
Each element of |
y |
A |
theta_start |
A numeric value for the initial theta parameter (zero-inflation probability) for ZIP or ZINB distributions.
This parameter controls the degree of zero-inflation in the model. If the chosen distribution does not involve zero-inflation,
|
size_start |
A numeric vector of length 3 providing initial values for the size (overdispersion) parameter,
required when |
use_data_priors |
data driven prior. |
user_fixed_priors |
user-specified prior. |
dist |
A character string specifying the distribution family to use for modeling interaction counts. Options include:
|
epsilon |
(Optional) A numeric value for the ABC threshold |
distance_metric |
A character string specifying the distance measure used in the ABC step. Supported values include:
|
mc_cores |
An integer specifying the number of CPU cores to use for parallel processing.
If |
The run_chain_betas function provides a convenient interface for running MCMC-based inference
on multiple Hi-C datasets simultaneously. It calls run_metropolis_MCMC_betas
which performs:
Estimation of mixture model parameters (betas) that adjust for biases like genomic distance, GC content, TES, and accessibility.
Updating the Potts model interaction parameter via Approximate Bayesian Computation (ABC).
Handling of zero-inflation () if using ZIP or ZINB distributions.
Updating the dispersion parameter () for NB or ZINB distributions.
By passing a list of count matrices through y, users can run multiple independent MCMC chains concurrently,
facilitating simulation studies, sensitivity analyses, or other large-scale inference tasks.
A list of length equal to the length of y. Each element of the returned list is itself a list
(as returned by run_metropolis_MCMC_betas) containing:
chains: MCMC samples of the regression parameters for each mixture component.
gamma: Samples of the Potts model interaction parameter.
theta: Samples of the zero-inflation parameter (if applicable).
size: Samples of the dispersion parameter (if using NB or ZINB).
[run_metropolis_MCMC_betas()]
# # Suppose we have multiple simulated Hi-C data sets: N <- 5 y <- list( matrix(rpois((N*N), lambda = 5), nrow = N) ) gamma_prior <- 0.3 iterations <- 10 x_vars <- list( distance = list(matrix(runif(N * N, 0, 1), nrow = N)), GC = list(matrix(runif(N * N, 0, 1), nrow = N)), TES = list(matrix(runif(N * N, 0, 1), nrow = N)), ACC = list(matrix(runif(N * N, 0, 1), nrow = N)) ) theta_start <- 0.5 size_start <- c(1, 1, 1) # Run the MCMC chains in parallel using 2 cores results <- run_chain_betas( N = N, gamma_prior = gamma_prior, iterations = iterations, x_vars = x_vars, y = y, theta_start = theta_start, use_data_priors = TRUE, user_fixed_priors = FALSE, epsilon = NULL, distance_metric = "manhattan", dist = "ZINB", size_start = size_start, mc_cores = 1 ) print(results) # Each element of 'results' corresponds to the output of run_metropolis_MCMC_betas for each dataset # #/donttest{ user_fixed_priors <- list( component1 = list( meany = 3, meanx1 = 5, meanx2 = 0.2, meanx3 = 0, meanx4 = 4, sdy = 1, sdx1 = 0.01, sdx2 = 0.01, sdx3 = 0.002, sdx4 = 0.005 ), component2 = list( meany = 300, meanx1 = 8, meanx2 = 0.3, meanx3 = 3, meanx4 = 8, sdy = 0.1, sdx1 = 0.1, sdx2 = 0.01, sdx3 = 0.02, sdx4 = 0.02 ), component3 = list( meany = 200, meanx1 = 10, meanx2 = 0.3, meanx3 = 1.6, meanx4 = 4.5, sdy = 0.1, sdx1 = 0.6, sdx2 = 0.1, sdx3 = 0.2, sdx4 = 0.2 ) ) #}# # Suppose we have multiple simulated Hi-C data sets: N <- 5 y <- list( matrix(rpois((N*N), lambda = 5), nrow = N) ) gamma_prior <- 0.3 iterations <- 10 x_vars <- list( distance = list(matrix(runif(N * N, 0, 1), nrow = N)), GC = list(matrix(runif(N * N, 0, 1), nrow = N)), TES = list(matrix(runif(N * N, 0, 1), nrow = N)), ACC = list(matrix(runif(N * N, 0, 1), nrow = N)) ) theta_start <- 0.5 size_start <- c(1, 1, 1) # Run the MCMC chains in parallel using 2 cores results <- run_chain_betas( N = N, gamma_prior = gamma_prior, iterations = iterations, x_vars = x_vars, y = y, theta_start = theta_start, use_data_priors = TRUE, user_fixed_priors = FALSE, epsilon = NULL, distance_metric = "manhattan", dist = "ZINB", size_start = size_start, mc_cores = 1 ) print(results) # Each element of 'results' corresponds to the output of run_metropolis_MCMC_betas for each dataset # #/donttest{ user_fixed_priors <- list( component1 = list( meany = 3, meanx1 = 5, meanx2 = 0.2, meanx3 = 0, meanx4 = 4, sdy = 1, sdx1 = 0.01, sdx2 = 0.01, sdx3 = 0.002, sdx4 = 0.005 ), component2 = list( meany = 300, meanx1 = 8, meanx2 = 0.3, meanx3 = 3, meanx4 = 8, sdy = 0.1, sdx1 = 0.1, sdx2 = 0.01, sdx3 = 0.02, sdx4 = 0.02 ), component3 = list( meany = 200, meanx1 = 10, meanx2 = 0.3, meanx3 = 1.6, meanx4 = 4.5, sdy = 0.1, sdx1 = 0.6, sdx2 = 0.1, sdx3 = 0.2, sdx4 = 0.2 ) ) #}
This function computes the prior contribution of the size parameter (also known as the dispersion parameter)
for a specified component in models using a Negative Binomial (NB) or Zero-Inflated Negative Binomial (ZINB) distribution.
The prior is assumed to follow a Gamma distribution.
size_prior(size_value, component)size_prior(size_value, component)
size_value |
A numeric value representing the |
component |
An integer specifying which component of the mixture model to consider (e.g., |
In NB and ZINB distributions, the size parameter controls the variance relative to the mean.
Bayesian inference often places a prior on this parameter to regularize its estimation.
By assigning a Gamma prior, we leverage its conjugacy properties and flexibility in capturing a range of plausible
dispersion values.
This function assigns a component-specific Gamma prior. Each component can have distinct shape (and possibly rate) parameters, allowing the model to accommodate different dispersion characteristics across mixture components.
The returned value is the log-density of the Gamma prior at the given size_value.
A numeric value representing the log of the prior density for the size_value parameter under the Gamma distribution
defined for the specified component.
# Example: Compute the size prior for size_value = 2.5 in component 1 log_prior_comp1 <- size_prior(size_value = 2.5, component = 1) #log_prior_comp1# Example: Compute the size prior for size_value = 2.5 in component 1 log_prior_comp1 <- size_prior(size_value = 2.5, component = 1) #log_prior_comp1