Title: | Model higher-order methylation profiles |
---|---|
Description: | The BPRMeth package is a probabilistic method to quantify explicit features of methylation profiles, in a way that would make it easier to formally use such profiles in downstream modelling efforts, such as predicting gene expression levels or clustering genomic regions or cells according to their methylation profiles. |
Authors: | Chantriolnt-Andreas Kapourani [aut, cre] |
Maintainer: | Chantriolnt-Andreas Kapourani <[email protected]> |
License: | GPL-3 | file LICENSE |
Version: | 1.33.0 |
Built: | 2024-12-29 03:51:45 UTC |
Source: | https://github.com/bioc/BPRMeth |
A synthetic dataset containinig 300 entries from Bernoulli probit regression observations (e.g. single cell methylation data).
bernoulli_data
bernoulli_data
A list with 300 elements, where each element element is an L x 2 matrix of observations, where:
locations of obseravtions
methylation level, 0 - unmethylated, 1 - methylated
Synthetic Bernoulli methylation data.
binomial_data
, beta_data
,
gaussian_data
A synthetic dataset containinig 300 entries from Beta probit regression observations (e.g. Beta values from array methylation data).
beta_data
beta_data
A list with 300 elements, where each element element is an L x 2 matrix of observations, where:
locations of obseravtions
methylation level
Synthetic Beta methylation data.
The beta regression model is based on alternative parameterization of the beta density in terms of the mean and dispersion parameter:https://cran.r-project.org/web/packages/betareg/.
binomial_data
, bernoulli_data
,
gaussian_data
A synthetic dataset containinig 300 entries from Binomial probit regression observations (e.g. bulk methylation data).
binomial_data
binomial_data
A list with 300 elements, where each element element is an L x 3 matrix of observations, where:
locations of obseravtions
total trials
number of successes
Synthetic Binomial methylation data.
bernoulli_data
, beta_data
,
gaussian_data
Create a boxplot of clustered gene expression levels which depend on the clustered methylation profiles. Each colour denotes a different cluster.
boxplot_cluster_expr(cluster_obj, expr, anno, title = "Expression levels")
boxplot_cluster_expr(cluster_obj, expr, anno, title = "Expression levels")
cluster_obj |
The output from |
expr |
The expression data object. |
anno |
The annotation data object. |
title |
Plot title |
A ggplot2 object.
C.A.Kapourani [email protected]
plot_cluster_profiles
,
plot_infer_profiles
, plot_predicted_expr
# Cluster methylation profiles using 3 RBFs basis <- create_rbf_object(M = 3) # Perform clustering cl_obj <- cluster_profiles_vb(X = encode_met$met, K = 3, model = "binomial", basis = basis, vb_max_iter = 5) # Create plot g <- boxplot_cluster_expr(cluster_obj = cl_obj, expr = encode_expr, anno = encode_met$anno)
# Cluster methylation profiles using 3 RBFs basis <- create_rbf_object(M = 3) # Perform clustering cl_obj <- cluster_profiles_vb(X = encode_met$met, K = 3, model = "binomial", basis = basis, vb_max_iter = 5) # Create plot g <- boxplot_cluster_expr(cluster_obj = cl_obj, expr = encode_expr, anno = encode_met$anno)
(DEPRECATED) bpr_cluster_wrap
is a wrapper function that
clusters methylation profiles using the EM algorithm. Initially, it
performs parameter checking, and initializes main parameters, such as
mixing proportions, basis function coefficients, then the EM algorithm is
applied and finally model selection metrics are calculated, such as BIC and
AIC.
bpr_cluster_wrap( x, K = 3, pi_k = NULL, w = NULL, basis = NULL, em_max_iter = 100, epsilon_conv = 1e-04, lambda = 1/2, opt_method = "CG", opt_itnmax = 100, init_opt_itnmax = 100, is_parallel = TRUE, no_cores = NULL, is_verbose = FALSE )
bpr_cluster_wrap( x, K = 3, pi_k = NULL, w = NULL, basis = NULL, em_max_iter = 100, epsilon_conv = 1e-04, lambda = 1/2, opt_method = "CG", opt_itnmax = 100, init_opt_itnmax = 100, is_parallel = TRUE, no_cores = NULL, is_verbose = FALSE )
x |
The binomial distributed observations, which has to be a list of
elements of length N, where each element is an L x 3 matrix of
observations, where 1st column contains the locations. The 2nd and 3rd
columns contain the total trials and number of successes at the
corresponding locations, repsectively. See
|
K |
Integer denoting the number of clusters K. |
pi_k |
Mixing proportions. |
w |
A MxK matrix, where each column consists of the basis function coefficients for each corresponding cluster. |
basis |
A 'basis' object. E.g. see |
em_max_iter |
Integer denoting the maximum number of EM iterations. |
epsilon_conv |
Numeric denoting the convergence parameter for EM. |
lambda |
The complexity penalty coefficient for ridge regression. |
opt_method |
The optimization method to be used. See
|
opt_itnmax |
Optional argument giving the maximum number of iterations
for the corresponding method. See |
init_opt_itnmax |
Optimization iterations for obtaining the initial EM parameter values. |
is_parallel |
Logical, indicating if code should be run in parallel. |
no_cores |
Number of cores to be used, default is max_no_cores - 2. |
is_verbose |
Logical, print results during EM iterations. |
A 'bpr_cluster' object which, in addition to the input parameters, consists of the following variables:
pi_k
: Fitted
mixing proportions.
w
: A MxK matrix with the fitted
coefficients of the basis functions for each cluster k.
NLL
:
The Negative Log Likelihood after the EM algorithm has finished.
r_nk
: Posterior probabilities of each promoter region
belonging to each cluster.
labels
: Hard clustering
assignments of each observation/promoter region.
BIC
:
Bayesian Information Criterion metric.
AIC
: Akaike
Information Criterion metric.
ICL
: Integrated Complete
Likelihood criterion metric.
C.A.Kapourani [email protected]
ex_data <- meth_data data_clust <- bpr_cluster_wrap(x = ex_data, em_max_iter = 3, opt_itnmax = 3, init_opt_itnmax = 5, is_parallel = FALSE)
ex_data <- meth_data data_clust <- bpr_cluster_wrap(x = ex_data, em_max_iter = 3, opt_itnmax = 3, init_opt_itnmax = 5, is_parallel = FALSE)
These functions evaluate the model log-likelihood and gradient for different observation models. Available models are "bpr" (i.e. "bernoulli" or "binomial"), "beta" and "lr" (i.e. "gaussian"). There are also functions to compute the sum and weighted sum of the observation model likelihoods, e.g. required for the EM algorithm. These functions are written in C++ for efficiency.
bpr_log_likelihood(w, X, H, lambda, is_nll) bpr_gradient(w, X, H, lambda, is_nll) betareg_log_likelihood(w, X, H, lambda, is_nll) betareg_gradient(w, X, H, lambda, is_nll) sum_weighted_bpr_lik(w, X_list, H_list, r_nk, lambda, is_nll) sum_weighted_bpr_grad(w, X_list, H_list, r_nk, lambda, is_nll) sum_weighted_betareg_lik(w, X_list, H_list, r_nk, lambda, is_nll) sum_weighted_betareg_grad(w, X_list, H_list, r_nk, lambda, is_nll) lr_log_likelihood(w, X, H, lambda = 0.5, is_nll = FALSE)
bpr_log_likelihood(w, X, H, lambda, is_nll) bpr_gradient(w, X, H, lambda, is_nll) betareg_log_likelihood(w, X, H, lambda, is_nll) betareg_gradient(w, X, H, lambda, is_nll) sum_weighted_bpr_lik(w, X_list, H_list, r_nk, lambda, is_nll) sum_weighted_bpr_grad(w, X_list, H_list, r_nk, lambda, is_nll) sum_weighted_betareg_lik(w, X_list, H_list, r_nk, lambda, is_nll) sum_weighted_betareg_grad(w, X_list, H_list, r_nk, lambda, is_nll) lr_log_likelihood(w, X, H, lambda = 0.5, is_nll = FALSE)
w |
A vector of parameters (i.e. coefficients of the basis functions) |
X |
An |
H |
The |
lambda |
The complexity penalty coefficient for penalized regression. |
is_nll |
Logical, indicating if the Negative Log Likelihood should be returned. |
X_list |
A list of elements of length N, where each element is an
|
H_list |
A list of elements of length N, where each element contains the
|
r_nk |
A vector of length N containing the posterior probabilities (i.e.
responsibilities) for each element of |
Returns the log likelihood or gradient of the observation model.
C.A.Kapourani [email protected]
eval_functions
, infer_profiles_mle
(DEPRECATED) The function bpr_optimize minimizes the negative
log likelihood of the BPR function. Since it cannot be evaluated
analytically, an optimization procedure is used. The
optim
packages is used for performing optimization.
bpr_optim(x, ...) ## S3 method for class 'list' bpr_optim( x, w = NULL, basis = NULL, lambda = 1/2, opt_method = "CG", opt_itnmax = 100, is_parallel = TRUE, no_cores = NULL, ... ) ## S3 method for class 'matrix' bpr_optim( x, w = NULL, basis = NULL, lambda = 1/2, opt_method = "CG", opt_itnmax = 100, ... )
bpr_optim(x, ...) ## S3 method for class 'list' bpr_optim( x, w = NULL, basis = NULL, lambda = 1/2, opt_method = "CG", opt_itnmax = 100, is_parallel = TRUE, no_cores = NULL, ... ) ## S3 method for class 'matrix' bpr_optim( x, w = NULL, basis = NULL, lambda = 1/2, opt_method = "CG", opt_itnmax = 100, ... )
x |
|
... |
Additional parameters. |
w |
A vector of parameters (i.e. coefficients of the basis functions) |
basis |
A 'basis' object. E.g. see |
lambda |
The complexity penalty coefficient for ridge regression. |
opt_method |
The optimization method to be used. See
|
opt_itnmax |
Optional argument giving the maximum number of iterations
for the corresponding method. See |
is_parallel |
Logical, indicating if code should be run in parallel. |
no_cores |
Number of cores to be used, default is max_no_cores - 2. |
Depending on the input object x
:
If x
is
a list
: An object containing the following elements:
W_opt
: An Nx(M+1) matrix with the optimized
parameter values. Each row of the matrix corresponds to each element of the
list x. The columns are of the same length as the parameter vector w (i.e.
number of basis functions).
Mus
: An N x M matrix with the
RBF centers if basis object is create_rbf_object
, otherwise
NULL.
basis
: The basis object.
w
: The
initial values of the parameters w.
If x
is a
matrix
: An object containing the following elements:
w_opt
: Optimized values for the coefficient vector
w. The length of the result is the same as the length of the vector w.
basis
: The basis object.
If calling
bpr_optim_fast
just the optimal weight matrix W_opt.
C.A.Kapourani [email protected]
# Example of optimizing parameters for synthetic data using default values data <- meth_data out_opt <- bpr_optim(x = data, is_parallel = FALSE, opt_itnmax = 3) #------------------------------------- # Example of optimizing parameters for synthetic data using 3 RBFs ex_data <- meth_data basis <- create_rbf_object(M=3) out_opt <- bpr_optim(x = ex_data, is_parallel = FALSE, basis = basis, opt_itnmax = 3) #------------------------------------- # Example of of specific promoter region using 2 RBFs basis <- create_rbf_object(M=2) w <- c(0.1, 0.1, 0.1) data <- meth_data[[1]] out_opt <- bpr_optim(x = data, w = w, basis = basis, fit_feature = "NLL", opt_itnmax = 3)
# Example of optimizing parameters for synthetic data using default values data <- meth_data out_opt <- bpr_optim(x = data, is_parallel = FALSE, opt_itnmax = 3) #------------------------------------- # Example of optimizing parameters for synthetic data using 3 RBFs ex_data <- meth_data basis <- create_rbf_object(M=3) out_opt <- bpr_optim(x = ex_data, is_parallel = FALSE, basis = basis, opt_itnmax = 3) #------------------------------------- # Example of of specific promoter region using 2 RBFs basis <- create_rbf_object(M=2) w <- c(0.1, 0.1, 0.1) data <- meth_data[[1]] out_opt <- bpr_optim(x = data, w = w, basis = basis, fit_feature = "NLL", opt_itnmax = 3)
(DEPRECATED) bpr_predict_wrap
is a function that wraps
all the necessary subroutines for performing prediction on gene expression
levels. Initially, it optimizes the parameters of the basis functions so as
to learn the methylation profiles. Then, uses the learned parameters /
coefficients of the basis functions as input features for performing
regression in order to predict the corresponding gene expression levels.
bpr_predict_wrap( formula = NULL, x, y, model_name = "svm", w = NULL, basis = NULL, train_ind = NULL, train_perc = 0.7, fit_feature = "RMSE", cov_feature = TRUE, opt_method = "CG", opt_itnmax = 100, is_parallel = TRUE, no_cores = NULL, is_summary = TRUE )
bpr_predict_wrap( formula = NULL, x, y, model_name = "svm", w = NULL, basis = NULL, train_ind = NULL, train_perc = 0.7, fit_feature = "RMSE", cov_feature = TRUE, opt_method = "CG", opt_itnmax = 100, is_parallel = TRUE, no_cores = NULL, is_summary = TRUE )
formula |
An object of class |
x |
The binomial distributed observations, which has to be a list of
elements of length N, where each element is an L x 3 matrix of
observations, where 1st column contains the locations. The 2nd and 3rd
columns contain the total trials and number of successes at the
corresponding locations, repsectively. See
|
y |
Corresponding gene expression data for each element of the list x. |
model_name |
A string denoting the regression model. Currently,
available models are: |
w |
Optional vector of initial parameter / coefficient values. |
basis |
Optional basis function object, default is an 'rbf' object, see
|
train_ind |
Optional vector containing the indices for the train set. |
train_perc |
Optional parameter for defining the percentage of the dataset to be used for training set, the remaining will be the test set. |
fit_feature |
Return additional feature on how well the profile fits the methylation data. Either NULL for ignoring this feature or one of the following: 1) "RMSE" for returning the fit of the profile using the RMSE as measure of error or 2) "NLL" for returning the fit of the profile using the Negative Log Likelihood as measure of error. |
cov_feature |
Logical, whether to return an additional feature for the CpG coverage across the promoter region. |
opt_method |
The optimization method to be used. See
|
opt_itnmax |
Optional argument giving the maximum number of iterations
for the corresponding method. See |
is_parallel |
Logical, indicating if code should be run in parallel. |
no_cores |
Number of cores to be used, default is max_no_cores - 2. |
is_summary |
Logical, print the summary statistics. |
A 'bpr_predict' object which, in addition to the input parameters, consists of the following variables:
W_opt
: An
Nx(M+1) matrix with the optimized parameter values. Each row of the matrix
corresponds to each element of the list x. The columns are of the same
length as the parameter vector w (i.e. number of basis functions).
Mus
: An N x M matrix with the RBF centers if basis object is
create_rbf_object
, otherwise NULL.
train: The training data.
test: The test data.
gex_model
: The
fitted regression model.
train_pred
The predicted values for
the training data.
test_pred
The predicted values for the test
data.
train_errors
: The training error metrics.
test_errors
: The test error metrics.
C.A.Kapourani [email protected]
obs <- meth_data y <- gex_data basis <- create_rbf_object(M = 5) out <- bpr_predict_wrap(x = obs, y = y, basis = basis, is_parallel = FALSE, opt_itnmax = 3)
obs <- meth_data y <- gex_data basis <- create_rbf_object(M = 5) out <- bpr_predict_wrap(x = obs, y = y, basis = basis, is_parallel = FALSE, opt_itnmax = 3)
BPRMeth
: Extracting higher order methylation featuresHigher order methylation features for clustering and prediction in epigenomic studies
.datatable.aware
.datatable.aware
An object of class logical
of length 1.
BPRMeth main package documentation.
C.A.Kapourani [email protected]
General purpose functions for clustering latent profiles for different observation models using maximum likelihood estimation (MLE) and the EM algorithm. Initially, it performs parameter checking, and initializes main parameters, such as mixing proportions, basis function coefficients, then the EM algorithm is applied and finally model selection metrics are calculated, such as BIC and AIC.
cluster_profiles_mle( X, K = 3, model = NULL, basis = NULL, H = NULL, pi_k = NULL, lambda = 0.5, beta_dispersion = 5, gaussian_sigma = rep(0.2, K), w = NULL, em_max_iter = 50, epsilon_conv = 1e-04, opt_method = "CG", opt_itnmax = 50, init_opt_itnmax = 30, is_parallel = FALSE, no_cores = NULL, is_verbose = FALSE, ... )
cluster_profiles_mle( X, K = 3, model = NULL, basis = NULL, H = NULL, pi_k = NULL, lambda = 0.5, beta_dispersion = 5, gaussian_sigma = rep(0.2, K), w = NULL, em_max_iter = 50, epsilon_conv = 1e-04, opt_method = "CG", opt_itnmax = 50, init_opt_itnmax = 30, is_parallel = FALSE, no_cores = NULL, is_verbose = FALSE, ... )
X |
The input data, which has to be a |
K |
Integer denoting the total number of clusters K. |
model |
Observation model name as character string. It can be either 'bernoulli', 'binomial', 'beta' or 'gaussian'. |
basis |
A 'basis' object. E.g. see |
H |
Optional, design matrix of the input data X. If NULL, H will be computed inside the function. |
pi_k |
Vector of length K, denoting the mixing proportions. |
lambda |
The complexity penalty coefficient for ridge regression. |
beta_dispersion |
Dispersion parameter, only used for Beta distribution and will be the same for all observations. |
gaussian_sigma |
Initial standard deviation of the noise term, only used when having "gaussian" observation model. |
w |
Optional, an (M+1)xK matrix of the initial parameters, where each column consists of the basis function coefficients for each corresponding cluster k. If NULL, will be assigned with default values. |
em_max_iter |
Integer denoting the maximum number of EM iterations. |
epsilon_conv |
Numeric denoting the convergence threshold for EM. |
opt_method |
The optimization method to be used. See
|
opt_itnmax |
Optional argument giving the maximum number of iterations
for the corresponding method. See |
init_opt_itnmax |
Optimization iterations for obtaining the initial EM parameter values. |
is_parallel |
Logical, indicating if code should be run in parallel. |
no_cores |
Number of cores to be used, default is max_no_cores - 1. |
is_verbose |
Logical, print results during EM iterations. |
... |
Additional parameters. |
An object of class cluster_profiles_mle_
"obs_model" with the
following elements:
W
: An (M+1) X K matrix with the
optimized parameter values for each cluster. Each column of the matrix
corresponds a different cluster k. M are the number of basis functions.
pi_k
: Mixing proportions.
r_nk
: An (N X K)
responsibility matrix of each observations being explained by a specific
cluster.
basis
: The basis object.
nll
: The
negative log likelihood vector.
labels
: Cluster assignment
labels.
bic
: Bayesian Information Criterion metric.
aic
: Akaike Information Criterion metric.
icl
:
Integrated Complete Likelihood criterion metric.
gaussian_sigma
: Optimized standard deviation for gaussian
observation model.
The beta regression model is based on alternative parameterization of the beta density in terms of the mean and dispersion parameter: https://cran.r-project.org/web/packages/betareg/. For modelling details for Binomial/Bernoulli observation model check the paper for BPRMeth: https://academic.oup.com/bioinformatics/article/32/17/i405/2450762 .
C.A.Kapourani [email protected]
create_basis
, cluster_profiles_vb
infer_profiles_vb
, infer_profiles_mle
,
infer_profiles_gibbs
, create_region_object
# Example of optimizing parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- cluster_profiles_mle(X = binomial_data, model = "binomial", basis=basis, em_max_iter = 5, opt_itnmax = 5, init_opt_itnmax=5, is_parallel = FALSE) #------------------------------------- basis <- create_rbf_object(M=3) out <- cluster_profiles_mle(X = gaussian_data, model = "gaussian", basis=basis, em_max_iter = 5, opt_itnmax = 5, init_opt_itnmax=5, is_parallel = FALSE)
# Example of optimizing parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- cluster_profiles_mle(X = binomial_data, model = "binomial", basis=basis, em_max_iter = 5, opt_itnmax = 5, init_opt_itnmax=5, is_parallel = FALSE) #------------------------------------- basis <- create_rbf_object(M=3) out <- cluster_profiles_mle(X = gaussian_data, model = "gaussian", basis=basis, em_max_iter = 5, opt_itnmax = 5, init_opt_itnmax=5, is_parallel = FALSE)
General purpose functions for clustering latent profiles for different observation models using Variational Bayes (VB) EM-like algorithm.
cluster_profiles_vb( X, K = 3, model = NULL, basis = NULL, H = NULL, delta_0 = NULL, w = NULL, gaussian_l = 50, alpha_0 = 0.5, beta_0 = 0.1, vb_max_iter = 100, epsilon_conv = 1e-04, is_verbose = FALSE, ... )
cluster_profiles_vb( X, K = 3, model = NULL, basis = NULL, H = NULL, delta_0 = NULL, w = NULL, gaussian_l = 50, alpha_0 = 0.5, beta_0 = 0.1, vb_max_iter = 100, epsilon_conv = 1e-04, is_verbose = FALSE, ... )
X |
The input data, which has to be a |
K |
Integer denoting the total number of clusters K. |
model |
Observation model name as character string. It can be either 'bernoulli', 'binomial', 'beta' or 'gaussian'. |
basis |
A 'basis' object. E.g. see |
H |
Optional, design matrix of the input data X. If NULL, H will be computed inside the function. |
delta_0 |
Parameter vector of the Dirichlet prior on the mixing proportions pi. |
w |
Optional, an (M+1)xK matrix of the initial parameters, where each column consists of the basis function coefficients for each corresponding cluster k. If NULL, will be assigned with default values. |
gaussian_l |
Noise precision parameter, only used when having "gaussian" observation model. |
alpha_0 |
Hyperparameter: shape parameter for Gamma distribution. A Gamma distribution is used as prior for the precision parameter tau. |
beta_0 |
Hyperparameter: rate parameter for Gamma distribution. A Gamma distribution is used as prior for the precision parameter tau. |
vb_max_iter |
Integer denoting the maximum number of VB iterations. |
epsilon_conv |
Numeric denoting the convergence threshold for VB. |
is_verbose |
Logical, print results during VB iterations. |
... |
Additional parameters. |
An object of class cluster_profiles_vb_
"obs_model" with the
following elements:
W
: An (M+1) X K matrix with the
optimized parameter values for each cluster, M are the number of basis
functions. Each column of the matrix corresponds a different cluster k.
W_Sigma
: A list with the covariance matrices of the posterior
parmateter W for each cluster k.
r_nk
: An (N X K)
responsibility matrix of each observations being explained by a specific
cluster.
delta
: Optimized Dirichlet paramter for the mixing
proportions.
alpha
: Optimized shape parameter of Gamma
distribution.
beta
: Optimized rate paramter of the Gamma
distribution
basis
: The basis object.
lb
:
The lower bound vector.
labels
: Cluster assignment labels.
pi_k
: Expected value of mixing proportions.
The modelling and mathematical details for clustering profiles using mean-field variational inference are explained here: http://rpubs.com/cakapourani/ . More specifically:
For Binomial/Bernoulli observation model check: http://rpubs.com/cakapourani/vb-mixture-bpr
For Gaussian observation model check: http://rpubs.com/cakapourani/vb-mixture-lr
C.A.Kapourani [email protected]
create_basis
, cluster_profiles_mle
infer_profiles_vb
, infer_profiles_mle
,
infer_profiles_gibbs
, create_region_object
# Example of optimizing parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- cluster_profiles_vb(X = binomial_data, model = "binomial", basis=basis, vb_max_iter = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- cluster_profiles_vb(X = gaussian_data, model = "gaussian", basis=basis, vb_max_iter = 10)
# Example of optimizing parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- cluster_profiles_vb(X = binomial_data, model = "binomial", basis=basis, vb_max_iter = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- cluster_profiles_vb(X = gaussian_data, model = "gaussian", basis=basis, vb_max_iter = 10)
create_anno_region
creates annotation regions from
annotation data, using the central point of the annotation features as
ground truth labels we create genomic regions N
bp upstream and
M
bp downstream of central location.
create_anno_region( anno, chrom_size = NULL, is_centre = FALSE, is_window = TRUE, upstream = -5000, downstream = 5000 )
create_anno_region( anno, chrom_size = NULL, is_centre = FALSE, is_window = TRUE, upstream = -5000, downstream = 5000 )
anno |
A |
chrom_size |
Object containing genome chromosome sizes, normally would
be the output of |
is_centre |
Logical, whether 'start' and 'end' locations are pre-centred. If TRUE, the mean of the locations will be chosen as centre. If FALSE, the 'start' will be chosen as the center; e.g. for genes the 'start' denotes the TSS and we use this as centre to obtain K-bp upstream and downstream of TSS. |
is_window |
Whether to consider a predefined window region around centre. If TRUE, then 'upstream' and 'downstream' parameters are used, otherwise we consider the whole region from start to end location. |
upstream |
Integer defining the length of bp upstream of 'centre' for creating the genomic region. If is_window = FALSE, this parameter is ignored. |
downstream |
Integer defining the length of bp downstream of 'centre' for creating the genomic region. If is_window = FALSE, this parameter is ignored. |
A GRanges
object containing the genomic regions.
The GRanges object contains two or three additional metadata column:
id
: Genomic region id.
centre
: Central
location of each genomic region.
name
: (Optional) Genomic
region name.
This column can be accessed as follows:
granges_object$tss
C.A.Kapourani [email protected]
create_region_object
, read_anno
# Obtain the path to files file <- system.file("extdata", "dummy_anno.bed", package = "BPRMeth") anno_dt <- read_anno(file, is_anno_region = FALSE) # Create genomic region gen_region <- create_anno_region(anno_dt) # Extract ID id <- gen_region$id
# Obtain the path to files file <- system.file("extdata", "dummy_anno.bed", package = "BPRMeth") anno_dt <- read_anno(file, is_anno_region = FALSE) # Create genomic region gen_region <- create_anno_region(anno_dt) # Extract ID id <- gen_region$id
These functions create different basis objects, which can be used as input to complex functions in order to perform computations depending on the class of the basis function.
create_rbf_object( M = 2, gamma = NULL, mus = NULL, eq_spaced_mus = TRUE, whole_region = TRUE ) create_polynomial_object(M = 1) create_fourier_object(M = 2, period = 2)
create_rbf_object( M = 2, gamma = NULL, mus = NULL, eq_spaced_mus = TRUE, whole_region = TRUE ) create_polynomial_object(M = 1) create_fourier_object(M = 2, period = 2)
M |
The number of the basis functions. In case of Fourier basis, this number should be even, since we need to have pairs of sines and cosines and the constant term is added by default. |
gamma |
Inverse width of radial basis function. |
mus |
Optional centers of the RBF. |
eq_spaced_mus |
Logical, if TRUE, equally spaced centers are created,
otherwise centers are created using |
whole_region |
Logical, indicating if the centers will be evaluated equally spaced on the whole region, or between the min and max of the observation values. |
period |
The period, that is the basis functions are periodic on a specific interval. Best choice is the range of the points used for regression. |
A basis object of class 'rbf', 'polynomial' or 'fourier'.
C.A.Kapourani [email protected]
(obj <- create_rbf_object(M = 2)) #--------------------------------- (obj <- create_polynomial_object(M = 2)) (obj <- create_fourier_object(M = 2, period = 1))
(obj <- create_rbf_object(M = 2)) #--------------------------------- (obj <- create_polynomial_object(M = 2)) (obj <- create_fourier_object(M = 2, period = 1))
create_region_object
creates genomic regions (e.g. forms
methylation regions data) using as input methylation and annotation data
with genomic regions of interest.
create_region_object( met_dt, anno_dt, cov = 5, sd_thresh = 0.1, ignore_strand = TRUE, filter_empty_region = TRUE, fmin = -1, fmax = 1 )
create_region_object( met_dt, anno_dt, cov = 5, sd_thresh = 0.1, ignore_strand = TRUE, filter_empty_region = TRUE, fmin = -1, fmax = 1 )
met_dt |
A |
anno_dt |
A |
cov |
Integer defining the minimum coverage of CpGs that each region must contain. |
sd_thresh |
Optional numeric defining the minimum standard deviation of the methylation change in a region. This is used to filter regions with no methylation variability. |
ignore_strand |
Logical, whether or not to ignore strand information. |
filter_empty_region |
Logical, whether to discard genomic regions that have no CpG coverage or do not pass filtering options. |
fmin |
Minimum range value for location scaling. Under this version, it should be left to its default value -1. |
fmax |
Maximum range value for location scaling. Under this version, it should be left to its default value 1. |
A list
object containing the two elements:
met
: A list containing methylation region data, where each entry in
the list is an dimensional matrix, where
denotes the number of CpGs found in region
i
. The columns contain
the following information:
1st column: Contains the locations of CpGs relative to centre. Note that the actual locations are scaled to the (fmin, fmax) region.
2nd column: If "bulk" data (i.e. binomial) it contains the total number of reads at each CpG location, otherwise the methylation level.
3rd column: If "bulk" data, the methylated reads at each CpG location, otherwise this D = 2 and this column is absent.
. Rownames of each matrix contain the actual CpG genomic coordinates as <chr>:<location>.
anno
: The annotation
object.
Note: The lengths of met
and anno
should match.
C.A.Kapourani [email protected]
## Not run: # Download the files and change the working directory to that location met_dt <- read_met("name_of_met_file") anno_dt <- read_anno("name_of_anno_file") obj <- create_region_object(met_dt, anno_dt) # Extract methylation regions met <- obj$met ## End(Not run)
## Not run: # Download the files and change the working directory to that location met_dt <- read_met("name_of_met_file") anno_dt <- read_anno("name_of_anno_file") obj <- create_region_object(met_dt, anno_dt) # Extract methylation regions met <- obj$met ## End(Not run)
These functions call the appropriate methods depending on the
class of the object obj
to create RBF, polynomial or Fourier design
matrices.
design_matrix(obj, ...) ## Default S3 method: design_matrix(obj, ...) ## S3 method for class 'polynomial' design_matrix(obj, obs, ...) ## S3 method for class 'rbf' design_matrix(obj, obs, ...) ## S3 method for class 'fourier' design_matrix(obj, obs, ...)
design_matrix(obj, ...) ## Default S3 method: design_matrix(obj, ...) ## S3 method for class 'polynomial' design_matrix(obj, obs, ...) ## S3 method for class 'rbf' design_matrix(obj, obs, ...) ## S3 method for class 'fourier' design_matrix(obj, obs, ...)
obj |
A basis function object. |
... |
Additional parameters. |
obs |
A vector of observations. |
A design matrix object
C.A.Kapourani [email protected]
obj <- create_polynomial_object(M=2) obs <- c(0,.2,.5) poly <- design_matrix(obj, obs) #---------------- obj <- create_rbf_object(M=2) obs <- c(0,.2,.5) rbf <- design_matrix(obj, obs) #---------------- obj <- create_fourier_object(M=2) obs <- c(0,.2,.5) fourier <- design_matrix(obj, obs)
obj <- create_polynomial_object(M=2) obs <- c(0,.2,.5) poly <- design_matrix(obj, obs) #---------------- obj <- create_rbf_object(M=2) obs <- c(0,.2,.5) rbf <- design_matrix(obj, obs) #---------------- obj <- create_fourier_object(M=2) obs <- c(0,.2,.5) fourier <- design_matrix(obj, obs)
Small subset of ENCODE expression data already in pre-processed format, which are used as a case study for the vignette.
encode_expr
encode_expr
Expression data in format as the output of read_expr
fucntion, i.e. a data.table
with two columns: "id", "expr".
Encode expression data
create_region_object
, read_met
,
read_anno
, read_expr
Small subset of ENCODE methylation data already in pre-processed format, which are used as a case study for the vignette.
encode_met
encode_met
A list object containing methylation regions and annotation data.
This in general would be the output of the
create_region_object
function. It has the following two
objects:
met
: A list containing the methylation
regions, each element of the list is a different genomic region.
anno
: Corresponding annotation data.
Encode methylation data
create_region_object
, read_met
,
read_anno
, read_expr
Method for evaluating an M basis function model with observation
data obs
and coefficients w
.
eval_probit_function(obj, ...) eval_function(obj, ...) ## S3 method for class 'rbf' eval_function(obj, obs, w, ...) ## S3 method for class 'polynomial' eval_function(obj, obs, w, ...) ## S3 method for class 'fourier' eval_function(obj, obs, w, ...)
eval_probit_function(obj, ...) eval_function(obj, ...) ## S3 method for class 'rbf' eval_function(obj, obs, w, ...) ## S3 method for class 'polynomial' eval_function(obj, obs, w, ...) ## S3 method for class 'fourier' eval_function(obj, obs, w, ...)
obj |
The basis function object. |
... |
Optional additional parameters |
obs |
Observation data. |
w |
Vector of length M, containing the coefficients of an
M |
The evaluated function values.
NOTE that the eval_probit_function
computes the probit transformed
basis function values.
C.A.Kapourani [email protected]
# Evaluate the probit transformed basis function values obj <- create_rbf_object(M=2) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_probit_function(obj, obs, w) # ------------------------- # Evaluate the RBF basis function values obj <- create_rbf_object(M=2, mus = c(2,2.5)) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_function(obj, obs, w) # ------------------------- # Evaluate the Polynomial basis function values obj <- create_polynomial_object(M=2) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_function(obj, obs, w) # ------------------------- # Evaluate the Fourier basis function values obj <- create_fourier_object(M=2) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_function(obj, obs, w)
# Evaluate the probit transformed basis function values obj <- create_rbf_object(M=2) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_probit_function(obj, obs, w) # ------------------------- # Evaluate the RBF basis function values obj <- create_rbf_object(M=2, mus = c(2,2.5)) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_function(obj, obs, w) # ------------------------- # Evaluate the Polynomial basis function values obj <- create_polynomial_object(M=2) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_function(obj, obs, w) # ------------------------- # Evaluate the Fourier basis function values obj <- create_fourier_object(M=2) obs <- c(1,2,3) w <- c(0.1, 0.3, -0.6) out <- eval_function(obj, obs, w)
A synthetic dataset containinig 300 entries from Gaussian (i.e. linear) regression observations (e.g. M-values from array methylation data).
gaussian_data
gaussian_data
A list with 300 elements, where each element element is an L x 2 matrix of observations, where:
locations of obseravtions
methylation level
Synthetic Bernoulli methylation data.
binomial_data
, beta_data
,
bernoulli_data
Corresponding gene expression data for the
meth_data
gex_data
gex_data
A vector of length 600
Synthetic gene expression data
meth_data
, bernoulli_data
,
binomial_data
, beta_data
,
gaussian_data
Make predictions of missing methylation states, i.e. perfrom imputation using BPRmeth This requires keepin a subset of data as a held out test set during BPRMeth inference or providing a different file that contains chromosome and CpG locations.
impute_bulk_met(obj, anno, test_data = NULL, return_test = FALSE)
impute_bulk_met(obj, anno, test_data = NULL, return_test = FALSE)
obj |
Output of BPRMeth inference object. |
anno |
A |
test_data |
Test data to evaluate performance. |
return_test |
Whether or not to return a list with the predictions. |
A list containing two vectors, the true methylation state and the predicted/imputed methylation states.
C.A.Kapourani [email protected]
# Extract synthetic data dt <- encode_met # Partition to train and test set dt <- partition_bulk_dataset(dt) # Create basis object basis_obj <- create_rbf_object(M = 3) # Run BPRMeth fit <- infer_profiles_mle(X = dt$met, model = "binomial", basis = basis_obj, is_parallel = FALSE, opt_itnmax = 10) # Perform imputation imputation_obj <- impute_bulk_met(obj = fit, anno = dt$anno, test_data = dt$met_test)
# Extract synthetic data dt <- encode_met # Partition to train and test set dt <- partition_bulk_dataset(dt) # Create basis object basis_obj <- create_rbf_object(M = 3) # Run BPRMeth fit <- infer_profiles_mle(X = dt$met, model = "binomial", basis = basis_obj, is_parallel = FALSE, opt_itnmax = 10) # Perform imputation imputation_obj <- impute_bulk_met(obj = fit, anno = dt$anno, test_data = dt$met_test)
General purpose functions for inferring latent profiles for different observation models using Gibbs sampling. Currently implemented observation models are: 'bernoulli' and 'binomial' and the auxiliary variable approach is used.
infer_profiles_gibbs( X, model = NULL, basis = NULL, H = NULL, w = NULL, mu_0 = NULL, cov_0 = NULL, gibbs_nsim = 500, gibbs_burn_in = 100, store_gibbs_draws = FALSE, is_parallel = FALSE, no_cores = NULL, ... )
infer_profiles_gibbs( X, model = NULL, basis = NULL, H = NULL, w = NULL, mu_0 = NULL, cov_0 = NULL, gibbs_nsim = 500, gibbs_burn_in = 100, store_gibbs_draws = FALSE, is_parallel = FALSE, no_cores = NULL, ... )
X |
The input data, either a |
model |
Observation model name as character string. It can be either 'bernoulli' or 'binomial'. |
basis |
A 'basis' object. E.g. see |
H |
Optional, design matrix of the input data X. If NULL, H will be computed inside the function. |
w |
A vector of initial parameters (i.e. coefficients of the basis functions). If NULL, it will be initialized inside the function. |
mu_0 |
The prior mean hyperparameter vector for w. |
cov_0 |
The prior covariance hyperparameter matrix for w. |
gibbs_nsim |
Total number of simulations for the Gibbs sampler. |
gibbs_burn_in |
Burn in period of the Gibbs sampler. |
store_gibbs_draws |
Logical indicating if we should keep the whole MCMC chain for further analysis. |
is_parallel |
Logical, indicating if code should be run in parallel. |
no_cores |
Number of cores to be used, default is max_no_cores - 1. |
... |
Additional parameters. |
An object of class infer_profiles_gibbs_
"obs_model" with the
following elements:
W
: An Nx(M+1) matrix with the
posterior mean of the parameters w. Each row of the matrix corresponds to
each element of the list X; if X is a matrix, then N = 1. The columns are
of the same length as the parameter vector w (i.e. number of basis
functions).
W_sd
: An Nx(M+1) matrix with the posterior
standard deviation (sd) of the parameters W.
basis
: The
basis object.
nll_feat
: NLL fit feature.
rmse_feat
: RMSE fit feature.
coverage_feat
: CpG
coverage feature.
W_draws
: Optional, draws of the Gibbs
sampler.
The modelling and mathematical details for inferring profiles using Gibbs sampling are explained here: http://rpubs.com/cakapourani/ . More specifically:
For Binomial observation model check: http://rpubs.com/cakapourani/bayesian-bpr-model
For Bernoulli observation model check: http://rpubs.com/cakapourani/bayesian-bpr-model
C.A.Kapourani [email protected]
create_basis
, infer_profiles_mle
,
infer_profiles_vb
, create_region_object
# Example of inferring parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- infer_profiles_gibbs(X = binomial_data, model = "binomial", basis = basis, is_parallel = FALSE, gibbs_nsim = 10, gibbs_burn_in = 5) #-------------------------------------
# Example of inferring parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- infer_profiles_gibbs(X = binomial_data, model = "binomial", basis = basis, is_parallel = FALSE, gibbs_nsim = 10, gibbs_burn_in = 5) #-------------------------------------
General purpose functions for inferring latent profiles for
different observation models using maximum likelihood estimation (MLE).
Current observation models are: 'bernoulli', 'binomial', 'beta' or
'gaussian'. For most models we cannot obtain an analytically tractable
solution, hence an optimization procedure is used. The
optim
package is used for performing optimization.
infer_profiles_mle( X, model = NULL, basis = NULL, H = NULL, lambda = 0.5, w = NULL, beta_dispersion = 5, opt_method = "CG", opt_itnmax = 100, is_parallel = FALSE, no_cores = NULL, ... )
infer_profiles_mle( X, model = NULL, basis = NULL, H = NULL, lambda = 0.5, w = NULL, beta_dispersion = 5, opt_method = "CG", opt_itnmax = 100, is_parallel = FALSE, no_cores = NULL, ... )
X |
The input data, either a |
model |
Observation model name as character string. It can be either 'bernoulli', 'binomial', 'beta' or 'gaussian'. |
basis |
A 'basis' object. E.g. see |
H |
Optional, design matrix of the input data X. If NULL, H will be computed inside the function. |
lambda |
The complexity penalty coefficient for ridge regression. |
w |
A vector of initial parameters (i.e. coefficients of the basis functions). |
beta_dispersion |
Dispersion parameter, only used for Beta distribution and will be the same for all observations. |
opt_method |
The optimization method to be used. See
|
opt_itnmax |
Optional argument giving the maximum number of iterations
for the corresponding method. See |
is_parallel |
Logical, indicating if code should be run in parallel. |
no_cores |
Number of cores to be used, default is max_no_cores - 1. |
... |
Additional parameters. |
An object of class infer_profiles_mle_
"obs_model" with the
following elements:
W
: An Nx(M+1) matrix with the
optimized parameter values. Each row of the matrix corresponds to each
element of the list X; if X is a matrix, then N = 1. The columns are of the
same length as the parameter vector w (i.e. number of basis functions).
basis
: The basis object.
nll_feat
: NLL fit
feature.
rmse_feat
: RMSE fit feature.
coverage_feat
: CpG coverage feature.
The beta regression model is based on alternative parameterization of the beta density in terms of the mean and dispersion parameter: https://cran.r-project.org/web/packages/betareg/ . For modelling details for Binomial/Bernoulli observation model check the paper for BPRMeth: https://academic.oup.com/bioinformatics/article/32/17/i405/2450762 .
C.A.Kapourani [email protected]
create_basis
, infer_profiles_vb
,
infer_profiles_gibbs
, create_region_object
# Example of optimizing parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- infer_profiles_mle(X = binomial_data, model = "binomial", basis = basis, is_parallel = FALSE, opt_itnmax = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- infer_profiles_mle(X = beta_data, model = "beta", basis = basis, is_parallel = FALSE, opt_itnmax = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- infer_profiles_mle(X = gaussian_data[[1]], model = "gaussian", basis = basis, is_parallel = FALSE, opt_itnmax = 10)
# Example of optimizing parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- infer_profiles_mle(X = binomial_data, model = "binomial", basis = basis, is_parallel = FALSE, opt_itnmax = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- infer_profiles_mle(X = beta_data, model = "beta", basis = basis, is_parallel = FALSE, opt_itnmax = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- infer_profiles_mle(X = gaussian_data[[1]], model = "gaussian", basis = basis, is_parallel = FALSE, opt_itnmax = 10)
General purpose functions for inferring latent profiles for different observation models using Variational Bayes (VB). Current observation models are: 'bernoulli', 'binomial' or 'gaussian'.
infer_profiles_vb( X, model = NULL, basis = NULL, H = NULL, w = NULL, gaussian_l = 50, alpha_0 = 0.5, beta_0 = 0.1, vb_max_iter = 100, epsilon_conv = 1e-05, is_parallel = FALSE, no_cores = NULL, is_verbose = FALSE, ... )
infer_profiles_vb( X, model = NULL, basis = NULL, H = NULL, w = NULL, gaussian_l = 50, alpha_0 = 0.5, beta_0 = 0.1, vb_max_iter = 100, epsilon_conv = 1e-05, is_parallel = FALSE, no_cores = NULL, is_verbose = FALSE, ... )
X |
The input data, either a |
model |
Observation model name as character string. It can be either 'bernoulli', 'binomial', 'beta' or 'gaussian'. |
basis |
A 'basis' object. E.g. see |
H |
Optional, design matrix of the input data X. If NULL, H will be computed inside the function. |
w |
A vector of initial parameters (i.e. coefficients of the basis functions). If NULL, it will be initialized inside the function. |
gaussian_l |
Noise precision parameter, only used when having "gaussian" observation model. |
alpha_0 |
Hyperparameter: shape parameter for Gamma distribution. A Gamma distribution is used as prior for the precision parameter tau. |
beta_0 |
Hyperparameter: rate parameter for Gamma distribution. A Gamma distribution is used as prior for the precision parameter tau. |
vb_max_iter |
Integer denoting the maximum number of VB iterations. |
epsilon_conv |
Numeric denoting the convergence threshold for VB. |
is_parallel |
Logical, indicating if code should be run in parallel. |
no_cores |
Number of cores to be used, default is max_no_cores - 1. |
is_verbose |
Logical, print results during VB iterations. |
... |
Additional parameters. |
An object of class infer_profiles_vb_
"obs_model" with the
following elements:
W
: An Nx(M+1) matrix with the
optimized parameter values. Each row of the matrix corresponds to each
element of the list X; if X is a matrix, then N = 1. The columns are of the
same length as the parameter vector w (i.e. number of basis functions).
W_Sigma
: A list with covariance matrices for each element row
in W.
basis
: The basis object.
nll_feat
: NLL
fit feature.
rmse_feat
: RMSE fit feature.
coverage_feat
: CpG coverage feature.
lb_feat
:
Lower Bound feature.
The modelling and mathematical details for inferring profiles using mean-field variational inference are explained here: http://rpubs.com/cakapourani/ . More specifically:
For Binomial/Bernoulli observation model check: http://rpubs.com/cakapourani/variational-bayes-bpr
For Gaussian observation model check: http://rpubs.com/cakapourani/variational-bayes-lr
C.A.Kapourani [email protected]
create_basis
, infer_profiles_mle
,
predict_expr
, create_region_object
# Example of inferring parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- infer_profiles_vb(X = binomial_data, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- infer_profiles_vb(X = gaussian_data, model = "gaussian", basis = basis, is_parallel = FALSE, vb_max_iter = 10)
# Example of inferring parameters for synthetic data using 3 RBFs basis <- create_rbf_object(M=3) out <- infer_profiles_vb(X = binomial_data, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 10) #------------------------------------- basis <- create_rbf_object(M=3) out <- infer_profiles_vb(X = gaussian_data, model = "gaussian", basis = basis, is_parallel = FALSE, vb_max_iter = 10)
This functions makes predictions of gene expression levels using a model trained on methylation features extracted from genomic regions.
inner_predict_model_expr(model, test, is_summary = TRUE)
inner_predict_model_expr(model, test, is_summary = TRUE)
model |
The fitted regression model, i.e. the output of
|
test |
The testing data. |
is_summary |
Logical, print the summary statistics. |
A list containing the following elements:
test_pred
: The predicted values for the test data.
test_errors
: The test error metrics.
C.A.Kapourani [email protected]
# Create synthetic data train_data <- data.frame(x = rnorm(20), y=rnorm(20, 1, 4)) test_data <- data.frame(x = rnorm(40), y=rnorm(20, 1, 3)) # Train the model train_model <- inner_train_model_expr(formula = y~., model_name="svm", train = train_data) # Make predictions res <- inner_predict_model_expr(model = train_model$model, test = test_data)
# Create synthetic data train_data <- data.frame(x = rnorm(20), y=rnorm(20, 1, 4)) test_data <- data.frame(x = rnorm(40), y=rnorm(20, 1, 3)) # Train the model train_model <- inner_train_model_expr(formula = y~., model_name="svm", train = train_data) # Make predictions res <- inner_predict_model_expr(model = train_model$model, test = test_data)
This function trains a regression model for predicting gene expression levels by taking as input the higher order methylation features extracted from specific genomic regions.
inner_train_model_expr( formula = NULL, model_name = "lm", train, is_summary = TRUE )
inner_train_model_expr( formula = NULL, model_name = "lm", train, is_summary = TRUE )
formula |
An object of class |
model_name |
A string denoting the regression model. Currently,
available models are: |
train |
The training data. |
is_summary |
Logical, print the summary statistics. |
A list containing the following elements:
formula
: The formula that was used.
gex_model
: The
fitted model.
train_pred
The predicted values for the training
data.
train_errors
: The training error metrics.
C.A.Kapourani [email protected]
# Create synthetic data train_data <- data.frame(x = rnorm(20), y=rnorm(20, 1, 4)) res <- inner_train_model_expr(formula = y~., train = train_data) # Using a different model res <- inner_train_model_expr(model_name = "randomForest", train = train_data)
# Create synthetic data train_data <- data.frame(x = rnorm(20), y=rnorm(20, 1, 4)) res <- inner_train_model_expr(formula = y~., train = train_data) # Using a different model res <- inner_train_model_expr(model_name = "randomForest", train = train_data)
A synthetic dataset containinig 600 entries.
meth_data
meth_data
A list with 600 elements, where each element element is an L x 3 matrix of observations, where:
locations of obseravtions
total trials
number of successes
Synthetic bulk methylation data
gex_data
, bernoulli_data
,
binomial_data
, beta_data
,
gaussian_data
(DEPRECATED) boxplot_cluster_gex
creates a boxplot of
clustered gene expression levels which depend on the clustered methylation
profiles. Each colour denotes a different cluster.
old_boxplot_cluster_gex( bpr_cluster_obj, gex, main_lab = "Gene expression levels" )
old_boxplot_cluster_gex( bpr_cluster_obj, gex, main_lab = "Gene expression levels" )
bpr_cluster_obj |
The output of the |
gex |
The vector of gene expression data for each promoter region. |
main_lab |
The title of the plot |
The figure to be plotted in the device.
C.A.Kapourani [email protected]
old_plot_cluster_prof
,
old_plot_fitted_profiles
# Cluster methylation profiles using 4 RBFs obs <- meth_data basis <- create_rbf_object(M = 4) res <- bpr_cluster_wrap(x = obs, K = 3, em_max_iter = 2, opt_itnmax = 3, init_opt_itnmax = 2, is_parallel = FALSE) # Create the plot old_boxplot_cluster_gex(bpr_cluster_obj = res, gex = gex_data)
# Cluster methylation profiles using 4 RBFs obs <- meth_data basis <- create_rbf_object(M = 4) res <- bpr_cluster_wrap(x = obs, K = 3, em_max_iter = 2, opt_itnmax = 3, init_opt_itnmax = 2, is_parallel = FALSE) # Create the plot old_boxplot_cluster_gex(bpr_cluster_obj = res, gex = gex_data)
(DEPRECATED) plot_cluster_prof
creates a plot of cluster
methylation profiles, where each colour denotes a different cluster.
old_plot_cluster_prof( bpr_cluster_obj, main_lab = "Clustered methylation profiles" )
old_plot_cluster_prof( bpr_cluster_obj, main_lab = "Clustered methylation profiles" )
bpr_cluster_obj |
The output of the |
main_lab |
The title of the plot |
The figure to be plotted in the device.
C.A.Kapourani [email protected]
old_plot_fitted_profiles
,
old_boxplot_cluster_gex
# Cluster methylation profiles using 4 RBFs obs <- meth_data basis <- create_rbf_object(M = 4) res <- bpr_cluster_wrap(x = obs, K = 3, em_max_iter = 2, opt_itnmax = 3, init_opt_itnmax = 2, is_parallel = FALSE) # Create the plot old_plot_cluster_prof(bpr_cluster_obj = res)
# Cluster methylation profiles using 4 RBFs obs <- meth_data basis <- create_rbf_object(M = 4) res <- bpr_cluster_wrap(x = obs, K = 3, em_max_iter = 2, opt_itnmax = 3, init_opt_itnmax = 2, is_parallel = FALSE) # Create the plot old_plot_cluster_prof(bpr_cluster_obj = res)
(DEPRECATED) plot_fitted_profiles
is a simple function
for plotting the methylation data across a give region, together with the
fit of the methylation profiles.
old_plot_fitted_profiles( region, X, fit_prof, fit_mean = NULL, title = "Gene promoter", ... )
old_plot_fitted_profiles( region, X, fit_prof, fit_mean = NULL, title = "Gene promoter", ... )
region |
Promoter region number |
X |
Methylation data observations |
fit_prof |
Fitted profile |
fit_mean |
Fitted mean function |
title |
Title of the plot |
... |
Additional parameters |
The figure to be plotted in the device.
C.A.Kapourani [email protected]
# Fit methylation profiles using 3 RBFs obs <- meth_data y <- gex_data basis <- create_rbf_object(M = 3) out <- bpr_predict_wrap(x = obs, y = y, basis = basis, is_parallel = FALSE, opt_itnmax = 5) # Create the plot old_plot_fitted_profiles(region = 16, X = meth_data, fit_prof = out)
# Fit methylation profiles using 3 RBFs obs <- meth_data y <- gex_data basis <- create_rbf_object(M = 3) out <- bpr_predict_wrap(x = obs, y = y, basis = basis, is_parallel = FALSE, opt_itnmax = 5) # Create the plot old_plot_fitted_profiles(region = 16, X = meth_data, fit_prof = out)
Partition bulk methylation dataset to training and test set
partition_bulk_dataset(dt_obj, cpg_train_prcg = 0.5)
partition_bulk_dataset(dt_obj, cpg_train_prcg = 0.5)
dt_obj |
BPRMeth data 'region_object' object |
cpg_train_prcg |
Fraction of CpGs in each genomic region to keep for training set. |
The BPRMeth data 'region_object' object with the following changes. The 'met' element will now contain only the 'training' data. An additional element called 'met_test' will store the data that will be used during testing to evaluate the imputation performance. These data will not be seen from BPRMeth during inference.
C.A.Kapourani [email protected]
create_region_object
, read_met
,
impute_bulk_met
# Partition the synthetic data from BPRMeth package dt <- partition_bulk_dataset(encode_met)
# Partition the synthetic data from BPRMeth package dt <- partition_bulk_dataset(encode_met)
Function for plotting the clusterd methylation profiles across a given region where each colour denotes a different cluster.
plot_cluster_profiles( cluster_obj, title = "Clustered profiles", x_axis = "genomic region", y_axis = "met level", x_labels = c("Upstream", "", "Centre", "", "Downstream"), ... )
plot_cluster_profiles( cluster_obj, title = "Clustered profiles", x_axis = "genomic region", y_axis = "met level", x_labels = c("Upstream", "", "Centre", "", "Downstream"), ... )
cluster_obj |
Clustered profiles object, i.e. output from
|
title |
Plot title |
x_axis |
x axis label |
y_axis |
x axis label |
x_labels |
x axis ticks labels |
... |
Additional parameters |
A ggplot2 object.
C.A.Kapourani [email protected]
plot_infer_profiles
,
plot_predicted_expr
, boxplot_cluster_expr
# Cluster methylation profiles using 3 RBFs basis <- create_rbf_object(M = 3) # Perform clustering cl_obj <- cluster_profiles_vb(X = encode_met$met, K = 3, model = "binomial", basis = basis, vb_max_iter = 5) # Create plot g <- plot_cluster_profiles(cluster_obj = cl_obj)
# Cluster methylation profiles using 3 RBFs basis <- create_rbf_object(M = 3) # Perform clustering cl_obj <- cluster_profiles_vb(X = encode_met$met, K = 3, model = "binomial", basis = basis, vb_max_iter = 5) # Create plot g <- plot_cluster_profiles(cluster_obj = cl_obj)
Function for plotting the inferred methylation profiles across a
given region, and optionally the mean methylation rate together with the
observed methylation data, using ggplot2
.
plot_infer_profiles( region = 1, obj_prof, obj_mean = NULL, obs = NULL, title = "Inferred profiles", x_axis = "genomic region", y_axis = "met level", x_labels = c("Upstream", "", "Centre", "", "Downstream"), ... )
plot_infer_profiles( region = 1, obj_prof, obj_mean = NULL, obs = NULL, title = "Inferred profiles", x_axis = "genomic region", y_axis = "met level", x_labels = c("Upstream", "", "Centre", "", "Downstream"), ... )
region |
Genomic region number |
obj_prof |
Inferred profile, i.e. output from
|
obj_mean |
Inferred mean function, i.e. output from
|
obs |
Methylation data observations, if a list, will extract the
specific |
title |
Plot title |
x_axis |
x axis label |
y_axis |
x axis label |
x_labels |
x axis ticks labels |
... |
Additional parameters |
A ggplot2 object.
C.A.Kapourani [email protected]
plot_predicted_expr
,
plot_cluster_profiles
,
boxplot_cluster_expr
# Fit methylation profiles using 5 RBFs basis <- create_rbf_object(M = 5) prof <- infer_profiles_vb(X = encode_met$met, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 5) # Create the plot g <- plot_infer_profiles(region = 16, obj_prof = prof, obs = encode_met$met)
# Fit methylation profiles using 5 RBFs basis <- create_rbf_object(M = 5) prof <- infer_profiles_vb(X = encode_met$met, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 5) # Create the plot g <- plot_infer_profiles(region = 16, obj_prof = prof, obs = encode_met$met)
plot_predicted_expr
creates a scatter plot of predicted
gene expression values on the x-axis versus the measured gene expression
values on the y-axis.
plot_predicted_expr( pred_obj, title = "Predicted expression", is_margins = FALSE )
plot_predicted_expr( pred_obj, title = "Predicted expression", is_margins = FALSE )
pred_obj |
The output of the |
title |
The title of the plot. |
is_margins |
Use specified margins or not. |
A ggplot2 object.
C.A.Kapourani [email protected]
plot_infer_profiles
,
plot_cluster_profiles
, boxplot_cluster_expr
# Fit methylation profiles using 5 RBFs basis <- create_rbf_object(M = 5) prof <- infer_profiles_vb(X = encode_met$met, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 5) # Predict expression pred_obj <- predict_expr(prof_obj = prof, expr = encode_expr, anno = encode_met$anno, model_name = "lm", is_summary = FALSE) # Create plot g <- plot_predicted_expr(pred_obj = pred_obj)
# Fit methylation profiles using 5 RBFs basis <- create_rbf_object(M = 5) prof <- infer_profiles_vb(X = encode_met$met, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 5) # Predict expression pred_obj <- predict_expr(prof_obj = prof, expr = encode_expr, anno = encode_met$anno, model_name = "lm", is_summary = FALSE) # Create plot g <- plot_predicted_expr(pred_obj = pred_obj)
(DEPRECATED) pool_bs_seq_rep
reads and pools replicate
methylation data from BS-Seq experiments that are either in Encode RRBS or
Bismark Cov format. Read the Important section below on when to use this
function.
pool_bs_seq_rep(files, file_format = "encode_rrbs", chr_discarded = NULL)
pool_bs_seq_rep(files, file_format = "encode_rrbs", chr_discarded = NULL)
files |
A vector of filenames containing replicate experiments. This can also be just a single replicate. |
file_format |
A string denoting the file format that the BS-Seq data are
stored. Current version allows " |
chr_discarded |
A vector with chromosome names to be discarded. |
A GRanges
object. The GRanges object contains two additional
metadata columns:
total_reads
: total reads mapped to
each genomic location.
meth_reads
: methylated reads mapped to
each genomic location.
These columns can be accessed as follows: granges_object$total_reads
Unless you want to create a different workflow when
processing the BS-Seq data, you should NOT call this function, since this
is a helper function. Instead you should call the
preprocess_bs_seq
function.
Information about the file formats can be found in the following links:
Bismark Cov format: http://rnbeads.mpi-inf.mpg.de/data/RnBeads.pdf
C.A.Kapourani [email protected]
read_bs_encode_haib
, preprocess_bs_seq
# Obtain the path to the file bs_file1 <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_file2 <- system.file("extdata", "rrbs.bed", package = "BPRMeth") # Concatenate the files bs_files <- c(bs_file1, bs_file2) # Pool the replicates pooled_data <- pool_bs_seq_rep(bs_files)
# Obtain the path to the file bs_file1 <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_file2 <- system.file("extdata", "rrbs.bed", package = "BPRMeth") # Concatenate the files bs_files <- c(bs_file1, bs_file2) # Pool the replicates pooled_data <- pool_bs_seq_rep(bs_files)
bpr_predict_wrap
is a function that wraps all the
necessary subroutines for performing prediction on gene expression levels.
Initially, it optimizes the parameters of the basis functions so as to
learn the methylation profiles. Then, uses the learned parameters /
coefficients of the basis functions as input features for performing
regression in order to predict the corresponding gene expression levels.
predict_expr( formula = NULL, prof_obj, expr, anno, model_name = "lm", train_ind = NULL, train_perc = 0.7, fit_feature = "RMSE", cov_feature = TRUE, is_summary = TRUE )
predict_expr( formula = NULL, prof_obj, expr, anno, model_name = "lm", train_ind = NULL, train_perc = 0.7, fit_feature = "RMSE", cov_feature = TRUE, is_summary = TRUE )
formula |
An object of class |
prof_obj |
Inferred profiles object. This in general will be the output of 'infer_profiles_'(inference_meth) function. |
expr |
Gene expression data with two columns in |
anno |
Annotation data as a |
model_name |
A string denoting the regression model. Currently,
available models are: |
train_ind |
Optional vector containing the indices for the train set. |
train_perc |
Optional parameter for defining the percentage of the dataset to be used for training set, the remaining will be the test set. |
fit_feature |
Use additional feature of how well the profile fits the methylation data. Either NULL for ignoring this feature or one of the following: 1) "RMSE" or 2) "NLL" which will be used as input features for predicting expression. |
cov_feature |
Logical, whether to use coverage as input feature for predictions. |
is_summary |
Logical, print the summary statistics. |
A 'predict_expr' object which consists of the following variables:
train: The training data.
test: The test data.
model
: The fitted regression model.
train_pred
The predicted values for the training data.
test_pred
The
predicted values for the test data.
train_errors
: The training
error metrics.
test_errors
: The test error metrics.
C.A.Kapourani [email protected]
infer_profiles_mle
, infer_profiles_vb
,
infer_profiles_gibbs
, create_basis
,
# Fit methylation profiles using 5 RBFs basis <- create_rbf_object(M = 5) prof <- infer_profiles_vb(X = encode_met$met, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 5) # Predict expression pred_obj <- predict_expr(prof_obj = prof, expr = encode_expr, anno = encode_met$anno, model_name = "lm", is_summary = FALSE)
# Fit methylation profiles using 5 RBFs basis <- create_rbf_object(M = 5) prof <- infer_profiles_vb(X = encode_met$met, model = "binomial", basis = basis, is_parallel = FALSE, vb_max_iter = 5) # Predict expression pred_obj <- predict_expr(prof_obj = prof, expr = encode_expr, anno = encode_met$anno, model_name = "lm", is_summary = FALSE)
(DEPRECATED) preprocess_bs_seq
is a general function for
reading and preprocessing BS-Seq data. If a vector of files is given, these
are considered as replicates and are pooled together. Finally, noisy reads
are discarded.
preprocess_bs_seq( files, file_format = "encode_rrbs", chr_discarded = NULL, min_bs_cov = 4, max_bs_cov = 1000 )
preprocess_bs_seq( files, file_format = "encode_rrbs", chr_discarded = NULL, min_bs_cov = 4, max_bs_cov = 1000 )
files |
A vector of filenames containing replicate experiments. This can also be just a single replicate. |
file_format |
A string denoting the file format that the BS-Seq data are
stored. Current version allows " |
chr_discarded |
A vector with chromosome names to be discarded. |
min_bs_cov |
The minimum number of reads mapping to each CpG site. CpGs with less reads will be considered as noise and will be discarded. |
max_bs_cov |
The maximum number of reads mapping to each CpG site. CpGs with more reads will be considered as noise and will be discarded. |
A GRanges
object. The GRanges object contains two additional
metadata columns:
total_reads
: total reads mapped to
each genomic location.
meth_reads
: methylated reads mapped to
each genomic location.
These columns can be accessed as follows:
granges_object$total_reads
Information about the file formats can be found in the following links:
Bismark Cov format: http://rnbeads.mpi-inf.mpg.de/data/RnBeads.pdf
C.A.Kapourani [email protected]
read_bs_encode_haib
pool_bs_seq_rep
# Obtain the path to the files bs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_data <- preprocess_bs_seq(bs_file, file_format = "encode_rrbs")
# Obtain the path to the files bs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_data <- preprocess_bs_seq(bs_file, file_format = "encode_rrbs")
(DEPRECATED) preprocess_final_HTS_data
performs a final
filtering and preprocessing on the data for use in downstream analysis.
These include, removing noisy gene expression data, removing or not
un-expressed genes and log2-transorming of the FPKM values.
preprocess_final_HTS_data( methyl_region, prom_reg, rna_data, gene_log2_transf = TRUE, gene_outl_thresh = TRUE, gex_outlier = 300 )
preprocess_final_HTS_data( methyl_region, prom_reg, rna_data, gene_log2_transf = TRUE, gene_outl_thresh = TRUE, gex_outlier = 300 )
methyl_region |
Methylation region data, which are the output of the
" |
prom_reg |
A |
rna_data |
A |
gene_log2_transf |
Logical, whether or not to log2 transform the gene expression data. |
gene_outl_thresh |
Logical, whehter or not to remove outlier gene expression data. |
gex_outlier |
Numeric, denoting the threshold above of which the gene expression data (before the log2 transformation) are considered as noise. |
An object which contains following information:
methyl_region
: The subset of promoter methylation region data after
the filtering process.
gex
: A vectoring storing only the
corresponding gene expression values for each promoter region.
rna_data
: The corresponding gene expression data stored as a GRanges
object.
C.A.Kapourani [email protected]
read_rna_encode_caltech
process_haib_caltech_wrap
# Obtain the path to the BS file and then read it bs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_data <- read_bs_encode_haib(bs_file) # Create promoter regions rnaseq_file <- system.file("extdata", "rnaseq.bed", package = "BPRMeth") annot_data <- read_rna_encode_caltech(rnaseq_file) prom_region <- create_anno_region(annot_data) # Create methylation regions methyl_reg <- create_region_object(bs_data, prom_region, filter_empty_region = FALSE) # Keep only covered genomic regions cov_ind <- which(!is.na(methyl_reg)) methyl_reg <- methyl_reg[cov_ind] prom_region <- prom_region[cov_ind, ] annot_data <- annot_data[cov_ind, ] # Finally preprocess the HTS data res <- preprocess_final_HTS_data(methyl_reg, prom_region, annot_data)
# Obtain the path to the BS file and then read it bs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_data <- read_bs_encode_haib(bs_file) # Create promoter regions rnaseq_file <- system.file("extdata", "rnaseq.bed", package = "BPRMeth") annot_data <- read_rna_encode_caltech(rnaseq_file) prom_region <- create_anno_region(annot_data) # Create methylation regions methyl_reg <- create_region_object(bs_data, prom_region, filter_empty_region = FALSE) # Keep only covered genomic regions cov_ind <- which(!is.na(methyl_reg)) methyl_reg <- methyl_reg[cov_ind] prom_region <- prom_region[cov_ind, ] annot_data <- annot_data[cov_ind, ] # Finally preprocess the HTS data res <- preprocess_final_HTS_data(methyl_reg, prom_region, annot_data)
(DEPRECATED) process_haib_caltech_wrap
is a wrapper
method for processing HTS data and returning the methylation promoter
regions and the corresponding gene expression data for those promoter
regions. Note that the format of BS-Seq data should be in the Encode Haib
bed format and for the RNA-Seq data in Encode Caltech bed format.
process_haib_caltech_wrap( bs_files, rna_files, chrom_size_file = NULL, chr_discarded = NULL, upstream = -7000, downstream = 7000, min_bs_cov = 4, max_bs_cov = 1000, cpg_density = 10, sd_thresh = 0.1, ignore_strand = TRUE, gene_log2_transf = TRUE, gene_outl_thresh = TRUE, gex_outlier = 300, fmin = -1, fmax = 1 )
process_haib_caltech_wrap( bs_files, rna_files, chrom_size_file = NULL, chr_discarded = NULL, upstream = -7000, downstream = 7000, min_bs_cov = 4, max_bs_cov = 1000, cpg_density = 10, sd_thresh = 0.1, ignore_strand = TRUE, gene_log2_transf = TRUE, gene_outl_thresh = TRUE, gex_outlier = 300, fmin = -1, fmax = 1 )
bs_files |
Filename (or vector of filenames if there are replicates) of the BS-Seq '.bed' formatted data to read values from. |
rna_files |
Filename of the RNA-Seq '.bed' formatted data to read values from. Currently, this version does not support pooling RNA-Seq replicates. |
chrom_size_file |
Optional filename containing genome chromosome sizes. |
chr_discarded |
A vector with chromosome names to be discarded. |
upstream |
Integer defining the length of bp upstream of TSS for creating the promoter region. |
downstream |
Integer defining the length of bp downstream of TSS for creating the promoter region. |
min_bs_cov |
The minimum number of reads mapping to each CpG site. CpGs with less reads will be considered as noise and will be discarded. |
max_bs_cov |
The maximum number of reads mapping to each CpG site. CpGs with more reads will be considered as noise and will be discarded. |
cpg_density |
Optional integer defining the minimum number of CpGs that
have to be in a methylated region. Regions with less than |
sd_thresh |
Optional numeric defining the minimum standard deviation of the methylation change in a region. This is used to filter regions with no methylation change. |
ignore_strand |
Logical, whether or not to ignore strand information. |
gene_log2_transf |
Logical, whether or not to log2 transform the gene expression data. |
gene_outl_thresh |
Logical, whehter or not to remove outlier gene expression data. |
gex_outlier |
Numeric, denoting the threshold above of which the gene expression data (before the log2 transformation) are considered as noise. |
fmin |
Optional minimum range value for region location scaling. Under this version, this parameter should be left to its default value. |
fmax |
Optional maximum range value for region location scaling. Under this version, this parameter should be left to its default value. |
A processHTS
object which contains following information:
methyl_region
: A list containing methylation data,
where each entry in the list is an dimensional matrix,
where
denotes the number of CpGs found in region
i
. The
columns contain the following information:
1st column: Contains the locations of CpGs relative to TSS. Note that the actual locations are scaled to the (fmin, fmax) region.
2nd column: Contains the total reads of each CpG in the corresponding location.
3rd column: Contains the methylated reads each CpG in the corresponding location.
gex
: A vector containing the corresponding gene
expression levels for each entry of the methyl_region
list.
prom_region
: A GRanges
object
containing corresponding annotated promoter regions for each entry of the
methyl_region
list. The GRanges object has one additional metadata
column named tss
, which stores the TSS of each promoter.
rna_data
: A GRanges
object containing
the corresponding RNA-Seq data for each entry of the methyl_region
list. The GRanges object has three additional metadata columns which are
explained in read_rna_encode_caltech
upstream
:
Integer defining the length of bp upstream of TSS.
downstream
: Integer defining the length of bp downstream of TSS.
cpg_density
: Integer defining the minimum number of CpGs that
have to be in a methylated region. Regions with less than n
CpGs are
discarded.
sd_thresh
: Numeric defining the minimum standard
deviation of the methylation change in a region. This is used to filter
regions with no methylation change.
fmin
: Minimum range
value for region location scaling.
fmax
: Maximum range value
for region location scaling.
C.A.Kapourani [email protected]
# Obtain the path to the files rrbs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") rnaseq_file <- system.file("extdata", "rnaseq.bed", package = "BPRMeth") proc_data <- process_haib_caltech_wrap(rrbs_file, rnaseq_file)
# Obtain the path to the files rrbs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") rnaseq_file <- system.file("extdata", "rnaseq.bed", package = "BPRMeth") proc_data <- process_haib_caltech_wrap(rrbs_file, rnaseq_file)
read_anno
reads a file containing annotation data, which
will be used to select genomic regions of interest for downstream analysis.
The annotation file format is the following: "chromosome", "start", "end",
"strand", "id", "name" (optional). Read the Important section below for
more details.
read_anno( file, chrom_size_file = NULL, chr_discarded = NULL, is_centre = FALSE, is_window = TRUE, upstream = -5000, downstream = 5000, is_anno_region = TRUE, delimiter = "\t" )
read_anno( file, chrom_size_file = NULL, chr_discarded = NULL, is_centre = FALSE, is_window = TRUE, upstream = -5000, downstream = 5000, is_anno_region = TRUE, delimiter = "\t" )
file |
File name. |
chrom_size_file |
Optional file name to read genome chromosome sizes. |
chr_discarded |
Optional vector with chromosomes to be discarded. |
is_centre |
Logical, whether 'start' and 'end' locations are pre-centred. If TRUE, the mean of the locations will be chosen as centre. If FALSE, the 'start' will be chosen as the center; e.g. for genes the 'start' denotes the TSS and we use this as centre to obtain K-bp upstream and downstream of TSS. |
is_window |
Whether to consider a predefined window region around centre. If TRUE, then 'upstream' and 'downstream' parameters are used, otherwise we consider the whole region from start to end location. |
upstream |
Integer defining the length of bp upstream of 'centre' for creating the genomic region. If is_window = FALSE, this parameter is ignored. |
downstream |
Integer defining the length of bp downstream of 'centre' for creating the genomic region. If is_window = FALSE, this parameter is ignored. |
is_anno_region |
Logical, whether or not to create genomic region. It
should be set to TRUE, if you want to use the object as input to
|
delimiter |
Delimiter format the columns are splitted. Default is tab. |
A GRanges
object.
The GRanges object contains two or three additional metadata columns:
id
: Feature ID, e.g. Ensembl IDs for each gene.
centre
: The central (middle) location of the genomic region.
This is required when transforming 'methylation regions' in the (-1, 1)
interval, the 'centre' would be at 0.
name
(Optional) the
feature name.
These columns can be accessed as follows:
granges_obj$id
When having strand information, and we are in the antisense strand ("-"), the 'start' is automatically switched with the 'end' location.
By default columns are considered in tab delimited format.
The "name" feature is optional.
When there is no "strand" info, this can be set to "*".
C.A.Kapourani [email protected]
read_met
, create_anno_region
,
create_region_object
# Obtain the path to files file <- system.file("extdata", "dummy_anno.bed", package = "BPRMeth") anno_dt <- read_anno(file, chr_discarded = c("X")) # Extract feature id anno_ids <- anno_dt$id
# Obtain the path to files file <- system.file("extdata", "dummy_anno.bed", package = "BPRMeth") anno_dt <- read_anno(file, chr_discarded = c("X")) # Extract feature id anno_ids <- anno_dt$id
(DEPRECATED) read_bs_encode_haib
reads a file containing
methylation data from BS-Seq experiments using the scan
function. The BS-Seq file should be in ENCODE HAIB bed
format. Read
the Important section below on when to use this function.
read_bs_encode_haib(file, chr_discarded = NULL, is_GRanges = TRUE)
read_bs_encode_haib(file, chr_discarded = NULL, is_GRanges = TRUE)
file |
The name of the file to read data values from. |
chr_discarded |
A vector with chromosome names to be discarded. |
is_GRanges |
Logical: if TRUE a GRanges object is returned, otherwise a data.frame object is returned. |
A GRanges
object if is_GRanges
is
TRUE, otherwise a data.table
object.
The GRanges object contains two additional metadata columns:
total_reads
: total reads mapped to each genomic location.
meth_reads
: methylated reads mapped to each genomic location.
These columns can be accessed as follows:
granges_object$total_reads
Unless you want to create a different workflow when
processing the BS-Seq data, you should NOT call this function, since this
is a helper function. Instead you should call the
preprocess_bs_seq
function.
C.A.Kapourani [email protected]
pool_bs_seq_rep
, preprocess_bs_seq
# Obtain the path to the file and then read it bs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_data <- read_bs_encode_haib(bs_file)
# Obtain the path to the file and then read it bs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth") bs_data <- read_bs_encode_haib(bs_file)
read_chrom_size
reads a file containing genome chromosome
sizes using the fread
function.
read_chrom_size(file, delimiter = "\t")
read_chrom_size(file, delimiter = "\t")
file |
File name |
delimiter |
Delimiter format the columns are splitted. Default is tab. |
A data.table
object.
C.A.Kapourani [email protected]
chr_file <- system.file("extdata", "hg19.chr.sizes", package = "BPRMeth") chr_data <- read_chrom_size(chr_file) # Extract the size of chr1 chr_data[1]
chr_file <- system.file("extdata", "hg19.chr.sizes", package = "BPRMeth") chr_data <- read_chrom_size(chr_file) # Extract the size of chr1 chr_data[1]
read_expr
reads a file containing expression data. It
should contain two columns: "id", "expr" and by default columns are
considered in tab delimited format.
read_expr(file, log2_transf = FALSE, delimiter = "\t")
read_expr(file, log2_transf = FALSE, delimiter = "\t")
file |
File name |
log2_transf |
Logical, whether to log2 transform the expression data. |
delimiter |
Delimiter format the columns are splitted. Default is tab. |
A data.table
object.
C.A.Kapourani [email protected]
# Obtain the path to files file <- system.file("extdata", "dummy_expr.bed", package = "BPRMeth") expr_dt <- read_expr(file) # Extract feature id expr_ids <- expr_dt$id
# Obtain the path to files file <- system.file("extdata", "dummy_expr.bed", package = "BPRMeth") expr_dt <- read_expr(file) # Extract feature id expr_ids <- expr_dt$id
read_met
reads a file containing methylation data using
the fread
function. Since there are different technologies, e.g.
array, bulk BS-Seq, scBS-Seq, and still there is no standard file format,
different options are available, check the Important section below on the
file format for each type you choose. If a file format is not availabe, you
need to read the file and create a GRanges
object, where the data
are ordered by chromosome and genomic location.
read_met( file, type = "sc_seq", strand_info = FALSE, chr_discarded = NULL, min_bulk_cov = 4, max_bulk_cov = 1000, delimiter = "\t" )
read_met( file, type = "sc_seq", strand_info = FALSE, chr_discarded = NULL, min_bulk_cov = 4, max_bulk_cov = 1000, delimiter = "\t" )
file |
File name. |
type |
Type of technology as character. Either "bulk_seq", "sc_seq" or "array". Check the Important section below for more details. |
strand_info |
Logical, whether or not the file contains strand information. |
chr_discarded |
Optional vector with chromosomes to be discarded. |
min_bulk_cov |
Minimum number of reads mapping to each CpG site. Used only for "bulk_seq" and CpGs with less reads will be discarded as noise. |
max_bulk_cov |
Maximum number of reads mapping to each CpG site. Used only for "bulk_seq" and CpGs with less reads will be discarded as noise. |
delimiter |
Delimiter format the columns are splitted. Default is tab. |
A GRanges
object.
The GRanges object contains one or two additional metadata columns:
met
: Methylation level.
For "array" this is the Beta or M-values
For "sc_seq" this is either 0 or 1 (unmethylated or methylated)
For "bulk_seq" this contains the number of methylated reads for each CpG.
total
: Total number
of reads for each CpG. Present only for "bulk_seq" type.
These columns
can be accessed as follows: granges_obj$met
Depending on technology type
we assume different
file formats.
"array"
File format: "chromosome",
"start", "strand" (optional), "met" .
"sc_seq"
File format:
"chromosome", "start", "strand" (optional), "met" . Where "met" should
contain only 1s or 0s.
"bulk_seq"
File format:
"chromosome", "start", "strand" (optional), "met", "total".
By default columns are considered in tab delimited format.
C.A.Kapourani [email protected]
read_anno
, read_expr
,
create_region_object
# Obtain the path to files file <- system.file("extdata", "dummy_met.bed", package = "BPRMeth") met_dt <- read_met(file) # Extract methylation level met <- met_dt$met
# Obtain the path to files file <- system.file("extdata", "dummy_met.bed", package = "BPRMeth") met_dt <- read_met(file) # Extract methylation level met <- met_dt$met
(DEPRECATED) read_rna_encode_caltech
reads a file
containing promoter annotation data together with gene expression levels
from RNA-Seq experiments using the scan
function. The RNA-Seq
file should be in ENCODE Caltech bed
format, e.g. use gtf2bed
tool if your initial file is in gtf
format.
read_rna_encode_caltech(file, chr_discarded = NULL, is_GRanges = TRUE)
read_rna_encode_caltech(file, chr_discarded = NULL, is_GRanges = TRUE)
file |
The name of the file to read data values from. |
chr_discarded |
A vector with chromosome names to be discarded. |
is_GRanges |
Logical: if TRUE a GRanges object is returned, otherwise a data.frame object is returned. |
A GRanges
object if is_GRanges
is TRUE, otherwise a
data.table
object.
The GRanges object contains three additional metadata columns:
ensembl_id
: Ensembl IDs of each gene promoter.
gene_name
: Gene name.
gene_fpkm
: Expression level in
FPKM.
These columns can be accessed as follows:
granges_object$ensembl_id
C.A.Kapourani [email protected]
read_chrom_size
, read_bs_encode_haib
# Obtain the path to the file and then read it rnaseq_file <- system.file("extdata", "rnaseq.bed", package = "BPRMeth") rna_data <- read_rna_encode_caltech(rnaseq_file)
# Obtain the path to the file and then read it rnaseq_file <- system.file("extdata", "rnaseq.bed", package = "BPRMeth") rna_data <- read_rna_encode_caltech(rnaseq_file)