Title: | Detection of cell-type-specific risk-CpG sites in epigenome-wide association studies |
---|---|
Description: | In epigenome-wide association studies, the measured signals for each sample are a mixture of methylation profiles from different cell types. The current approaches to the association detection only claim whether a cytosine-phosphate-guanine (CpG) site is associated with the phenotype or not, but they cannot determine the cell type in which the risk-CpG site is affected by the phenotype. We propose a solid statistical method, HIgh REsolution (HIRE), which not only substantially improves the power of association detection at the aggregated level as compared to the existing methods but also enables the detection of risk-CpG sites for individual cell types. The "HIREewas" R package is to implement HIRE model in R. |
Authors: | Xiangyu Luo <[email protected]>, Can Yang <[email protected]>, Yingying Wei <[email protected]> |
Maintainer: | Xiangyu Luo <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.25.0 |
Built: | 2024-11-18 03:24:22 UTC |
Source: | https://github.com/bioc/HIREewas |
In epigenome-wide association studies, the measured signals for each sample are a mixture of methylation profiles from different cell types. The current approaches to the association detection only claim whether a cytosine-phosphate-guanine (CpG) site is associated with the phenotype or not, but they cannot determine the cell type in which the risk-CpG site is affected by the phenotype. We propose a solid statistical method, HIgh REsolution (HIRE), which not only substantially improves the power of association detection at the aggregated level as compared to the existing methods but also enables the detection of risk-CpG sites for individual cell types. The "HIREewas" R package is to implement HIRE model in R.
Xiangyu Luo <[email protected]>, Can Yang <[email protected]>, Yingying Wei <[email protected]>
Maintainer: Xiangyu Luo <[email protected]>
Xiangyu Luo, Can Yang, Yingying Wei. Detection of cell-type-specific risk-CpG sites in epigenome wide association studies.
#Please see each function's example as well as the vignette.
#Please see each function's example as well as the vignette.
The HIRE function provides parameter estimates in the HIRE model and the p-values for the association of CpG sites with one phenotype in each individual cell type.
HIRE(Ometh, X, num_celltype, tol=10^(-5), num_iter=1000, alpha=0.01)
HIRE(Ometh, X, num_celltype, tol=10^(-5), num_iter=1000, alpha=0.01)
Ometh |
The observed methylation matrix (with continuous values between 0 and 1), where one row corresponds to a CpG site and one column represents a sample. |
X |
The observed phenotypes, where one row corresponds to a phenotype and one column represents a sample. |
num_celltype |
The cell type number (an integer) that needs to be specified by the users. |
tol |
The relative tolerance used to determine when the HIRE stops. When the ratio of the log observed-data likelihood difference to the log observed-data likelihood at last iteration in the absolute value is smaller than |
num_iter |
The maximum number of iterations (an integer). The default is 1000. |
alpha |
One threshold parameter in the Bonferroni correction to claim a significant cell-type-specific CpG site (a float point between 0 and 1, usually set less than 0.05). The default is 0.01. |
HIRE used the generalized EM algorithm, so the log observed-data likelihood value increases after each iteration.
HIRE returns a list with seven components.
P_t |
the cellular composition matrix, where one row corresponds to a cell type and one column represents a sample. |
mu_t |
the baseline cell-type-specific methylation profiles, where one row is a CpG site and one column is a cell type. |
beta_t |
the phenotype coefficient array with dimensionality being m (the CpG number) by K (the cell type number) by q (the phenotype number). The first dimension corresponds to CpG sites, the second dimension to cell types, and the third dimension to phenotypes. |
sig_sqErr_t |
the error variance vector withlength equal to the CpG site number. |
sig_sqTiss_t |
the cell-type-specific variance matrix, where one row is a CpG site and one column represents a cell type. |
pBIC |
the penalized BIC value, where the number of phenotype coefficients being zero depends on the parameter |
pvalues |
the p-value matrix, whose dimension is m (the CpG site number) by K*q (the cell type number times the phenotype number). In the p-values matrix, one row is a CpG site. The first K columns correspond to the p-value matrix of the phenotype 1, the second K columns corresponds to the p-value matrix of the phenotype 1, and so forth. |
Xiangyu Luo
################################################################################################ #Generate the EWAS data ################################################################################################ set.seed(05222018) #define a function to draw samples from a Dirichlet distribution rDirichlet <- function(alpha_vec){ num <- length(alpha_vec) temp <- rgamma(num, shape = alpha_vec, rate = 1) return(temp / sum(temp)) } n <- 180 #number of samples n1 <- 60 #number of controls n2 <- 120 #number of cases ################################################################################################# # K=3 ################################################################################################# m <- 2000 #number of CpG sites K <- 3 #underlying cell type number #methylation profiles #assume cell type 1 and cell type 2 are from the same lineage #cell type 1 methy1 <- rbeta(m,3,6) #cell type 2 methy2 <- methy1 + rnorm(m, sd=0.01) ind <- sample(seq_len(m), m/5) methy2[ind] <- rbeta(length(ind),3,6) #cell type 3 methy3 <- rbeta(m,3,6) mu <- cbind(methy1, methy2, methy3) #number of covariates p <- 2 #covariates / phenotype X <- rbind(c(rep(0, n1),rep(1, n2)), runif(n, min=20, max=50)) #set risk-CpG sites under each cell type for each phenotype beta <- array(0, dim=c(m,K,p)) #control vs case m_common <- 10 max_signal <- 0.15 min_signal <- 0.07 signs <- sample(c(-1,1), m_common*K, replace=TRUE) beta[seq_len(m_common),seq_len(K),1] <- signs * runif(m_common*K, min=min_signal, max=max_signal) m_seperate <- 10 signs <- sample(c(-1,1), m_seperate*2, replace=TRUE) beta[m_common+(seq_len(m_seperate)),seq_len(2),1] <- signs * runif(m_seperate*2, min=min_signal, max=max_signal) signs <- sample(c(-1,1), m_seperate, replace=TRUE) beta[m_common+m_seperate+(seq_len(m_seperate)),K,1] <- signs * runif(m_seperate, min=min_signal, max=max_signal) #age base <- 20 m_common <- 10 max_signal <- 0.015 min_signal <- 0.007 signs <- sample(c(-1,1), m_common*K, replace=TRUE) beta[base+seq_len(m_common),seq_len(K),2] <- signs * runif(m_common*K, min=min_signal, max=max_signal) m_seperate <- 10 signs <- sample(c(-1,1), m_seperate*2, replace=TRUE) beta[base+m_common+seq_len(m_seperate),seq_len(2),2] <- signs * runif(m_seperate*2, min=min_signal, max=max_signal) signs <- sample(c(-1,1), m_seperate, replace=TRUE) beta[base+m_common+m_seperate+seq_len(m_seperate),seq_len(K),2] <- signs * runif(m_seperate, min=min_signal, max=max_signal) #generate the cellular compositions P <- vapply(seq_len(n), function(i){ if(X[1,i]==0){ #if control rDirichlet(c(4,4, 2+X[2,i]/10)) }else{ rDirichlet(c(4,4, 5+X[2,i]/10)) } }, FUN.VALUE = rep(-1, 3)) #generate the observed methylation profiles Ometh <- NULL for(i in seq_len(n)){ utmp <- t(vapply(seq_len(m), function(j){ tmp1 <- colSums(X[ ,i] * t(beta[j, , ])) rnorm(K,mean=mu[j, ]+tmp1,sd=0.01) }, FUN.VALUE = rep(-1, K))) tmp2 <- colSums(P[ ,i] * t(utmp)) Ometh <- cbind(Ometh, tmp2 + rnorm(m, sd = 0.01)) } sum(Ometh > 1) Ometh[Ometh > 1] <- 1 sum(Ometh < 0) Ometh[Ometh < 0] <- 0 ################################################################################################ #Apply HIRE to the simulated EWAS data ################################################################################################ #return list by HIRE ret_list <- HIRE(Ometh, X, num_celltype=K) #case vs control #Visualize the association pattern with the case/control status in the first 100 CpG sites riskCpGpattern(ret_list$pvalues[seq_len(100), c(2,1,3)], main_title="Detected association pattern\n with disease status", hc_row_ind = FALSE) #c(2,1,3) was used because of the label switching #age #Visualize the association pattern with the age in the first 100 CpG sites riskCpGpattern(ret_list$pvalues[seq_len(100), K+c(2,1,3)], main_title="Detected association pattern\n with age", hc_row_ind = FALSE)
################################################################################################ #Generate the EWAS data ################################################################################################ set.seed(05222018) #define a function to draw samples from a Dirichlet distribution rDirichlet <- function(alpha_vec){ num <- length(alpha_vec) temp <- rgamma(num, shape = alpha_vec, rate = 1) return(temp / sum(temp)) } n <- 180 #number of samples n1 <- 60 #number of controls n2 <- 120 #number of cases ################################################################################################# # K=3 ################################################################################################# m <- 2000 #number of CpG sites K <- 3 #underlying cell type number #methylation profiles #assume cell type 1 and cell type 2 are from the same lineage #cell type 1 methy1 <- rbeta(m,3,6) #cell type 2 methy2 <- methy1 + rnorm(m, sd=0.01) ind <- sample(seq_len(m), m/5) methy2[ind] <- rbeta(length(ind),3,6) #cell type 3 methy3 <- rbeta(m,3,6) mu <- cbind(methy1, methy2, methy3) #number of covariates p <- 2 #covariates / phenotype X <- rbind(c(rep(0, n1),rep(1, n2)), runif(n, min=20, max=50)) #set risk-CpG sites under each cell type for each phenotype beta <- array(0, dim=c(m,K,p)) #control vs case m_common <- 10 max_signal <- 0.15 min_signal <- 0.07 signs <- sample(c(-1,1), m_common*K, replace=TRUE) beta[seq_len(m_common),seq_len(K),1] <- signs * runif(m_common*K, min=min_signal, max=max_signal) m_seperate <- 10 signs <- sample(c(-1,1), m_seperate*2, replace=TRUE) beta[m_common+(seq_len(m_seperate)),seq_len(2),1] <- signs * runif(m_seperate*2, min=min_signal, max=max_signal) signs <- sample(c(-1,1), m_seperate, replace=TRUE) beta[m_common+m_seperate+(seq_len(m_seperate)),K,1] <- signs * runif(m_seperate, min=min_signal, max=max_signal) #age base <- 20 m_common <- 10 max_signal <- 0.015 min_signal <- 0.007 signs <- sample(c(-1,1), m_common*K, replace=TRUE) beta[base+seq_len(m_common),seq_len(K),2] <- signs * runif(m_common*K, min=min_signal, max=max_signal) m_seperate <- 10 signs <- sample(c(-1,1), m_seperate*2, replace=TRUE) beta[base+m_common+seq_len(m_seperate),seq_len(2),2] <- signs * runif(m_seperate*2, min=min_signal, max=max_signal) signs <- sample(c(-1,1), m_seperate, replace=TRUE) beta[base+m_common+m_seperate+seq_len(m_seperate),seq_len(K),2] <- signs * runif(m_seperate, min=min_signal, max=max_signal) #generate the cellular compositions P <- vapply(seq_len(n), function(i){ if(X[1,i]==0){ #if control rDirichlet(c(4,4, 2+X[2,i]/10)) }else{ rDirichlet(c(4,4, 5+X[2,i]/10)) } }, FUN.VALUE = rep(-1, 3)) #generate the observed methylation profiles Ometh <- NULL for(i in seq_len(n)){ utmp <- t(vapply(seq_len(m), function(j){ tmp1 <- colSums(X[ ,i] * t(beta[j, , ])) rnorm(K,mean=mu[j, ]+tmp1,sd=0.01) }, FUN.VALUE = rep(-1, K))) tmp2 <- colSums(P[ ,i] * t(utmp)) Ometh <- cbind(Ometh, tmp2 + rnorm(m, sd = 0.01)) } sum(Ometh > 1) Ometh[Ometh > 1] <- 1 sum(Ometh < 0) Ometh[Ometh < 0] <- 0 ################################################################################################ #Apply HIRE to the simulated EWAS data ################################################################################################ #return list by HIRE ret_list <- HIRE(Ometh, X, num_celltype=K) #case vs control #Visualize the association pattern with the case/control status in the first 100 CpG sites riskCpGpattern(ret_list$pvalues[seq_len(100), c(2,1,3)], main_title="Detected association pattern\n with disease status", hc_row_ind = FALSE) #c(2,1,3) was used because of the label switching #age #Visualize the association pattern with the age in the first 100 CpG sites riskCpGpattern(ret_list$pvalues[seq_len(100), K+c(2,1,3)], main_title="Detected association pattern\n with age", hc_row_ind = FALSE)
The detected association pattern is a way to visualize the p-values providied by the HIRE function.
riskCpGpattern(pval_matr, main_title = "Detected association pattern", hc_row_ind = FALSE)
riskCpGpattern(pval_matr, main_title = "Detected association pattern", hc_row_ind = FALSE)
pval_matr |
the p-value matrix for one phenotype, where one row represents a CpG site and one column indicates one cell type. |
main_title |
the title name. The default is "Detected association pattern". |
hc_row_ind |
whether we conduct hierarchical clustering in the row. The default is FALSE. |
This function depends on the heatmap.2 function in the gplots R package.
return a heatmap
Xiangyu Luo
#a p-value matrix from the uniform distribution pvalues <- matrix(runif(600), 100, 6) #Visualize this p-value matrix riskCpGpattern(pvalues, main_title="An example", hc_row_ind = FALSE)
#a p-value matrix from the uniform distribution pvalues <- matrix(runif(600), 100, 6) #Visualize this p-value matrix riskCpGpattern(pvalues, main_title="An example", hc_row_ind = FALSE)