--- title: "*betaHMM* package: Quick-start guide" author: "Koyel Majumdar, Isobel Claire Gormley, Thomas Brendan Murphy, Romina Silva, Antoinette Sabrina Perry, Ronald William Watson, Florence Jaffrézic, Andrea Rau" output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default always_allow_html: true vignette: > %\VignetteIndexEntry{betaHMM} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}{inputenc} --- ```{r setup, include = FALSE} knitr::include_graphics("betaHMM_hex.png") options(rmarkdown.html_vignette.check_title = FALSE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) knitr::opts_chunk$set(echo = TRUE) knitr::opts_chunk$set(eval = TRUE) #knitr::opts_chunk$set(dev = 'png') #knitr::opts_chunk$set(dpi=100) ``` # Installation To install this package, start R (version "4.3") and enter: ```{r biocsetup, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") # The following initializes usage of Bioc devel BiocManager::install(version='devel') BiocManager::install("betaHMM") ``` # Introduction DNA methylation, the addition of a methyl group to a cytosine-guanine dinucleotide (CpG) site, is influenced by environmental factors and serves as a disease biomarker. In diploid individuals, CpG sites are hypermethylated (both strands methylated), hypomethylated (neither strand methylated), or hemimethylated (one strand methylated). Identifying differentially methylated CpG sites (DMC) and regions (DMR) can reveal the impact of environmental stressors. Methylation levels, called beta values, represent the proportion of methylated probes, usually modeled with beta distributions. They are often logit transformed into M-values for modeling. To directly model beta values and account for spatial correlation between CpG sites, we propose a homogeneous hidden Markov model (HMM) approach called betaHMM. It identifies DMCs and DMRs while considering spatial dependency and sample relationships. Simulation and prostate cancer data demonstrate its effectiveness. Through our submission of betaHMM to Bioconductor, we intend to contribute to the open-source software ecosystem, supporting robust and reproducible data analysis for both established and emerging biological assays. This document gives a quick tour of the functionalities in **betaHMM**. See `help(package="betaHMM")` for further details and references provided by `citation("betaHMM")`. # Walk through ## Prerequisites Before starting the **betaHMM** walk through, the user should have a working R software environment installed on their machine. The **betaHMM** package has the following dependencies which, if not already installed on the machine will automatically be installed along with the package: **stats, ggplot2, utils, scales, methods, pROC, foreach, doParallel, cowplot, dplyr, tidyr, stringr**. Assuming that the user has the **betaHMM** package installed, the user first needs to load the package: ```{r package, include=TRUE, echo=TRUE, message=FALSE,warning=FALSE} library(betaHMM) ``` ## Loading the data ## Loading the methylation and annotation data The **betaHMM** software package offers a pre-processed methylation dataset containing beta values obtained from DNA samples collected from four patients with high-grade prostate cancer. These samples encompass both benign and tumor prostate tissues and were subjected to methylation profiling using Infinium MethylationEPIC Beadchip technology. The dataset comprises DNA samples from R = 2 treatment conditions for each of N = 4 patients, with each DNA sample providing beta values for C = 694,820 CpG sites. This data collection was part of a study on prostate cancer methylomics (Silva et al. 2020). For testing purposes, a subset of CpG sites from chromosome 7 has been included in the package. This package also provides a subset of the EPIC annotation file, which users need to input into the **betaHMM** function. Users can load these two datasets from the package and inspect the first six rows in the dataframes using the following procedure: ```{r data,include=TRUE, echo=TRUE} data(pca_methylation_data) head(pca_methylation_data) data(annotation_data) head(annotation_data) ``` ## The betaHMM workflow The betaHMM model, which is employed to identify DMCs (differentially methylated CpG sites) and DMRs (differentially methylated regions), consists of three crucial functions. The entire process is carried out separately for each chromosome and involves the following steps: \begin{itemize} \item the \textbf{betaHMM} function: This function handles model parameter and hidden state estimation.It comprises three main steps: initialization, the Baum-Welch algorithm, and the Viterbi algorithm. \item The \textbf{DMC identification} function: This function utilizes the output from the betaHMM and identifies the hidden states that are differentially methylated, along with the DMCs themselves. The selection of DMCs is based on the area-under-curve (AUC) method. \item The \textbf{DMR identification} function: This function operates on the output from the DMC identification function. It includes a user-defined parameter specifying the minimum number of adjacent CpGs required to form a DMR. The output of this function is a dataframe that contains information about the number of DMCs within a DMR, the CpG sites involved, and the starting and ending locations of the DMR. \end{itemize} ## Model parameter estimation The initial phase of the workflow involves employing the Baum-Welch algorithm to estimate the model parameters. Subsequently, we apply the Viterbi algorithm to determine the most probable sequence of hidden states. ```{r betaHMM,include=TRUE, echo=TRUE} M <- 3 ## No. of methylation states in a DNA sample type N <- 4 ## No. of patients R <- 2 ## No. of treatment conditions my.seed <- 321 ## set seed for reproducibility betaHMM_out <- betaHMM(pca_methylation_data, annotation_data, M = 3, N = 4, R = 2, parallel_process = FALSE, seed = my.seed, treatment_group = c("Benign","Tumour")) ``` ## Summary of model parameters The resulting output of a call to betaHMM is an S4 object of class betaHMMResults. ```{r betaHMMclass,include=TRUE, echo=TRUE} class(betaHMM_out) ``` The parameters estimated can be displayed using the following S4 methods: ```{r betaHMMaccessor,include=TRUE, echo=TRUE} ## transition matrix estimated for all chromosomes A(betaHMM_out) ## Shape parameters estimated for a certain chromosome phi(betaHMM_out) ## Hidden states assigned to all CpG sites for a certain chromosome head(hidden_states(betaHMM_out)[["chr 7"]]) ``` A summary of the model parameters estimated for each chromosome can be obtained as below: ```{r betaHMMsummary,include=TRUE, echo=TRUE} summary(betaHMM_out) ``` ## DMC identification After estimating the parameters and hidden states, we proceed to calculate the AUC metric, which quantifies the dissimilarities between the cumulative distributions estimated for each hidden state within each chromosome. A user-defined threshold for the AUC metric is then applied to identify the most differentially methylated hidden states. Additionally, we utilize a user-defined threshold for the measure of uncertainty regarding membership in these highly differentially methylated hidden states to pinpoint the most differentially methylated CpG sites. To access the dataframe containing information about CpG site locations, methylation values, hidden state assignments, and a flag indicating DMC status, you can use the S4 `assay` command. ```{r dmc,include=TRUE, echo=TRUE} dmc_out <- dmc_identification(betaHMM_out) dmc_df <- assay(dmc_out) head(dmc_df) ``` ## Summary of the DMCs identified ```{r dmcsummary,include=TRUE, echo=TRUE} summary(dmc_out) ``` ## Plot the density estimates of the model parameter estimates The fitted density estimates, kernel density estimates, and the uncertainty in the hidden state assignment can be observed using the plot function for the betaHMM output. Since the parameters are estimated individually for each chromosome, one can generate plots for each chromosome separately. To specify the chromosome of interest, the user should utilize the `chromosome` parameter within the function. Additionally, the AUC metrics calculated for the hidden states can also be displayed using the `AUC` parameter. By providing the AUC values obtained through the `dmc_identification` function as an input parameter, the plot will depict the AUC metrics corresponding to the selected chromosome in the plot panels. ```{r betaHMMplot,include=TRUE,echo=TRUE,fig.width=8,fig.height=5,dev='png'} AUC_chr <- AUC(dmc_out) plot(betaHMM_out, chromosome = "7", what = "fitted density", AUC = AUC_chr) ``` The visualization of uncertainties in hidden state estimation is accomplished using a boxplot. Additionally, users have the option to input the threshold of uncertainty intended for DMC identification into the plotting function. This allows for the incorporation of the specified threshold into the visualization of uncertainties. ```{r betaHMMplot2,include=TRUE,echo=TRUE,fig.width=6,fig.height=5,dev='png'} plot(betaHMM_out, chromosome = "7", what = "uncertainty", uncertainty_threshold = 0.2) ``` ## DMR identification from DMCs identified Spatially correlated CpG sites give rise to clusters of CpG sites with similar methylation states, leading to the formation of biologically significant regions known as differentially methylated regions (DMRs). To define the number of adjacent DMCs required to identify a DMR, users can utilize the user-defined parameter `DMC_count`, which is set to a default value of 2. If no value is specified, the function automatically identifies regions within a chromosome containing two or more adjacent DMCs as DMRs. The DMR location information, along with the DMCs within each DMR, is presented as an S4 output of class `dmrResults`. Accessing this output can be achieved using the S4 assay method. ```{r dmr,include=TRUE, echo=TRUE} dmr_out <- dmr_identification(dmc_out, parallel_process = FALSE) dmr_df <- assay(dmr_out) head(dmr_df) ``` ## Summary of the DMRs identified ```{r dmrsummary,include=TRUE, echo=TRUE} summary(dmr_out) ``` ## Plot to visualise the DMCs and DMRs The S4 plot method has the capability to utilize the output generated by the `dmc_identification` function for plotting methylation values along with the uncertainty associated with being identified as a DMC. To create such plots, users must provide the starting CpG site's IlmnID as an input to the `start_CpG` parameter. Additionally, the `end_CpG` parameter can either take the IlmnID of the ending CpG site to be plotted or the number of CpG sites to be plotted, excluding the starting CpG site. It is worth noting that even though a CpG site may exhibit a very low uncertainty in being associated with the hidden state identified as the most differentially methylated, it might not be selected as a DMC if it falls below the threshold value set by the user. ```{r dmrplot,include=TRUE,echo=TRUE,fig.width=7,fig.height = 5, dev = 'png'} plot(dmc_out, start_CpG = "cg17750844", end_CpG = 15) ``` ## Threshold identification in DNA samples belonging to a specific condition The initialization of the intricate \eqn{K} hidden states betaHMM model, aimed at identifying differentially methylated hidden states across DNA samples collected from multiple biological conditions, begins with the utilization of a simpler 3-state betaHMM model for parameter estimation within a single treatment condition for identifying the 3 known methylation states (hypomethylation, hemimethylation and hypermethylation). Subsequently, these estimated parameters are amalgamated to construct the \eqn{3^R} hidden states parameter model. The function employed to estimate the 3-state betaHMM also provides an objective estimation of the threshold methylation value distinguishing the three methylation states. This function can be executed independently to analyze the data distribution within DNA samples originating from a single biological condition. ```{r threshold,include=TRUE, echo=TRUE} threshold_out <- threshold_identification(pca_methylation_data[,1:5], package_workflow = FALSE, annotation_file = annotation_data, M = 3, N = 4, parameter_estimation_only = FALSE, seed = my.seed) threshold(threshold_out) ``` ## Plotting the results from threshold identification function ```{r thresholdplot,include=TRUE,echo=TRUE,fig.width=5,fig.height=4,dev='png'} plot(threshold_out, plot_threshold = TRUE, what = "fitted density") ``` ```{r} sessionInfo() ```