---
title: "methyLImp2 vignette"
author: "Anna Plaksienko"
package: methyLImp2
output: 
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{methyLImp2 vignette}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}  
---

```{r knitr options, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message = FALSE}
library(methyLImp2)
library(SummarizedExperiment)
library(BiocParallel)
```
# Introduction

DNA methylation profiles often contain multiple missing values due to some technological problems. Meanwhile, methods for downstream analysis of such data require you to perform imputation of those missing values first. Although several imputation methods exist, most are inefficient with large dimensions and are not specific to methylation datasets. Previously, we have introduced *methyLImp* - imputation method that exploits inter-sample correlation in the methylation data in its model and also does imputation simultaneously for all CpGs with the same missingness pattern (i.e. NAs in the same samples). Although it demonstrated great performance in comparison with other methods (see [here](https://pubmed.ncbi.nlm.nih.gov/30796811/)), there was still a room for improvement in terms of the running time. Therefore, we upgraded it to *methyLImp2*!

This version is parallelized over chromosomes and, optionally, in case of large sample size, can use mini-batch approach. In our simulation study these two improvements allowed to reduce the running time from more than a day to just half an hour for the EPIC dataset of 456 samples!

In this vignette we will demonstrate the usage of the *methyLImp2* package for the imputation of the missing values in the methylation dataset. More details about the method itself can be found in [the manuscript](https://academic.oup.com/bioinformatics/article/40/1/btae001/7517106). If you use our package in your research and wish to support us, kindly cite the paper (see `citation(package = "methyLImp2")`).

## Installation
To install *methyLImp2* from Bioconductor, run
```{r installation, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("methyLImp2")
```


# Using *methyLImp2*

In this section we demonstrate the usage of the *methyLImp2* method. This section consists of four parts: data description, generation of the artificial missing values into example dataset (since we need to know the true values for the performance evaluation), application of *methyLImp2* method and then performance evaluation (we compare imputed values with the true ones). 

## About the data

Typically, imputation of the missing values is done before any other steps, like normalization, data exploration, analysis etc. Note that some packages do imputation while loading .idat files into R, so then you should set that step accordingly, to be able to use *methyLImp* after loading the data.

Here we use a subset of GSE199057 Gene Expression Omnibus dataset with mucosa samples from non-colon-cancer patients.  Methylation data were measured on EPIC arrays. For the sake of vignette and due to package size restrictions, we reduce the number of samples from 68 to randomly sampled 24. We also restricted the dataset to the two shortest chromosomes, 18 and 21, using only a quarter of probes from each, and filtered out SNPs, so the total number of probes is 6064. We refer the reader to our [simulation studies](https://github.com/annaplaksienko/methyLImp2_simulation_studies) for full size dataset performance and running time results. 

This dataset is a numeric matrix. You can use it as an input for *methyLImp2*. Data should be in a "tidy" format, i.e. with samples in rows and variables (probes, CpGs) in columns. However, if you use another possible input format, SummarizedExperiment, dimensions should be reversed (see later).

```{r load the dataset}
data(beta, package = "methyLImp2")
print(dim(beta))
```


## Missing values generation

Here we add artificial NAs into the dataset and memorize their positions so that we can later evaluate the performance of the method. We first randomly chose 3\% of probes to have artificial NAs. Then, for each probe, we randomly defined the number of NAs from a Poisson distribution with $\lambda$ appropriate to the sample size of the dataset (here for 24 samples we use $\lambda = 3.5$). Finally, these NAs are randomly placed among the samples. 

`generateMissingData` function returns a list of two objects: matrix of beta values with artificial NAs and a list of positions of those NAs (columns and then rows for each column). We'll need that second object to be able to evaluate performance later (since the dataset already has some ''natural'' and, therefore, "unevaluatable" NAs, we can't just directly compare two matrices, we need to know which entries to compare.)

```{r generate artificial NAs}
with_missing_data <- generateMissingData(beta, lambda = 3.5)
beta_with_nas <- with_missing_data$beta_with_nas
na_positions <- with_missing_data$na_positions
```

As already mentioned, you can provide a numeric matrix as an argument for *methyLImp2*. However, as many users work with SummarizedExperiment objects in the Bioconductor workflow, we will construct such object here for the sake of demonstration. Note how we transpose the matrix since typical assay format is variables in rows and samples in columns. 
```{r construct SE}
data(beta_meta)
beta_SE <- SummarizedExperiment(assays = SimpleList(beta = t(beta_with_nas)), 
                                colData = beta_meta)
```

## Using methyLImp2

Now, let's run *methyLImp2*! You start by providing either a numeric data matrix with missing values, with samples in rows and variables (probes) in columns, or a SummarizedExperiment object, from which the first assays slot will be imputed. You also need to provide the type of your data - 450k or EPIC. These are the only two arguments you have to provide. However, there are a few other things you can tune so that the method suits your needs best: 

* Specify what groups you have in your data. *methyLImp2* works best when imputation is done on each group of samples independently. Therefore, you should specify the correspondence of samples to groups. *methyLImp2* will split the data by groups, perform imputation and put the data back together. Here we've already pre-filtered the data to only one group, so we do not use this argument;

* Specify the type of data as "user-provided". Type of data (450k or EPIC) is used to split CpGs across chromosomes. Match of CpGs to chromosomes is taken from ChAMPdata package (there, in turn, it is from Illumina website). Therefore, if you wish to provide your own match, specify "user" in the type argument and provide a data frame with in the annotation argument; 

* Set up mini-batch computation. If your dataset is quite big sample-wise, you can opt to use only a fraction of samples for the imputation to decrease the running time, say 10, 20, 50% (depending on the original number of samples). Subsample will be chosen randomly for each calculation. The bigger the subsample - the better is the performance, hence by default it is 1. An option to improve the performance but still keep the running time low is to repeat computation for a (randomly chosen) fraction of samples several times, maybe 2, 3, 5. In the manuscript we explore how these two tuning parameters influence the running time and the performance.

* Choose the number of cores/workers. *methyLImp2* first splits the data by chromosomes and then does imputation for each subset in parallel. The default of the `BiocParallel` package is two less than total number of your cores. You can change it to your chosen N by setting `BPPARAM = SnowParam(workers = N)`. Since here we have only two chromosomes, *methyLImp2* will give a warning that it overwrote default settings. Note that we also encourage you to set `exportglobals = FALSE` since it allows to slightly speed up the computation; 

* Choose not to overwrite your data with the imputed one. Here we opt to use the default setting and overwrite the existing assay of the SummarizedExperiment object to save memory. However, you can set `overwrite_res = FALSE` and then another slot will be added.

```{r run methyLImp}
time <- system.time(beta_SE_imputed <- methyLImp2(input = beta_SE, 
                                               type = "EPIC", 
                                               BPPARAM = SnowParam(exportglobals = FALSE),
                                               minibatch_frac = 0.5))
print(paste0("Runtime was ", round(time[3], digits = 2), " seconds."))
```

## Performance evaluation

Now we evaluate the performance of the algorithm with root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) - the lower the values are, the better - and Pearson correlation coefficient (PCC) - closer to 1, the better.

```{r evaluate performance}
performance <- evaluatePerformance(beta, t(assays(beta_SE_imputed)[[1]]), 
                                   na_positions)
print(performance)
```


# Session info

```{r sessionInfo}
sessionInfo()
```