library(methyLImp2)
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
library(SummarizedExperiment)
library(BiocParallel)
DNA methylation profiles often contain multiple missing values due to some technological problems. Meanwhile, methods for downstream analysis of such data require you to perform imputation of those missing values first. Although several imputation methods exist, most are inefficient with large dimensions and are not specific to methylation datasets. Previously, we have introduced methyLImp - imputation method that exploits inter-sample correlation in the methylation data in its model and also does imputation simultaneously for all CpGs with the same missingness pattern (i.e. NAs in the same samples). Although it demonstrated great performance in comparison with other methods (see here), there was still a room for improvement in terms of the running time. Therefore, we upgraded it to methyLImp2!
This version is parallelized over chromosomes and, optionally, in case of large sample size, can use mini-batch approach. In our simulation study these two improvements allowed to reduce the running time from more than a day to just half an hour for the EPIC dataset of 456 samples!
In this vignette we will demonstrate the usage of the
methyLImp2 package for the imputation of the missing values in
the methylation dataset. More details about the method itself can be
found in the
manuscript. If you use our package in your research and wish to
support us, kindly cite the paper (see
citation(package = "methyLImp2")
).
In this section we demonstrate the usage of the methyLImp2 method. This section consists of four parts: data description, generation of the artificial missing values into example dataset (since we need to know the true values for the performance evaluation), application of methyLImp2 method and then performance evaluation (we compare imputed values with the true ones).
Typically, imputation of the missing values is done before any other steps, like normalization, data exploration, analysis etc. Note that some packages do imputation while loading .idat files into R, so then you should set that step accordingly, to be able to use methyLImp after loading the data.
Here we use a subset of GSE199057 Gene Expression Omnibus dataset with mucosa samples from non-colon-cancer patients. Methylation data were measured on EPIC arrays. For the sake of vignette and due to package size restrictions, we reduce the number of samples from 68 to randomly sampled 24. We also restricted the dataset to the two shortest chromosomes, 18 and 21, using only a quarter of probes from each, and filtered out SNPs, so the total number of probes is 6064. We refer the reader to our simulation studies for full size dataset performance and running time results.
This dataset is a numeric matrix. You can use it as an input for methyLImp2. Data should be in a “tidy” format, i.e. with samples in rows and variables (probes, CpGs) in columns. However, if you use another possible input format, SummarizedExperiment, dimensions should be reversed (see later).
Here we add artificial NAs into the dataset and memorize their positions so that we can later evaluate the performance of the method. We first randomly chose 3% of probes to have artificial NAs. Then, for each probe, we randomly defined the number of NAs from a Poisson distribution with λ appropriate to the sample size of the dataset (here for 24 samples we use λ = 3.5). Finally, these NAs are randomly placed among the samples.
generateMissingData
function returns a list of two
objects: matrix of beta values with artificial NAs and a list of
positions of those NAs (columns and then rows for each column). We’ll
need that second object to be able to evaluate performance later (since
the dataset already has some ‘’natural’’ and, therefore, “unevaluatable”
NAs, we can’t just directly compare two matrices, we need to know which
entries to compare.)
with_missing_data <- generateMissingData(beta, lambda = 3.5)
#> The input dataset has 3 missing entries.
#> After NA generation, the dataset has 654 missing entries.
beta_with_nas <- with_missing_data$beta_with_nas
na_positions <- with_missing_data$na_positions
As already mentioned, you can provide a numeric matrix as an argument for methyLImp2. However, as many users work with SummarizedExperiment objects in the Bioconductor workflow, we will construct such object here for the sake of demonstration. Note how we transpose the matrix since typical assay format is variables in rows and samples in columns.
Now, let’s run methyLImp2! You start by providing either a numeric data matrix with missing values, with samples in rows and variables (probes) in columns, or a SummarizedExperiment object, from which the first assays slot will be imputed. You also need to provide the type of your data - 450k or EPIC. These are the only two arguments you have to provide. However, there are a few other things you can tune so that the method suits your needs best:
Specify what groups you have in your data. methyLImp2 works best when imputation is done on each group of samples independently. Therefore, you should specify the correspondence of samples to groups. methyLImp2 will split the data by groups, perform imputation and put the data back together. Here we’ve already pre-filtered the data to only one group, so we do not use this argument;
Specify the type of data as “user-provided”. Type of data (450k or EPIC) is used to split CpGs across chromosomes. Match of CpGs to chromosomes is taken from ChAMPdata package (there, in turn, it comes from Illumina website). Therefore, if you wish to provide your own match, specify “user” in the type argument and then provide an annotation data frame with in the annotation argument (see example below);
cpg | chr | |
---|---|---|
1 | cg13869341 | chr1 |
2 | cg14008030 | chr1 |
67325 | cg03327570 | chr2 |
67326 | cg04948016 | chr2 |
300000 | cg15188574 | chr11 |
Set up mini-batch computation. If your dataset is quite big sample-wise, you can opt to use only a fraction of samples for the imputation to decrease the running time, say 10, 20, 50% (depending on the original number of samples). Subsample will be chosen randomly for each calculation. The bigger the subsample - the better is the performance, hence by default it is 1. An option to improve the performance but still keep the running time low is to repeat computation for a (randomly chosen) fraction of samples several times, maybe 2, 3, 5. In the manuscript we explore how these two tuning parameters influence the running time and the performance.
Choose the number of cores/workers. methyLImp2 first
splits the data by chromosomes and then does imputation for each subset
in parallel. The default of the BiocParallel
package is two
less than total number of your cores. You can change it to your chosen N
by setting BPPARAM = SnowParam(workers = N)
. Since here we
have only two chromosomes, methyLImp2 will give a warning that
it overwrote default settings. Note that we also encourage you to set
exportglobals = FALSE
since it allows to slightly speed up
the computation;
Choose not to overwrite your data with the imputed one. Here we
opt to use the default setting and overwrite the existing assay of the
SummarizedExperiment object to save memory. However, you can set
overwrite_res = FALSE
and then another slot will be
added.
Now we evaluate the performance of the algorithm with root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) - the lower the values are, the better - and Pearson correlation coefficient (PCC) - closer to 1, the better.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BiocParallel_1.41.0 SummarizedExperiment_1.37.0
#> [3] Biobase_2.67.0 MatrixGenerics_1.19.0
#> [5] matrixStats_1.4.1 methyLImp2_1.3.0
#> [7] ChAMPdata_2.38.0 GenomicRanges_1.59.0
#> [9] GenomeInfoDb_1.43.1 IRanges_2.41.1
#> [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [13] generics_0.1.3 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-1 jsonlite_1.8.9 crayon_1.5.3
#> [4] compiler_4.4.2 BiocManager_1.30.25 parallel_4.4.2
#> [7] jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0
#> [10] lattice_0.22-6 R6_2.5.1 XVector_0.47.0
#> [13] S4Arrays_1.7.1 knitr_1.49 DelayedArray_0.33.2
#> [16] snow_0.4-4 maketools_1.3.1 GenomeInfoDbData_1.2.13
#> [19] bslib_0.8.0 rlang_1.1.4 cachem_1.1.0
#> [22] xfun_0.49 sass_0.4.9 sys_3.4.3
#> [25] SparseArray_1.7.2 cli_3.6.3 zlibbioc_1.52.0
#> [28] grid_4.4.2 digest_0.6.37 lifecycle_1.0.4
#> [31] evaluate_1.0.1 corpcor_1.6.10 codetools_0.2-20
#> [34] buildtools_1.0.0 abind_1.4-8 rmarkdown_2.29
#> [37] httr_1.4.7 tools_4.4.2 htmltools_0.5.8.1
#> [40] UCSC.utils_1.3.0