Title: | Missing value estimation of DNA methylation data |
---|---|
Description: | This package allows to estimate missing values in DNA methylation data. methyLImp method is based on linear regression since methylation levels show a high degree of inter-sample correlation. Implementation is parallelised over chromosomes since probes on different chromosomes are usually independent. Mini-batch approach to reduce the runtime in case of large number of samples is available. |
Authors: | Pietro Di Lena [aut] , Anna Plaksienko [aut, cre] , Claudia Angelini [aut] , Christine Nardini [aut] |
Maintainer: | Anna Plaksienko <[email protected]> |
License: | GPL-3 |
Version: | 1.3.0 |
Built: | 2024-12-18 03:25:22 UTC |
Source: | https://github.com/bioc/methyLImp2 |
The GSE199057 Gene Expression Omnibus dataset contains 68 mucosa samples from non-colon-cancer patients, from which we randomly sampled 24. Methylation data were measured on EPIC arrays and after removal of sex chromosomes and SNPs loci, it contains 816 126 probes. Pre-processing can be found on the _methyLImp2_ simulation github page https://github.com/annaplaksienko/methyLImp2_simulation_studies. Here we subset only a quarter of probes from two smallest chromosomes (18 and 21) for the sake of demonstration.
data(beta)
data(beta)
A numeric matrix
A numeric matrix
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
The GSE199057 Gene Expression Omnibus dataset contains 68 mucosa samples from non-colon-cancer patients, from which we randomly sampled 24. Dataset contains metadata for those.
data(beta_meta)
data(beta_meta)
A data.frame
A data.frame.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
A snippet from an annotation for 450K methylation dataset from ChAMPdata package. Only 5 CpGs are chosen simply to provide an example of the data frame organization.
data(custom_anno_example)
data(custom_anno_example)
A data.frame
A data.frame.
This function computes performance metrics as an element-wise difference
between two matrices (skipping NA elements that were not imputed):
* ;
*
;
*
(here we omit the true beta-values equal to 0 and their predicted values
to avoid an indeterminate measure);
*
.
evaluatePerformance(beta_true, beta_imputed, na_positions)
evaluatePerformance(beta_true, beta_imputed, na_positions)
beta_true |
first numeric data matrix. |
beta_imputed |
second numeric data matrix |
na_positions |
a list where each element is a list of two elements: column id and ids of rows with NAs in that column (structure matches the output of generateMissingData function). We need this because some NAs in the dataset are from real data and not artificial, so we can't evaluate the performance of the method on them since we do not know real value. Therefore, we need to know the positions of artificial NAs. |
A numerical vector of four numbers: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and Pearson correlation coefficient (PCC).
data(beta) with_missing_data <- generateMissingData(beta, lambda = 3.5) beta_with_nas <- with_missing_data$beta_with_nas na_positions <- with_missing_data$na_positions beta_imputed <- methyLImp2(input = beta_with_nas, type = "EPIC", minibatch_frac = 0.5, BPPARAM = BiocParallel::SnowParam(workers = 1)) evaluatePerformance(beta, beta_imputed, na_positions)
data(beta) with_missing_data <- generateMissingData(beta, lambda = 3.5) beta_with_nas <- with_missing_data$beta_with_nas na_positions <- with_missing_data$na_positions beta_imputed <- methyLImp2(input = beta_with_nas, type = "EPIC", minibatch_frac = 0.5, BPPARAM = BiocParallel::SnowParam(workers = 1)) evaluatePerformance(beta, beta_imputed, na_positions)
This function generates missing values for the simulation purposes
(to apply methyLImp method and then compare the imputed values with
the true ones that have been replaced by NAs).
First, we randomly choose 3% of all probes.
Then for each of the chosen probes, we randomly define the number of NAs
from a Poisson distribution with , appropriate to the sample
size of the dataset (unless specified by the user, here we use
). Finally, these amount of NAs is
randomly placed among the samples.
generateMissingData(beta, lambda = NULL)
generateMissingData(beta, lambda = NULL)
beta |
a numeric data matrix into which one wants to add some missing values |
lambda |
a number, parameter of the Poisson distribution that will indicate how many samples will have missing values in each selected probe. |
A list with two slots: a numeric data matrix with generated NAs in some entries and a list of positions of those NAs.
data(beta) beta_with_nas <- generateMissingData(beta, lambda = 3.5)
data(beta) beta_with_nas <- generateMissingData(beta, lambda = 3.5)
This function performs missing value imputation specific for DNA methylation data. The method is based on linear regression since methylation levels show a high degree of inter-sample correlation. Implementation is parallelised over chromosomes to improve the running time.
methyLImp2( input, which_assay = NULL, type = c("450K", "EPIC", "user"), annotation = NULL, groups = NULL, range = NULL, skip_imputation_ids = NULL, BPPARAM = BiocParallel::bpparam(), minibatch_frac = 1, minibatch_reps = 1, overwrite_res = TRUE )
methyLImp2( input, which_assay = NULL, type = c("450K", "EPIC", "user"), annotation = NULL, groups = NULL, range = NULL, skip_imputation_ids = NULL, BPPARAM = BiocParallel::bpparam(), minibatch_frac = 1, minibatch_reps = 1, overwrite_res = TRUE )
input |
either a numeric data matrix with missing values to be, with named samples in rows and variables (probes) in named columns, or a SummarizedExperiment object, with an assay with variables in rows and samples in columns, as standard. |
which_assay |
a character specifying the name of assay of the SummarizedExperiment object to impute. By default the first one will be imputed. |
type |
a type of data, 450K or EPIC. Type is used to split CpGs across chromosomes. Match of CpGs to chromosomes is taken from ChAMPdata package. If you wish to provide your own match, specify "user" in this argument and provide a data frame in the next argument. |
annotation |
a data frame, user provided match between CpG sites and chromosomes. Must contain two columns: cpg and chr. Choose "user" in the previous argument to be able to provide user annotation. |
groups |
a vector of the same length as the number of samples that
identifies what groups does each sample correspond, e.g. |
range |
a vector of two numbers, |
skip_imputation_ids |
a numeric vector of ids of the columns with NAs
for which not to perform the imputation. If |
BPPARAM |
set of options for parallelization through BiocParallel package.
For details we refer to their documentation. The one thing most users probably
wish to customize is the number of cores. By default it is set to
|
minibatch_frac |
a number between 0 and 1, what fraction of samples to use for mini-batch computation. Remember that if your data has several groups, mini-batch will be applied to each group separately but with the same fraction, so choose it accordingly. However, if your chosen fraction will be smaller than a matrix dimension for some groups, mini-batch will be just ignored. We advise to use mini-batch only if you have large number of samples, order of hundreds. The default is 1 (i.e., 100% of samples are used, no mini-batch). |
minibatch_reps |
a number, how many times to repeat computations with a fraction of samples specified above (more times -> better performance but more runtime). The default is 1 (as a companion to default fraction of 100%, i.e. no mini-batch). |
overwrite_res |
a boolean specifying whether to overwrite an imputed slot
of the SummarizedExperiment object or to add another slot with
imputed data. The default is |
Either a numeric matrix with imputed data or a SummarizedExperiment object.
data(beta) beta_with_nas <- generateMissingData(beta, lambda = 3.5)$beta_with_nas beta_imputed <- methyLImp2(input = beta_with_nas, type = "EPIC", minibatch_frac = 0.5, BPPARAM = BiocParallel::SnowParam(workers = 1))
data(beta) beta_with_nas <- generateMissingData(beta, lambda = 3.5)$beta_with_nas beta_imputed <- methyLImp2(input = beta_with_nas, type = "EPIC", minibatch_frac = 0.5, BPPARAM = BiocParallel::SnowParam(workers = 1))
This function split a given methylation dataset into a list of datasets according to a given annotation.
split_by_chromosomes(data, type = c("450K", "EPIC", "user"), annotation = NULL)
split_by_chromosomes(data, type = c("450K", "EPIC", "user"), annotation = NULL)
data |
a numeric data matrix with with samples in rows and variables (probes) in named columns. |
type |
a type of data, 450K or EPIC. Type is used to split CpGs across chromosomes. Match of CpGs to chromosomes is taken from ChAMPdata package. If you wish to provide your own match, specify "user" in this argument and provide a data frame in the next argument. |
annotation |
a data frame, user provided match between CpG sites and chromosomes. Must contain two columns: cpg and chr. Choose "user" in the previous argument to be able to provide user annotation. |
A list of numeric data matrices where each matrix contains probes from one chromosome.