Title: | Mechanism-Aware Imputation |
---|---|
Description: | A two-step approach to imputing missing data in metabolomics. Step 1 uses a random forest classifier to classify missing values as either Missing Completely at Random/Missing At Random (MCAR/MAR) or Missing Not At Random (MNAR). MCAR/MAR are combined because it is often difficult to distinguish these two missing types in metabolomics data. Step 2 imputes the missing values based on the classified missing mechanisms, using the appropriate imputation algorithms. Imputation algorithms tested and available for MCAR/MAR include Bayesian Principal Component Analysis (BPCA), Multiple Imputation No-Skip K-Nearest Neighbors (Multi_nsKNN), and Random Forest. Imputation algorithms tested and available for MNAR include nsKNN and a single imputation approach for imputation of metabolites where left-censoring is present. |
Authors: | Jonathan Dekermanjian [aut, cre], Elin Shaddox [aut], Debmalya Nandy [aut], Debashis Ghosh [aut], Katerina Kechris [aut] |
Maintainer: | Jonathan Dekermanjian <[email protected]> |
License: | GPL-3 |
Version: | 1.13.0 |
Built: | 2024-10-30 07:42:28 UTC |
Source: | https://github.com/bioc/MAI |
MAI(data_miss, MCAR_algorithm = c("BPCA", "Multi_nsKNN", "random_forest"), MNAR_algorithm = c("nsKNN", "Single"), n_cores = 1, assay_ix = 1, forest_list_args = list( ntree = 300, proximity = FALSE ), verbose = TRUE )
data_miss |
A matrix, data frame, or SummarizedExperiment containing the missing values to impute, designated by NA |
MCAR_algorithm |
The imputation algorithm used for missing values predicted to be MCAR/MAR. Possible algorithms: c("BPCA", "Multi_nsKNN", "random_forest") |
MNAR_algorithm |
The imputation algorithm used for missing values predicted to be MNAR. Possible algorithms: c("Single", "nsKNN") |
n_cores |
The number of cores to use. Default is 1. To use all cores, specify n_cores = -1. |
assay_ix |
If data_miss is a SummarizedExperiment, this argument defines the index of the assay to impute. Default is the first assay. |
forest_list_args |
Named arguments to pass to the random forest training process. Default arguments are ntree = 300 and proximity = FALSE |
verbose |
Toggles console output; set to FALSE to suppress it. Default is TRUE |
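Because data_miss must mark missing values with NA, raw exports that use another placeholder need recoding first. The following is a minimal base R sketch, assuming a hypothetical export in which zeros stand in for undetected values; the matrix `raw`, its row and column names, and the zero convention are illustrative assumptions and not part of the package.

```r
# Toy intensity matrix with hypothetical names; the layout is illustrative only.
raw <- matrix(
  c(120.5,   0.0,  98.2,
      0.0,  45.1,  47.3,
    310.0, 305.4,   0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("feature", 1:3), paste0("sample", 1:3))
)

# Recode the assumed zero placeholder as NA, the coding data_miss expects.
data_miss <- raw
data_miss[data_miss == 0] <- NA

# Confirm there is something to impute and report the fraction missing.
stopifnot(anyNA(data_miss))
frac_missing <- mean(is.na(data_miss))
print(frac_missing)
```

A matrix prepared this way could then be supplied as the data_miss argument. A 3 x 3 toy matrix is only meant to show the recoding step and is unlikely to be a realistic input for the classifier.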
When data_miss is a matrix or data frame, returns a list containing the following:
Imputed Data |
Returns dataframes of MAI imputation |
Estimated Parameters |
Returns the estimated parameters
When data_miss is a SummarizedExperiment, returns:
Imputed Assay |
Returns the imputed data in the assay specified by assay_ix |
Estimated Parameters |
Returns estimated parameters in the metadata of the Summarized Experiment as a list |
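The element names listed above contain spaces, so they are non-syntactic in R and need `[[ ]]` or backticks rather than a bare `$name`. The sketch below uses a mock list that only mimics the documented shape; it assumes the list names match the headings on this page exactly, and its contents are placeholders, not real MAI output.

```r
# Mock object mimicking the documented list shape (placeholder values only).
mock_result <- list(
  "Imputed Data" = data.frame(sample1 = c(1.2, 3.4), sample2 = c(5.6, 7.8)),
  "Estimated Parameters" = list(placeholder = NA_real_)
)

# Names with spaces need [[ ]] ...
imputed <- mock_result[["Imputed Data"]]
# ... or backticks with $.
params <- mock_result$`Estimated Parameters`

# An imputed data set should contain no remaining NA values.
stopifnot(!anyNA(imputed))
print(dim(imputed))

# For SummarizedExperiment input, the standard accessors would presumably apply
# (not run here, since it needs Bioconductor):
#   SummarizedExperiment::assay(se_result, 1)   # imputed assay
#   S4Vectors::metadata(se_result)              # estimated parameters
# 'se_result' is a hypothetical name for the returned object.
```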
data(untargeted_LCMS_data)
MAI(data_miss = untargeted_LCMS_data, MCAR_algorithm = "BPCA", MNAR_algorithm = "Single", n_cores = 1, assay_ix = 1, forest_list_args = list(ntree = 300, proximity = FALSE), verbose = TRUE)
This data set is randomly generated. We impose 30 percent missing values using the Mixed missingness algorithm developed by Styczynski et al., with the parameters alpha, beta, and gamma chosen to be 30, 70, and 40 percent, respectively.
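To make the idea of imposing a mixture of missingness types concrete, here is a deliberately simplified base R sketch. It is not the Mixed missingness algorithm of Styczynski et al. and does not use the alpha, beta, and gamma parameters. It assumes a hypothetical even split between left-censored (MNAR-like) and uniformly random (MCAR-like) removals, and the matrix size and log-normal distribution are arbitrary choices.

```r
set.seed(1)

# Randomly generated complete data: 50 x 20, sizes chosen arbitrarily.
complete <- matrix(rlnorm(50 * 20, meanlog = 5, sdlog = 1), nrow = 50)

target_frac <- 0.30  # overall fraction of values to remove
mnar_share  <- 0.50  # hypothetical share of removals that are left-censored

n_total <- length(complete)
n_mnar  <- round(target_frac * mnar_share * n_total)
n_mcar  <- round(target_frac * n_total) - n_mnar

data_miss <- complete

# MNAR-like step: left-censor by removing the smallest values overall.
mnar_idx <- order(complete)[seq_len(n_mnar)]
data_miss[mnar_idx] <- NA

# MCAR-like step: remove values uniformly at random from what is still observed.
observed <- which(!is.na(data_miss))
mcar_idx <- sample(observed, n_mcar)
data_miss[mcar_idx] <- NA

# Overall fraction missing after both steps.
print(mean(is.na(data_miss)))
```

Keeping the two index sets separate records which mechanism produced each missing cell, which is the kind of ground truth needed to check a missingness classifier on simulated data.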
Lee JY, Styczynski MP. NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics. 2018;14(12):153.