| Title: | Microbiome data integration method via shared dictionary learning |
|---|---|
| Description: | MetaDICT is a method for the integration of microbiome data. This method is designed to remove batch effects and preserve biological variation while integrating heterogeneous datasets. MetaDICT can better avoid overcorrection when unobserved confounding variables are present. |
| Authors: | Bo Yuan [aut, cre] (ORCID: <https://orcid.org/0009-0008-5428-4447>), Shulei Wang [aut] |
| Maintainer: | Bo Yuan <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 1.3.0 |
| Built: | 2026-05-30 09:37:02 UTC |
| Source: | https://github.com/bioc/MetaDICT |
A k-nearest-neighbor graph is constructed from the Euclidean distances between the rows of X, and a community detection method is then applied to the graph to identify communities.
Values of k between min_k and max_k are tried, and the one that yields the highest average silhouette score is selected.
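The selection procedure described above can be sketched in a few lines of R. This is an illustration only, not the package's implementation: it assumes the igraph and cluster packages are installed, uses toy data, and the helper name knn_louvain is hypothetical.

```r
library(igraph)   # assumed installed; provides cluster_louvain()
library(cluster)  # provides silhouette()

set.seed(1)
# Toy data: two well-separated groups of 15 objects with 4 features each
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 4),
           matrix(rnorm(60, mean = 6), ncol = 4))
d  <- dist(X)          # Euclidean distances between rows
dm <- as.matrix(d)

# Hypothetical helper: build a symmetric kNN graph and run Louvain on it
knn_louvain <- function(dm, k) {
  n   <- nrow(dm)
  adj <- matrix(0, n, n)
  for (i in seq_len(n)) {
    nb <- order(dm[i, ])[2:(k + 1)]  # k nearest neighbours, skipping the point itself
    adj[i, nb] <- 1
  }
  adj <- pmax(adj, t(adj))           # keep an edge if either endpoint lists the other
  g   <- graph_from_adjacency_matrix(adj, mode = "undirected")
  as.integer(membership(cluster_louvain(g, resolution = 1)))
}

# Try each k and record the average silhouette width of the resulting clustering
ks <- 2:5
scores <- sapply(ks, function(k) {
  cl <- knn_louvain(dm, k)
  n_cl <- length(unique(cl))
  if (n_cl < 2 || n_cl >= nrow(dm)) return(NA_real_)  # silhouette is undefined here
  mean(silhouette(cl, d)[, "sil_width"])
})
best_k <- ks[which.max(scores)]
```

Here ks plays the role of the range from min_k to max_k, and best_k is the value whose clustering would be kept.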
community_detection(X, max_k = 10, method = "Louvain", resolution = 1, min_k = 2)

| Argument | Description |
|---|---|
| X | Input data. Rows represent clustering objects, and columns represent features. |
| max_k | The largest number of connected neighbors. |
| method | The community detection method to use. Options include |
| resolution | The resolution parameter for the Louvain algorithm. |
| min_k | The smallest number of connected neighbors. |

A list with the following components:

| Component | Description |
|---|---|
| cluster | The estimated cluster labels. |
| graph | The |

data(exampleData)
O = exampleData$O
meta = exampleData$meta
dist_mat = exampleData$dist_mat
metadict_res = MetaDICT(O, meta, distance_matrix = dist_mat)
D = metadict_res$D
D_filter = D[, 1:20]
taxa_c = community_detection(D_filter, max_k = 5)
Check the format of inputs.
data_check(count, meta, covariates = "all", distance_matrix = NULL, tree = NULL, taxonomy = NULL, tax_level = NULL, verbose = TRUE)

| Argument | Description |
|---|---|
| count | The integrated count table of taxa by samples. The |
| meta | The integrated meta table |
| covariates | The covariates used in data integration. Default is "all". |
| distance_matrix | The distance matrix that measures sequence dissimilarity. |
| tree | The phylogenetic tree (optional if a distance matrix or taxonomy is provided). |
| taxonomy | The taxonomy table (optional if a distance matrix or phylogenetic tree is provided). |
| tax_level | The taxonomic level of the count table. |
| verbose | Logical; whether to print progress messages. Default is TRUE. |

A list containing the count list, the meta table list, the sequence distance matrix, and the parameters.
A list containing an example OTU table, sample metadata, taxonomy, tree, and sequence distance matrix.
data(exampleData)
A named list with 5 elements: O, dist_mat, meta, taxonomy, tree.
A list containing an example OTU table and sample metadata for new studies.
data(exampleData_transfer)
A named list with 2 elements: new_data, new_meta.
A method for microbiome data integration. This method is designed to remove batch effects and preserve biological variation while integrating heterogeneous datasets. MetaDICT can better avoid overcorrection when unobserved confounding variables are present.
MetaDICT(count, meta, covariates = "all", tree = NULL, taxonomy = NULL, distance_matrix = NULL, tax_level = NULL, customize_parameter = FALSE, alpha = 0.1, beta = 0.01, normalization = "uq", max_iter = 10000, imputation = FALSE, verbose = TRUE, optim_trace = FALSE)

| Argument | Description |
|---|---|
| count | The integrated count table (taxa-by-sample matrix). Should be provided as either a |
| meta | The integrated meta table containing sample information and batch IDs. The table must include a column named 'batch' containing all batch IDs. The row names of the meta table should match the sample names in the count table. |
| covariates | The covariates used in data integration. Default is "all". |
| tree | The phylogenetic tree (optional if a distance matrix or taxonomy is provided). |
| taxonomy | The taxonomy table (optional if a distance matrix or phylogenetic tree is provided). The row names of the taxonomy table should match the taxa names in the count table. |
| distance_matrix | A |
| tax_level | The taxonomic level of the count table. |
| customize_parameter | A logical variable. Set to |
| alpha | A parameter controlling the rank of the final corrected count table. A larger |
| beta | A parameter controlling the smoothness of the estimated measurement efficiency. A larger |
| normalization | The normalization method. Options are |
| max_iter | The maximum number of iterations for the optimization process. Default is 10000. |
| imputation | A logical variable. Whether to allow MetaDICT to perform imputation based on dictionary learning results. Default is FALSE. |
| verbose | A logical variable. Whether to generate verbose output. Default is TRUE. |
| optim_trace | A logical variable. Whether to print optimization steps. Default is FALSE. |

MetaDICT is a two-step approach. It initially estimates the batch effects by covariate balancing, then refines the estimation via shared dictionary learning.
A list with the following components:
count |
( |
D |
( |
R |
( |
w |
( |
meta |
( |
dist_mat |
( |
data(exampleData)
O = exampleData$O
meta = exampleData$meta
dist_mat = exampleData$dist_mat
metadict_res = MetaDICT(O, meta, distance_matrix = dist_mat)
This function adds new studies to an integrated dataset using a pre-learned dictionary. The corrected data can be directly used with machine learning models trained on the previously integrated dataset, enabling seamless application without retraining.
metadict_add_new_data(newdata, newmeta, integrated_result, customize_parameter = FALSE, beta = 0.01, normalization = "uq", max_iter = 10000, imputation = FALSE, verbose = TRUE, optim_trace = FALSE)

| Argument | Description |
|---|---|
| newdata | The integrated count table of new studies. Rows represent taxa, and columns represent samples. Should be provided as either a |
| newmeta | The integrated meta table ( |
| integrated_result | The output list from a previous MetaDICT integration task. |
| customize_parameter | A logical variable. Set to |
| beta | A parameter controlling the smoothness of the estimated measurement efficiency. A larger |
| normalization | The normalization method. Options are |
| max_iter | The maximum number of iterations for the optimization process. Default is 10000. |
| imputation | A logical variable. Whether to allow MetaDICT to perform imputation based on dictionary learning results. Default is FALSE. |
| verbose | A logical variable. Whether to generate verbose output. Default is TRUE. |
| optim_trace | A logical variable. Whether to print optimization steps. Default is FALSE. |

This function estimates measurement efficiency and debiased representations for new studies while keeping the dictionary unchanged.
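To give a feel for what "keeping the dictionary unchanged" means, the sketch below fits representations for new samples against a fixed dictionary by ordinary least squares. It is a deliberately simplified stand-in, not MetaDICT's optimizer: it ignores measurement efficiency entirely, and the matrices D, R_true, and X_new are simulated, hypothetical objects.

```r
set.seed(2)
# Hypothetical pre-learned dictionary: 50 taxa x 3 atoms
D <- matrix(rnorm(50 * 3), nrow = 50)
# Hypothetical representations for 8 new samples
R_true <- matrix(rnorm(3 * 8), nrow = 3)
# Noiseless new data generated from the fixed dictionary (illustration only)
X_new <- D %*% R_true

# With D held fixed, each sample's representation is a least-squares fit;
# qr.solve() returns the least-squares solution for an overdetermined system
R_hat <- qr.solve(D, X_new)

# Reconstruction of the new data from the fixed dictionary
X_fit <- D %*% R_hat
max(abs(R_hat - R_true))
```

Because only the representations are re-estimated, anything downstream that was trained on the dictionary's coordinate system can in principle be reused, which is the motivation given above.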
A list with the following components:
count |
( |
D |
( |
R |
( |
w |
( |
meta |
( |
dist_mat |
( |
data(exampleData)
O = exampleData$O
meta = exampleData$meta
dist_mat = exampleData$dist_mat
metadict_res = MetaDICT(O, meta, distance_matrix = dist_mat)
data("exampleData_transfer")
new_data = exampleData_transfer$new_data
new_meta = exampleData_transfer$new_meta
new_data_res = metadict_add_new_data(new_data, new_meta, metadict_res)
PCoA plots for continuous variables.
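For readers unfamiliar with PCoA, the following base-R sketch shows the general idea under simple assumptions: compute Bray-Curtis dissimilarities between samples, then apply classical multidimensional scaling. It uses simulated counts, omits the R² statistic, and is not the plotting code used by this function.

```r
set.seed(3)
# Hypothetical abundance matrix: 30 taxa (rows) x 12 samples (columns)
X <- matrix(rpois(30 * 12, lambda = 5), nrow = 30)
y <- runif(ncol(X))  # a continuous sample covariate

# Bray-Curtis dissimilarity between samples a and b: sum|a - b| / sum(a + b)
n  <- ncol(X)
bc <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    bc[i, j] <- sum(abs(X[, i] - X[, j])) / sum(X[, i] + X[, j])
  }
}

# Classical multidimensional scaling (PCoA) on the dissimilarities
pc <- cmdscale(as.dist(bc), k = 2)

# Shade points by the covariate: darker grey means a larger value
plot(pc, col = grey(0.9 * (1 - y)), pch = 19,
     xlab = "PCoA 1", ylab = "PCoA 2", main = "Y")
```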
pcoa_plot_continuous(X, covariate, title, R2 = TRUE, dissimilarity = "Bray-Curtis", point_size = 1)

| Argument | Description |
|---|---|
| X | Abundance matrix. Rows represent taxa, and columns represent samples. |
| covariate | A continuous sample covariate. |
| title | The title of the graph. |
| R2 | A logical variable. Whether to display the R² statistic in the subtitle. Default is TRUE. |
| dissimilarity | The dissimilarity type to use. Options include: |
| point_size | The size of the points in the plot. Default is 1. |

A PCoA plot.
data(exampleData)
O = exampleData$O
Y = runif(ncol(O))
pcoa_plot_continuous(O, Y, "Y")
PCoA plots for discrete variables.
pcoa_plot_discrete(X, covariate, title, R2 = TRUE, dissimilarity = "Bray-Curtis", colorset = "Set1", point_size = 1)

| Argument | Description |
|---|---|
| X | Abundance matrix. Rows represent taxa, and columns represent samples. |
| covariate | A discrete sample covariate. |
| title | The title of the graph. |
| R2 | A logical variable. Whether to display the R² statistic in the subtitle. Default is TRUE. |
| dissimilarity | The dissimilarity type. Options include: |
| colorset | The color set for visualization. Default is "Set1". |
| point_size | The size of the points in the plot. Default is 1. |

A PCoA plot.
data(exampleData)
O = exampleData$O
meta = exampleData$meta
batchid = meta$batch
pcoa_plot_discrete(O, batchid, "Batch")
This function produces singular value plots for each input dataset to assess the validity of the low-rank assumption. A rapid decay in the singular values indicates that the dataset can be effectively approximated by matrix factorization.
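The kind of decay to look for can be reproduced with base R on simulated data. The sketch below builds a hypothetical matrix that is rank 3 plus small noise and inspects its singular values; it handles a single matrix only and is not how this function is implemented.

```r
set.seed(4)
# Hypothetical rank-3 signal for 40 taxa x 25 samples, plus small noise
low_rank <- matrix(rnorm(40 * 3), 40, 3) %*% matrix(rnorm(3 * 25), 3, 25)
X <- low_rank + matrix(rnorm(40 * 25, sd = 0.05), 40, 25)

sv <- svd(X)$d                           # singular values, largest first
cum_energy <- cumsum(sv^2) / sum(sv^2)   # share of squared norm held by the top components

# A sharp drop after the first few values suggests a good low-rank approximation
plot(sv, type = "b", xlab = "Index", ylab = "Singular value")
round(cum_energy[1:5], 3)
```

In this simulated case nearly all of the squared norm sits in the first three singular values, which is the pattern that supports a matrix factorization model.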
plot_singular_values(count, meta)

| Argument | Description |
|---|---|
| count | The integrated count table of taxa by samples. The |
| meta | The integrated meta table |

A list of ggplot objects displaying the singular values for each dataset.
data(exampleData)
O = exampleData$O
meta = exampleData$meta
plot_singular_values(O, meta)