---
title: "Tutorial on Data Sanity and Integrity Checks"
author: 
  - Menglu Liang$^1$, Huang Lin$^1$
  - $^1$University of Maryland, College Park, MD 20742
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: rmarkdown::html_vignette
bibliography: bibliography.bib
vignette: >
  %\VignetteIndexEntry{Tutorial on Data Sanity and Integrity Checks}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, message = FALSE, warning = FALSE, comment = NA}
knitr::opts_chunk$set(warning = FALSE, comment = NA, 
                      fig.width = 6.25, fig.height = 5)
library(ANCOMBC)
library(tidyverse)
```

# 1. Introduction

The `data_sanity_check` function validates the input data before further
processing. It verifies data types, confirms the structure of the input data,
and checks that sample names agree between the metadata and the feature table,
guarding against common data input errors.

# 2. Installation

Install the package from Bioconductor.

```{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ANCOMBC")
```

Load the package.

```{r load, eval=FALSE}
library(ANCOMBC)
```

# 3. Examples

## 3.1 Import a `phyloseq` object

The HITChip Atlas dataset contains genus-level microbiota profiling with
HITChip for 1006 Western adults with no reported health complications,
reported in [@lahti2014tipping]. The dataset is available via the microbiome
R package [@lahti2017tools] in phyloseq [@mcmurdie2013phyloseq] format.

```{r}
data(atlas1006, package = "microbiome")

atlas1006
```

List the taxonomic levels available for data aggregation.

```{r}
phyloseq::rank_names(atlas1006)
```

List the variables available in the sample metadata.

```{r}
colnames(microbiome::meta(atlas1006))
```

Data sanity and integrity check.

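Before running the check, it can help to confirm that every variable named in the formula is actually a column of the sample metadata. The sketch below does this in base R on a small toy data frame. The object `toy_meta` and its values are hypothetical, and the code is an illustration of the idea rather than the internal logic of `data_sanity_check`.

```{r}
# Toy metadata standing in for the real sample metadata (hypothetical values).
toy_meta <- data.frame(
  age = c(28, 45, 61),
  sex = c("female", "male", "female"),
  bmi_group = c("lean", "obese", "overweight"),
  row.names = c("Sample-1", "Sample-2", "Sample-3")
)

fix_formula <- "age + sex + bmi_group"

# Extract the variable names used on the right-hand side of the formula.
formula_vars <- all.vars(stats::as.formula(paste("~", fix_formula)))

# Stop early if any formula variable is absent from the metadata.
missing_vars <- setdiff(formula_vars, colnames(toy_meta))
if (length(missing_vars) > 0) {
  stop("Variables not found in metadata: ",
       paste(missing_vars, collapse = ", "))
}

# Count missing values per formula variable.
na_counts <- colSums(is.na(toy_meta[, formula_vars, drop = FALSE]))
print(na_counts)
```

This is only a pre-check on toy data. The call below runs the package's own check on the `phyloseq` object.
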
```{r}
# With `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
```

```{r}
# Without `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = NULL,
                                  struc_zero = FALSE,
                                  global = FALSE,
                                  verbose = TRUE)
```

## 3.2 Import a `tse` object

```{r}
tse = mia::makeTreeSummarizedExperimentFromPhyloseq(atlas1006)
```

List the taxonomic levels available for data aggregation.

```{r}
mia::taxonomyRanks(tse)
```

List the variables available in the sample metadata.

```{r}
colnames(SummarizedExperiment::colData(tse))
```

Data sanity and integrity check.

```{r}
check_results = data_sanity_check(data = tse,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
```

## 3.3 Import a `matrix` or `data.frame`

Both abundance data and sample metadata are required for this import method.

Note that aggregating taxa to higher taxonomic levels is not supported in this
method. Ensure that the data is already aggregated to the desired taxonomic
level before proceeding. If aggregation is needed, consider creating a
`phyloseq` or `tse` object for importing.

```{r}
abundance_data = microbiome::abundances(atlas1006)
meta_data = microbiome::meta(atlas1006)
```

Ensure that the `rownames` of the metadata correspond to the `colnames` of the
abundance data.

```{r}
all(rownames(meta_data) %in% colnames(abundance_data))
```

List the variables available in the sample metadata.

```{r}
colnames(meta_data)
```

Data sanity and integrity check.

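Before running the check on matrix input, note that the `%in%` test above confirms only that every metadata sample appears in the abundance table. It does not report abundance samples that lack metadata, and it does not compare the order of the two sets of names. A stricter comparison can be sketched in base R as follows. The objects `toy_abund` and `toy_meta` are hypothetical stand-ins, and whether `data_sanity_check` needs the names in identical order is not assumed here.

```{r}
# Toy feature table (taxa in rows, samples in columns) and toy metadata
# (samples in rows). Both are hypothetical stand-ins for real inputs.
toy_abund <- matrix(
  c(10, 0, 3,
     5, 8, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Taxon-A", "Taxon-B"), c("S1", "S2", "S3"))
)
toy_meta <- data.frame(
  bmi_group = c("obese", "lean", "overweight"),
  row.names = c("S3", "S1", "S2")
)

# Samples present in one object but not in the other.
only_in_meta  <- setdiff(rownames(toy_meta), colnames(toy_abund))
only_in_abund <- setdiff(colnames(toy_abund), rownames(toy_meta))

# Keep the shared samples and put both objects in the same order.
shared <- intersect(colnames(toy_abund), rownames(toy_meta))
toy_abund <- toy_abund[, shared, drop = FALSE]
toy_meta  <- toy_meta[shared, , drop = FALSE]

# TRUE when sample names now match exactly, including order.
identical(colnames(toy_abund), rownames(toy_meta))
```

The call below then runs the package's own check on the matrix input.
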
```{r}
check_results = data_sanity_check(data = abundance_data,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  meta_data = meta_data,
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
```

# Session information

```{r sessionInfo, message = FALSE, warning = FALSE, comment = NA}
sessionInfo()
```

# References