--- title: "Building Trees from Taxonomies" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{2. Building Trees from Taxonomies} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r opts, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE ) ``` Taxonomy provides useful context for interpreting microbiome data, and taxonomy tables are often provided as output from microbiome sequencing pipelines or extracted from Bioconductor containers such as [phyloseq](https://bioconductor.org/packages/phyloseq/). However, phylobar requires a phylo object as input. This is a more formal representation of a tree than a table of taxonomy assignments. To support the conversion, the package includes a helper function to convert such assignments into the required tree format. Since taxonomy tables lack a standardized representation, small discrepancies can produce invalid trees. This article collects examples from other vignettes of the common issues that can arise when constructing a taxonomy tree and goes into more depth about the strategies that were used to address them. ```{r setup, message=FALSE} library(ape) library(dplyr) library(phylobar) library(phyloseq) library(stringr) library(zen4R) ``` # Checking that that the induced tree is correct Before looking at potential issues, let's review how to check whether an input tree is valid. In the block below, we generate a random tree using the rtree function from ape. We can confirm validity with the `checkValidPhylo` function. If the output is TRUE, the tree can be used for phylobar visualizations. If not, the function provides hints on how to adjust the taxonomy table so that `taxonomy_to_tree` produces a valid tree. ```{r check-valid} tree <- rtree(20) checkValidPhylo(tree) ``` # Introducing a missing root node One common issue is the absence of a root node that links all descendant microbes. Many taxonomy tables implicitly assume the Bacteria kingdom and begin at the phylum level. In such cases, validity checks may fail with an error like the one below. This occurs because each phylum is treated as a separate tree, rather than being connected into a single tree with the Kingdom as the root. ```{r atlas-invalid} data(atlas1006, package = "microbiome") tree <- taxonomy_to_tree(tax_table(atlas1006)) checkValidPhylo(tree) ``` This problem can be resolved by introducing a new "kingdom" column from which all phyla descend. With this fix, `taxonomy_to_tree` correctly builds edges between the first and second columns. Re-running the validity check now produces a valid tree object, which is used in the earlier Atlas vignette. ```{r atlas-fixed} taxa <- tax_table(atlas1006) taxa <- cbind(Kingdom = "Bacteria", taxa) taxa <- phylobar::add_prefix(taxa) tree <- taxonomy_to_tree(taxa) checkValidPhylo(tree) ``` # Skipping over missing taxonomic assignments Another common issue involves missing taxonomic assignments. phylobar can skip missing entries, but only if they are encoded as NA. If missing values are stored as character strings (e.g., "unclassified"), they are treated as valid categories. Since they often appear at multiple levels of resolution and under different parents, this breaks the tree structure. ```{r hfhs-invalid} download_zenodo("10.5281/zenodo.18791960", tempdir()) HFHSdata <- readRDS(str_c(tempdir(), "/HFHS-data.rds")) taxa <- HFHSdata$filtered_taxonomy tree <- taxonomy_to_tree(taxa) checkValidPhylo(tree) ``` To avoid this, missing values should be explicitly coded as NA. This approach is illustrated in our high-fat, high-sugar diet vignette, where the taxonomy table is pre-processed to replace placeholder strings with NA. Here is the taxonomy before any correction. Notice that the NAs are not properly coded -- we see the prefix coming from the taxonomic level, but no corresponding name. ```{r hfhs-before} head(taxa) ``` We address this in the block below, checking for whether the name ended with `__` and replacing those with explicit NA value. ```{r hfhs-clean} taxa <- taxa |> select(-X1, X1) |> mutate(across(everything(), ~if_else(str_ends(., "_"), NA, .))) head(taxa) ``` Once this transformation is made, the table can be used to construct a valid tree. ```{r hfhs-fixed} tree <- taxonomy_to_tree(taxa) checkValidPhylo(tree) ``` # Avoiding duplicated names across different taxonomic levels Another common problem arises when taxonomic names are not explicitly distinguished across different levels of resolution. For example, in the `dietswap` dataset, some phylum- and family-level assignments share the same names. If uncorrected, `taxonomy_to_tree` interprets these edges as loops, which produces an invalid tree. The validity check will fail in this case. To see this, let's first extract the taxonomy table. ```{r dietswap-extract} data("dietswap", package = "microbiome") diet_temp <- subset_samples(dietswap, timepoint == 1) diet <- subset_taxa(diet_temp, taxa_sums(diet_temp) > 0) taxa <- tax_table(diet) ``` Note the repeated phylum and family names. This causes the check to fail. ```{r dietswap-invalid} head(taxa) tree <- taxonomy_to_tree(taxa) checkValidPhylo(tree) ``` To address this, we can add a small prefix that encodes the taxonomic rank of each assignment. The helper function `add_prefix` supports this concatenation. At this stage, however, running the validity check still raises an error. As in the Atlas example, the issue is the absence of a root node connecting the different phyla. Adding a "Kingdom" column resolves this by linking the trees together. ```{r dietswap-fixed} taxa <- phylobar::add_prefix(taxa) taxa <- cbind(Kingdom = "k_Bacteria", taxa) tree <- taxonomy_to_tree(taxa) checkValidPhylo(tree) ``` The issue of duplicated taxonomy names also appears in the Global Patterns dataset. For instance, several phylum- and class-level assignments share identical names. ```{r globalpatterns-extract} data(GlobalPatterns, package = "phyloseq") chlamydiae <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae") taxa <- tax_table(chlamydiae) head(taxa) ``` This duplication results in an invalid phylo. ```{r globalpatterns-invalid} tree <- taxonomy_to_tree(taxa) checkValidPhylo(tree) ``` Again, we can resolve this using the `add_prefix` function. This dataset raises one additional challenge: the ASV names are stored only as row names, not as an explicit column in the taxonomy table. Without this, we cannot reach the leaf nodes of the tree. The solution is to introduce a new column containing the ASV identifiers. ```{r globalpatterns-prepare} taxa <- data.frame(taxa) taxa <- phylobar::add_prefix(taxa) taxa$ASV <- rownames(taxa) head(taxa) ``` After this adjustment, checking the validity of the resulting tree still returns an error. In this case, the error can be safely ignored: it occurs when a tree contains nodes with more than two descendants. phylobar accommodates such multifurcations without issue. ```{r globalpatterns-check} tree <- taxonomy_to_tree(taxa) checkValidPhylo(tree) ``` To verify, we can simply plot the phylo object directly before creating a phylobar visualization. You can check that this a static version of the same tree that we work with in the main Global Patterns vignette. ```{r globalpatterns-plot} plot(tree) ``` ```{r session-info} sessionInfo() ```