PhILR is short for “Phylogenetic Isometric Log-Ratio Transform” (Silverman et al. 2017). This package provides functions for the analysis of compositional data (e.g., data representing proportions of different variables/parts). Specifically this package allows analysis of compositional data where the parts can be related through a phylogenetic tree (as is common in microbiota survey data) and makes available the Isometric Log Ratio transform built from the phylogenetic tree and utilizing a weighted reference measure (Egozcue and Pawlowsky-Glahn 2016).
The goal of PhILR is to transform compositional data into an orthogonal unconstrained space (real space) with phylogenetic / evolutionary interpretation while preserving all information contained in the original composition. Unlike in the original compositional space, in the transformed real space, standard statistical tools may be applied. For a given set of samples consisting of measurements of taxa, we transform data into a new space of samples and orthonormal coordinates termed ‘balances’. Each balance is associated with a single internal node of a phylogenetic tree with the taxa as leaves. The balance represents the log-ratio of the geometric mean abundance of the two groups of taxa that descend from the given internal node. More details on this method can be found in Silverman et al. (2017) (Link).
The analysis uses abundance table and a phylogenetic tree. These can
be provided as separate data objects, or embedded in standard
R/Bioconductor data containers. The philr R package supports two
alternative data containers for microbiome data, TreeSE
(Huang et al. 2021) and
phyloseq
(McMurdie and Holmes
2013).
We demonstrate PhILR analysis by using the Global Patterns dataset that was originally published by Caporaso et al. (2011).
Let us first load necessary libraries.
## [1] '1.33.0'
## [1] '5.8'
## [1] '3.5.1'
We show the GlobalPatterns example workflow as initially outlined in (McMurdie and Holmes 2013).
We retrieve the example data in TreeSummarizedExperiment
(TreeSE
) data format in this vignette (Huang et al. 2021), and then show example also
for the phyloseq format. The
TreeSE version for the GlobalPatterns data is provided with the mia
package (Lahti et al. 2020).
Let us load the data.
## [1] '1.15.6'
## [1] '1.1.4'
Taxa that were not seen with more than 3 counts in at least 20% of samples are filtered. Subsequently, those with a coefficient of variation ≤ 3 are filtered. Finally we add a pseudocount of 1 to the remaining OTUs to avoid calculating log-ratios involving zeros. Alternatively other replacement methods (multiplicative replacement etc…) may be used instead if desired; the subsequent taxa weighting procedure we will describe complements a variety of zero replacement methods.
## Select prevalent taxa
tse <- GlobalPatterns %>% subsetByPrevalentTaxa(
detection = 3,
prevalence = 20/100,
as_relative = FALSE)
## Pick taxa that have notable abundance variation across sammples
variable.taxa <- apply(assays(tse)$counts, 1, function(x) sd(x)/mean(x) > 3.0)
tse <- tse[variable.taxa,]
# Collapse the tree!
# Otherwise the original tree with all nodes is kept
# (including those that were filtered out from rowData)
tree <- ape::keep.tip(phy = rowTree(tse), tip = rowLinks(tse)$nodeNum)
rowTree(tse) <- tree
## Add a new assay with a pseudocount
assays(tse)$counts.shifted <- assays(tse)$counts + 1
We have now removed the filtered taxa from the OTU table, pruned the phylogenetic tree, and subset the taxa table. Here is the result of those filtering steps.
## class: TreeSummarizedExperiment
## dim: 1248 26
## metadata(0):
## assays(2): counts counts.shifted
## rownames(1248): 540305 108964 ... 516119 145149
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (1248 rows)
## rowTree: 1 phylo tree(s) (1248 leaves)
## colLinks: NULL
## colTree: NULL
Next we check that the tree is rooted and binary (all multichotomies have been resolved).
## [1] '5.8'
## [1] TRUE
## [1] TRUE
Note that if the tree is not binary, the function
multi2di
from the ape
package can be used to
replace multichotomies with a series of dichotomies with one (or
several) branch(es) of zero length.
Once this is done, we name the internal nodes of the tree so they are
easier to work with. We prefix the node number with n
and
thus the root is named n1
.
tree <- makeNodeLabel(tree, method="number", prefix='n')
# Add the modified tree back to the (`TreeSE`) data object
rowTree(tse) <- tree
We note that the tree is already rooted with Archea as the outgroup
and no multichotomies are present. This uses the function
name.balance
from the philr
package. This
function uses a simple voting scheme to find a consensus naming for the
two clades that descend from a given balance. Specifically for a balance
named x/y
, x
refers to the consensus name of
the clade in the numerator of the log-ratio and y
refers to
the denominator.
# Extract taxonomy table from the TreeSE object
tax <- rowData(tse)[,taxonomyRanks(tse)]
# Get name balances
name.balance(tree, tax, 'n1')
## [1] "Kingdom_Archaea/Kingdom_Bacteria"
Finally we transpose the OTU table (philr
uses the
conventions of the compositions
package for compositional
data analysis in R, taxa are columns, samples are rows). Then we will
take a look at part of the dataset in more detail.
otu.table <- t(as(assays(tse)$counts.shifted, "matrix"))
tree <- rowTree(tse)
metadata <- colData(tse)
tax <- rowData(tse)[,taxonomyRanks(tse)]
otu.table[1:2,1:2] # OTU Table
## 540305 108964
## CL3 1 1
## CC1 1 2
##
## Phylogenetic tree with 1248 tips and 1247 internal nodes.
##
## Tip labels:
## 540305, 108964, 175045, 546313, 54107, 71074, ...
## Node labels:
## n1, n2, n3, n4, n5, n6, ...
##
## Rooted; includes branch lengths.
## DataFrame with 2 rows and 7 columns
## X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
## <factor> <factor> <factor> <factor>
## CL3 CL3 ILBC_01 AACGCA TGCGTT
## CC1 CC1 ILBC_02 AACTCG CGAGTT
## Barcode_full_length SampleType Description
## <factor> <factor> <factor>
## CL3 CTAGCGTGCGT Soil Calhoun South Carolina Pine soil, pH 4.9
## CC1 CATCGACGAGT Soil Cedar Creek Minnesota, grassland, pH 6.1
## DataFrame with 2 rows and 7 columns
## Kingdom Phylum Class Order Family
## <character> <character> <character> <character> <character>
## 540305 Archaea Crenarchaeota Thaumarchaeota Cenarchaeales Cenarchaeaceae
## 108964 Archaea Crenarchaeota Thaumarchaeota Cenarchaeales Cenarchaeaceae
## Genus Species
## <character> <character>
## 540305 NA NA
## 108964 Nitrosopumilus pIVWA5
A new variable distinguishing human/non-human:
The function philr::philr()
implements a user friendly
wrapper for the key steps in the philr transform.
philr::phylo2sbp()
philr::buildilrBasep()
philr::miniclo()
) and ‘shift’ dataset using the weightings
(Egozcue and Pawlowsky-Glahn 2016) using
the function philr::shiftp()
.philr::ilrp()
philr::calculate.blw()
.Note: The preprocessed OTU table should be passed to the function
philr::philr()
before it is closed (normalized) to relative
abundances, as some of the preset weightings of the taxa use the
original count data to down weight low abundance taxa.
Here we will use the same weightings as we used in the main paper.
You can run philr
with the abundance table and
phylogenetic tree.
gp.philr <- philr(otu.table, tree,
part.weights='enorm.x.gm.counts',
ilr.weights='blw.sqrt')
gp.philr[1:5,1:5]
## n1 n2 n3 n4 n5
## CL3 -1.3638521 1.9756259 2.6111996 -3.3174292 0.08335109
## CC1 -0.9441168 3.9054807 2.9804522 -4.7771598 -0.05334306
## SV1 5.8436901 5.9067782 6.7315081 -8.8020849 0.08335109
## M31Fcsw -3.9010427 -0.1816618 -0.5432099 0.1705271 0.08335109
## M11Fcsw -5.4554073 0.5398249 -0.5647474 0.5551616 -0.02389182
Alternatively, you can provide the data directly in
TreeSE
format.
gp.philr <- philr(tse, abund_values = "counts.shifted",
part.weights='enorm.x.gm.counts',
ilr.weights='blw.sqrt')
Alternatively, you can provide the data in phyloseq
format. For simplicity, let us just convert the TreeSE
object to phyloseq
object to give a brief example.
pseq <- convertToPhyloseq(tse, assay.type="counts.shifted")
gp.philr <- philr(pseq,
part.weights='enorm.x.gm.counts',
ilr.weights='blw.sqrt')
After running philr
the transformed data is represented
in terms of balances and since each balance is associated with a single
internal node of the tree, we denote the balances using the same names
we assigned to the internal nodes (e.g., n1
).
Euclidean distance in PhILR space can be used for ordination analysis. Let us first calculate distances and then calculate standard MDS ordination.
Let us next visualize the ordination. This example employs standard
tools for ordination and visualization that can be used regardless of
the preferred data container. Note that the phyloseq
and
TreeSE
frameworks may provide access to additional
ordination and visualization methods.
More than just ordination analysis, PhILR provides an entire
coordinate system in which standard multivariate tools can be used. Here
we will make use of sparse logistic regression (from the
glmnet
package) to identify a small number of balances that
best distinguish human from non-human samples.
Now we will fit a sparse logistic regression model (logistic regression with l1 penalty)
## [1] '4.1.8'
We will use a hard-threshold for the l1 penalty of λ = 0.2526 which we choose so that the resulting number of non-zero coefficients is ≈ 5 (for easy of visualization in this tutorial).
top.coords <- as.matrix(coefficients(glmmod, s=0.2526))
top.coords <- rownames(top.coords)[which(top.coords != 0)]
(top.coords <- top.coords[2:length(top.coords)]) # remove the intercept as a coordinate
## [1] "n16" "n106" "n122" "n188" "n730"
To find the taxonomic labels that correspond to these balances we can
use the function philr::name.balance()
. This funciton uses
a simple voting scheme to name the two descendent clades of a given
balance separately. For a given clade, the taxonomy table is subset to
only contain taxa from that clade. Starting at the finest taxonomic rank
(e.g., species) the subset taxonomy table is checked to see if any label
(e.g., species name) represents ≥ threshold (default 95%) of the table
entries at that taxonomic rank. If no consensus identifier is found, the
table is checked at the next-most specific taxonomic rank (etc…).
## n16
## "Kingdom_Bacteria/Phylum_Firmicutes"
## n106
## "Order_Actinomycetales/Order_Actinomycetales"
## n122
## "Kingdom_Bacteria/Phylum_Cyanobacteria"
## n188
## "Genus_Campylobacter/Phylum_Proteobacteria"
## n730
## "Order_Bacteroidales/Order_Bacteroidales"
We can also get more information on what goes into the naming by viewing the votes directly.
votes <- name.balance(tree, tax, 'n730', return.votes = c('up', 'down'))
votes[[c('up.votes', 'Family')]] # Numerator at Family Level
## votes
## Porphyromonadaceae
## 1
## votes
## Bacteroidaceae Porphyromonadaceae Prevotellaceae Rikenellaceae
## 12 9 10 5
## [1] '3.15.0'
## [1] '1.1.4'
Above we found the top 5 coordinates (balances) that distinguish
whether a sample is from a human or non-human source. Now using the
ggtree
(Yu et al. 2016)
package we can visualize these balances on the tree using the
geom_balance
object. To use these functions we need to know
the acctual node number (not just the names we have given) of these
balances on the tree. To convert between node number and name, we have
added the functions philr::name.to.nn()
and
philr::nn.to.name()
. In addition, it is important that we
know which clade of the balance is in the numerator (+) and which is in
the denominator (-) of the log-ratio. To help us keep track we have
created the function philr::annotate_balance()
which allows
us to easily label these two clades.
tc.nn <- name.to.nn(tree, top.coords)
tc.colors <- c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99')
p <- ggtree(tree, layout='fan') +
geom_balance(node=tc.nn[1], fill=tc.colors[1], alpha=0.6) +
geom_balance(node=tc.nn[2], fill=tc.colors[2], alpha=0.6) +
geom_balance(node=tc.nn[3], fill=tc.colors[3], alpha=0.6) +
geom_balance(node=tc.nn[4], fill=tc.colors[4], alpha=0.6) +
geom_balance(node=tc.nn[5], fill=tc.colors[5], alpha=0.6)
p <- annotate_balance(tree, 'n16', p=p, labels = c('n16+', 'n16-'),
offset.text=0.15, bar=FALSE)
annotate_balance(tree, 'n730', p=p, labels = c('n730+', 'n730-'),
offset.text=0.15, bar=FALSE)
We can also view the distribution of these 5 balances for
human/non-human sources. In order to plot with ggplot2
we
first need to convert the PhILR transformed data to long format. We have
included a function philr::convert_to_long()
for this
purpose.
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the philr package.
## Please report the issue at <https://github.com/jsilve24/philr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Lets just look at balance n16 vs. balance n730 (the ones we annotated in the above tree).
## [1] '1.3.1'
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] tidyr_1.3.1 ggtree_3.15.0
## [3] glmnet_4.1-8 Matrix_1.7-1
## [5] dplyr_1.1.4 mia_1.15.6
## [7] TreeSummarizedExperiment_2.15.0 Biostrings_2.75.1
## [9] XVector_0.47.0 SingleCellExperiment_1.29.1
## [11] MultiAssayExperiment_1.33.1 SummarizedExperiment_1.37.0
## [13] Biobase_2.67.0 GenomicRanges_1.59.1
## [15] GenomeInfoDb_1.43.2 IRanges_2.41.1
## [17] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [19] generics_0.1.3 MatrixGenerics_1.19.0
## [21] matrixStats_1.4.1 ggplot2_3.5.1
## [23] ape_5.8 philr_1.33.0
## [25] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.2 ggplotify_0.1.2
## [3] tibble_3.2.1 rpart_4.1.23
## [5] DirichletMultinomial_1.49.0 lifecycle_1.0.4
## [7] lattice_0.22-6 MASS_7.3-61
## [9] backports_1.5.0 magrittr_2.0.3
## [11] Hmisc_5.2-0 sass_0.4.9
## [13] rmarkdown_2.29 jquerylib_0.1.4
## [15] yaml_2.3.10 DBI_1.2.3
## [17] buildtools_1.0.0 minqa_1.2.8
## [19] ade4_1.7-22 abind_1.4-8
## [21] zlibbioc_1.52.0 quadprog_1.5-8
## [23] purrr_1.0.2 phyloseq_1.51.0
## [25] yulab.utils_0.1.8 nnet_7.3-19
## [27] sandwich_3.1-1 GenomeInfoDbData_1.2.13
## [29] ggrepel_0.9.6 irlba_2.3.5.1
## [31] tidytree_0.4.6 maketools_1.3.1
## [33] vegan_2.6-8 rbiom_1.0.3
## [35] permute_0.9-7 DelayedMatrixStats_1.29.0
## [37] codetools_0.2-20 DelayedArray_0.33.2
## [39] scuttle_1.17.0 shape_1.4.6.1
## [41] tidyselect_1.2.1 aplot_0.2.3
## [43] UCSC.utils_1.3.0 farver_2.1.2
## [45] lme4_1.1-35.5 ScaledMatrix_1.15.0
## [47] viridis_0.6.5 base64enc_0.1-3
## [49] jsonlite_1.8.9 BiocNeighbors_2.1.1
## [51] multtest_2.63.0 decontam_1.27.0
## [53] Formula_1.2-5 survival_3.7-0
## [55] scater_1.35.0 iterators_1.0.14
## [57] foreach_1.5.2 tools_4.4.2
## [59] treeio_1.31.0 Rcpp_1.0.13-1
## [61] glue_1.8.0 gridExtra_2.3
## [63] SparseArray_1.7.2 xfun_0.49
## [65] mgcv_1.9-1 withr_3.0.2
## [67] BiocManager_1.30.25 fastmap_1.2.0
## [69] boot_1.3-31 rhdf5filters_1.19.0
## [71] bluster_1.17.0 fansi_1.0.6
## [73] digest_0.6.37 rsvd_1.0.5
## [75] R6_2.5.1 gridGraphics_0.5-1
## [77] colorspace_2.1-1 lpSolve_5.6.22
## [79] utf8_1.2.4 data.table_1.16.2
## [81] DECIPHER_3.3.0 httr_1.4.7
## [83] htmlwidgets_1.6.4 S4Arrays_1.7.1
## [85] pkgconfig_2.0.3 gtable_0.3.6
## [87] sys_3.4.3 htmltools_0.5.8.1
## [89] biomformat_1.35.0 scales_1.3.0
## [91] ggfun_0.1.7 knitr_1.49
## [93] rstudioapi_0.17.1 reshape2_1.4.4
## [95] checkmate_2.3.2 nlme_3.1-166
## [97] nloptr_2.1.1 cachem_1.1.0
## [99] zoo_1.8-12 rhdf5_2.51.0
## [101] stringr_1.5.1 parallel_4.4.2
## [103] vipor_0.4.7 foreign_0.8-87
## [105] pillar_1.9.0 grid_4.4.2
## [107] vctrs_0.6.5 slam_0.1-55
## [109] BiocSingular_1.23.0 beachmat_2.23.2
## [111] cluster_2.1.6 beeswarm_0.4.0
## [113] htmlTable_2.4.3 evaluate_1.0.1
## [115] mvtnorm_1.3-2 cli_3.6.3
## [117] compiler_4.4.2 rlang_1.1.4
## [119] crayon_1.5.3 labeling_0.4.3
## [121] mediation_4.5.0 plyr_1.8.9
## [123] fs_1.6.5 ggbeeswarm_0.7.2
## [125] stringi_1.8.4 viridisLite_0.4.2
## [127] BiocParallel_1.41.0 munsell_0.5.1
## [129] lazyeval_0.2.2 patchwork_1.3.0
## [131] sparseMatrixStats_1.19.0 Rhdf5lib_1.29.0
## [133] igraph_2.1.1 RcppParallel_5.1.9
## [135] bslib_0.8.0 phangorn_2.12.1
## [137] fastmatch_1.1-4