Title: | Phylogenetic partitioning based ILR transform for metagenomics data |
---|---|
Description: | PhILR is short for Phylogenetic Isometric Log-Ratio Transform. This package provides functions for the analysis of compositional data (e.g., data representing proportions of different variables/parts). Specifically this package allows analysis of compositional data where the parts can be related through a phylogenetic tree (as is common in microbiota survey data) and makes available the Isometric Log Ratio transform built from the phylogenetic tree and utilizing a weighted reference measure. |
Authors: | Justin Silverman [aut, cre], Leo Lahti [ctb] |
Maintainer: | Justin Silverman <[email protected]> |
License: | GPL-3 |
Version: | 1.33.0 |
Built: | 2024-12-30 03:21:34 UTC |
Source: | https://github.com/bioc/philr |
annotate a balance oriented with respect to the PhILR transform.
That is, you can specify labels for the numerator (up
) and
denominator (down
).
annotate_balance( tr, coord, p = NULL, labels = c("+", "-"), offset = 0, offset.text = 0.03, bar = TRUE, barsize = 0.01, barfill = "darkgrey", geom = "text", ... )
annotate_balance( tr, coord, p = NULL, labels = c("+", "-"), offset = 0, offset.text = 0.03, bar = TRUE, barsize = 0.01, barfill = "darkgrey", geom = "text", ... )
tr |
phylo object |
coord |
named internal node/balance to annotate |
p |
ggtree plot (tree layer), if |
labels |
label for the numerator and denominator of the balance respectively |
offset |
offset for bar (if |
offset.text |
offset of text from bar (if |
bar |
logical, should bar for each clade be plotted |
barsize |
width of bar (if |
barfill |
fill of bar |
geom |
geom used to draw label (e.g., |
... |
additional parameters passed to |
ggplot object
Justin Silverman
Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2016, doi:10.1111/2041-210X.12628
tr <- named_rtree(10) annotate_balance(tr, 'n4', size=7) annotate_balance(tr, 'n4', size=7, barsize=0.04, barfill='darkgreen', offset.text=0.05, color='red') annotate_balance(tr, 'n4', bar=FALSE, size=7) annotate_balance(tr, 'n4', bar=TRUE, size=7, labels=c('Num', 'Denom'), offset.text=.3) annotate_balance(tr, 'n4', bar=TRUE, geom='label', size=8, offset.text=0.1)
tr <- named_rtree(10) annotate_balance(tr, 'n4', size=7) annotate_balance(tr, 'n4', size=7, barsize=0.04, barfill='darkgreen', offset.text=0.05, color='red') annotate_balance(tr, 'n4', bar=FALSE, size=7) annotate_balance(tr, 'n4', bar=TRUE, size=7, labels=c('Num', 'Denom'), offset.text=.3) annotate_balance(tr, 'n4', bar=TRUE, geom='label', size=8, offset.text=0.1)
Weighted ILR Contrast Matrix
buildilrBasep(W, p)
buildilrBasep(W, p)
W |
sequantial binary partition matrix (e.g., signary matrix; output of phylo2sbp) |
p |
weights (should not be closed) |
matrix
Justin Silverman (adapted from compositions::gsi.buildilrBase)
J. J. Egozcue, V. Pawlowsky-Glahn (2016) Changing the Reference Measure in the Simplex and its Weighting Effects. Austrian Journal of Statistics 45(4):25-44
p <- seq(.1,1,by=.2) tr <- named_rtree(5) sbp <- phylo2sbp(tr) buildilrBasep(sbp, p)
p <- seq(.1,1,by=.2) tr <- named_rtree(5) sbp <- phylo2sbp(tr) buildilrBasep(sbp, p)
Calculates the weightings for ILR coordinates based on branch lenghts of a phylogenetic tree via a few different methods (see details).
calculate.blw(tree, method = "sum.children")
calculate.blw(tree, method = "sum.children")
tree |
a |
method |
options include: (default) |
ILR balances built from a binary partition of a phylogenetic tree
can be imbued with branch length information. This function is helpful in
calculating those weightings.
There are a number of methods for calculating these weightings, the default
'sum.children'
calculates the weighting for a given balance as the sum
of its two direct children's branch length. An alternative that has been
as yet less studied is 'mean.descendants'
to calculate the weighting
for a given balance as the sum of its two direct children's branch lengths
PLUS for each child the average distance from it to its descendant tips.
Note: That some trees contain tips with branch lengths of zero length.
This can result in that tip being unreasonably downweighted, as such this
function automatically adds a small pseudocount to those tips with zero
length (equal to the smallest non-zero)
branch length on the tree.
vector of weightings for ILR coordinates produced via specified method.
Justin Silverman
tr <- named_rtree(50) calculate.blw(tr, method='sum.children')[1:10] calculate.blw(tr, method='mean.descendants')[1:10]
tr <- named_rtree(50) calculate.blw(tr, method='sum.children')[1:10] calculate.blw(tr, method='mean.descendants')[1:10]
Weighted CLR Transform
clrp(y, p) clrpInv(y.star)
clrp(y, p) clrpInv(y.star)
y |
shifted data matrix (e.g., output of shiftp) |
p |
weights (should not be closed) |
y.star |
a data matrix that represents data transformed by |
Note that this function will close the dataset y
to 1.
Inverting clrp
transform should be followed by shiftpInv
to return to unshifted original compositoin (see examples).
matrix
Justin Silverman
J. J. Egozcue, V. Pawlowsky-Glahn (2016) Changing the Reference Measure in the Simplex and its Weighting Effects. Austrian Journal of Statistics 45(4):25-44
p <- seq(.1,1,by=.2) c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) y <- shiftp(x, p) y.star <- clrp(y, p) y.star # Untransform data (note use of shiftp and miniclo to return to x) y.closed <- clrpInv(y.star) all.equal(miniclo(shiftpInv(y.closed, p)), x)
p <- seq(.1,1,by=.2) c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) y <- shiftp(x, p) y.star <- clrp(y, p) y.star # Untransform data (note use of shiftp and miniclo to return to x) y.closed <- clrpInv(y.star) all.equal(miniclo(shiftpInv(y.closed, p)), x)
Converts wide format ILR transformed data (see philr
) to
long format useful in various plotting functions where long format data is
required.
convert_to_long(x, labels)
convert_to_long(x, labels)
x |
PhILR transformed data in wide format (samples by balances)
(see |
labels |
vector (of length |
x
in long format with columns
sample
labels
coord
value
tr <- named_rtree(5) x <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add small pseudocount colnames(x) <- tr$tip.label x.philr <- philr(x, tree=tr, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE) convert_to_long(x.philr, rep(c('a','b'), 5))
tr <- named_rtree(5) x <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add small pseudocount colnames(x) <- tr$tip.label x.philr <- philr(x, tree=tr, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE) convert_to_long(x.philr, rep(c('a','b'), 5))
Calculates geometric mean of columns. Does not calculate WEIGHTED geometric means (vs. g.rowMeans)
g.colMeans(x)
g.colMeans(x)
x |
matrix or vector |
vector (geometric mean of columns)
g.rowMeans
philr:::g.colMeans(rbind(c(2,4,4), c(2,4,4)))
philr:::g.colMeans(rbind(c(2,4,4), c(2,4,4)))
Calculates weighted geometric mean (see references). Note
if p=rep(1, nrow(y))
(default) then this is just the
geometric mean of rows.
g.rowMeans(y, p = rep(1, nrow(y)))
g.rowMeans(y, p = rep(1, nrow(y)))
y |
shifted data matrix (e.g., output of shiftp) |
p |
weights (should not be closed) |
vector (weighted geometric mean of rows)
J. J. Egozcue, V. Pawlowsky-Glahn (2016) Changing the Reference Measure in the Simplex and its Weighting Effects. Austrian Journal of Statistics 45(4):25-44
p <- seq(.1,1,by=.2) c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) y <- shiftp(x, p) philr:::g.rowMeans(y, p)
p <- seq(.1,1,by=.2) c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) y <- shiftp(x, p) philr:::g.rowMeans(y, p)
Calculated using weighted CLR transform (clrp)
ilrp(y, p, V) ilrpInv(y.star, V)
ilrp(y, p, V) ilrpInv(y.star, V)
y |
shifted data matrix (e.g., output of shiftp) |
p |
weights (should not be closed) |
V |
weighted contrast matrix (e.g., output of buildilrBasep) |
y.star |
a data matrix that represents data transformed by |
matrix
Justin Silverman
J. J. Egozcue, V. Pawlowsky-Glahn (2016) Changing the Reference Measure in the Simplex and its Weighting Effects. Austrian Journal of Statistics 45(4):25-44
# Weights p <- seq(.1,1,by=.2) # Shifted Composition c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) y <- shiftp(x, p) # Contrast Matrix tr <- named_rtree(5) sbp <- phylo2sbp(tr) V <- buildilrBasep(sbp, p) y.star <- ilrp(y, p, V) y.star # Untransform data (note use of shiftp and miniclo to return to x) y.closed <- ilrpInv(y.star, V) all.equal(miniclo(shiftpInv(y.closed, p)), x, check.attributes=FALSE)
# Weights p <- seq(.1,1,by=.2) # Shifted Composition c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) y <- shiftp(x, p) # Contrast Matrix tr <- named_rtree(5) sbp <- phylo2sbp(tr) V <- buildilrBasep(sbp, p) y.star <- ilrp(y, p, V) y.star # Untransform data (note use of shiftp and miniclo to return to x) y.closed <- ilrpInv(y.star, V) all.equal(miniclo(shiftpInv(y.closed, p)), x, check.attributes=FALSE)
Calculates the mean distance from each internal node to its descendant tips
mean_dist_to_tips(tree)
mean_dist_to_tips(tree)
tree |
a |
This is a function used by calculate.blw
when
method='mean.descendants'
, there this function is called twice, once
for each direct child of a given internal node and the results are summed
for each node.
vector (named if internal nodes are named)
tr <- named_rtree(5) mean_dist_to_tips(tr)
tr <- named_rtree(5) mean_dist_to_tips(tr)
small function to close (aka normalize by proportions, aka total sum scaling)
a dataset to a constant k
(usually taken to be 1). After closure the row sums
of the dataset should sum to k
.
miniclo(c, k = 1)
miniclo(c, k = 1)
c |
dataset to be closed |
k |
closure constant |
matrix (if c is a vector or matrix) or data.frame (if c is a data.frame)
c <- matrix(c(1,2,3,1,2,3,1,2,3), nrow = 3, byrow=TRUE) miniclo(c) miniclo(c, k=2)
c <- matrix(c(1,2,3,1,2,3,1,2,3), nrow = 3, byrow=TRUE) miniclo(c) miniclo(c, k=2)
Useful if you want to convert between node labels (c
), tip labels
(t
) and the internal integer number that identifies that node
(nn
). Particularly for use with plotting libraries.
nn.to.name(tr, x) name.to.nn(tr, x)
nn.to.name(tr, x) name.to.nn(tr, x)
tr |
object of type |
x |
vector of numerics or characters |
vector
tr <- named_rtree(5) name.to.nn(tr, 'n1') name.to.nn(tr,c('n1','n2','t1')) nn.to.name(tr, 1:9)
tr <- named_rtree(5) name.to.nn(tr, 'n1') name.to.nn(tr,c('n1','n2','t1')) nn.to.name(tr, 1:9)
For a given ILR balance (coordinate) assigns a name to the balance based on a provided taxonomy table. This is useful for interpretation of the balances.
name.balance( tr, tax, coord, method = "voting", thresh = 0.95, return.votes = NULL )
name.balance( tr, tax, coord, method = "voting", thresh = 0.95, return.votes = NULL )
tr |
an object of class |
tax |
a matrix/data.frame of taxonomy, rownames should correspond to
|
coord |
the name of a balance/internal node on the tree (given as a string) |
method |
currently only |
thresh |
threshold for assignment of taxonomy to a given part of a
balance (must be greater than 0.5 if |
return.votes |
whether voting results by taxonomic level should be
shown for
|
A bit of terminology:
this is the same as the names of the balances which should be
the same as the names of the internal nodes of tr
this is the child node of coord
that is represented in
the numerator of the coord
balance.
this is the child node of coord
that is represented in
the denominator of the coord
balance
The method 'voting'
assigns the name of the each part of a balance
(e.g., numerator and denominator / each child of coord
) as follows:
First Subset tax
to contain only descendent tips of the given
child of coord
Second At the finest taxonomic (farthest right of tax
) see if
any one taxonomic label is present at or above thresh
. If yes output
that taxonomic label (at that taxonomic level) as the label for that child
of coord
. If no then move to coarser taxonomic level (leftward) and
repeat.
If return.votes=NULL
returns a string of the form
(ex. 'Genus_Bacteroides/Phylum_Firmicutes'). Otherwise returns
a list with the above string as 'name', see Arguments for show.votes
for other optional returned items.
Justin Silverman
tr <- named_rtree(40) tax <- data.frame(Kingdom=rep('A', 40), Phylum=rep(c('B','C'), each=20), Genus=c(sample(c('D','F'),20, replace=TRUE), sample(c('G','H'), 20, replace=TRUE))) rownames(tax) <- tr$tip.label name.balance(tr, tax, 'n1') name.balance(tr, tax, 'n34') name.balance(tr,tax, 'n34', return.votes = c('up', 'down'))
tr <- named_rtree(40) tax <- data.frame(Kingdom=rep('A', 40), Phylum=rep(c('B','C'), each=20), Genus=c(sample(c('D','F'),20, replace=TRUE), sample(c('G','H'), 20, replace=TRUE))) rownames(tax) <- tr$tip.label name.balance(tr, tax, 'n1') name.balance(tr, tax, 'n34') name.balance(tr,tax, 'n34', return.votes = c('up', 'down'))
Internal nodes are named by numbering and adding the prefix 'n'. This function is laregly for use in examples throughout this package.
named_rtree(n)
named_rtree(n)
n |
an integer giving the number of tips in the tree. |
An object of class "phylo"
named_rtree(5)
named_rtree(5)
This is the main function for building the phylogenetic ILR basis, calculating the weightings (of the parts and the ILR coordinates) and then transforming the data.
philr( x, tree = NULL, sbp = NULL, part.weights = "uniform", ilr.weights = "uniform", return.all = FALSE, pseudocount = 0, abund_values = "counts", ... )
philr( x, tree = NULL, sbp = NULL, part.weights = "uniform", ilr.weights = "uniform", return.all = FALSE, pseudocount = 0, abund_values = "counts", ... )
x |
matrix of data to be transformed (samples are rows, compositional parts are columns) - zero must be dealt with either with pseudocount, multiplicative replacement, or another method. |
tree |
a |
sbp |
(Optional) give a precomputed sbp matrix |
part.weights |
weightings for parts, can be a named vector with names corresponding to
|
ilr.weights |
weightings for the ILR coordiantes can be a named vector with names
corresponding to names of internal nodes of
|
return.all |
return all computed parts (e.g., computed sign matrix( |
pseudocount |
optional pseudocount added to observation matrix ('x') to avoid numerical issues from zero values. Default value is 0 which has no effect (allowing the user to handle zeros in their own preffered way before calling philr). Values < 0 given an error. |
abund_values |
A single character value for selecting the
|
... |
other parameters passed to philr.data.frame or philr.TreeSummarizedExperiment |
This is a utility function that pulls together a number of other functions in
philr
. The steps that are executed are as follows:
Create sbp (sign matrix) if not given
Create parts weightings if not given
Shift the dataset with respect to the new reference measure (e.g., part weightings)
Create the basis contrast matrix from the sign matrix and the reference measure
Transform the data based on the contrast matrix and the reference measure
Calculate the specified ILR weightings and multiply each balance by the corresponding weighting
Note for both the reference measure
(part weightings) and the ILR weightings, specifying 'uniform'
will give the same
results as not weighting at all.
Note that some of the prespecified part.weights assume
x
is given as counts and not as relative abundances. Except in this case counts or
relative abundances can be given.
The tree argument is ignored if the x
argument is
assay
or
assay
These objects can include a phylogenetic tree. If the phylogenetic tree
is missing from these objects, it should be integrated directly in these
data objects before running philr
. Alternatively, you can always
provide the abundance matrix and tree separately in their standard formats.
If you have a
assay
,
this can be converted into
assay
,
to incorporate tree information.
matrix if return.all=FALSE
, if return.all=TRUE
then
a list is returned (see above).
Justin Silverman; S3 methods by Leo Lahti
# Prepare example data tr <- named_rtree(5) x <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount colnames(x) <- tr$tip.label philr(x, tr, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE) # Running philr on a TreeSummarizedExperiment object ## Prepare example data library(mia) library(tidyr) data(GlobalPatterns, package="mia") ## Select prevalent taxa tse <- GlobalPatterns %>% subsetByPrevalentTaxa( detection = 3, prevalence = 20/100, as_relative = FALSE) ## Pick taxa that have notable abundance variation across sammples variable.taxa <- apply(assay(tse, "counts"), 1, function(x) sd(x)/mean(x) > 3.0) tse <- tse[variable.taxa,] # Collapse the tree tree <- ape::keep.tip(phy = rowTree(tse), tip = rowLinks(tse)$nodeNum) rowTree(tse) <- tree ## Add a new assay with a pseudocount assays(tse)$counts.shifted <- assay(tse, "counts") + 1 ## Run philr for TreeSummarizedExperiment object ## using the pseudocount data res.tse <- philr(tse, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE, abund_values="counts.shifted") # Running philr on a phyloseq object ## Not run: pseq <- convertToPhyloseq(tse) res.pseq <- philr(pseq, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE, pseudocount=0.5) ## End(Not run)
# Prepare example data tr <- named_rtree(5) x <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount colnames(x) <- tr$tip.label philr(x, tr, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE) # Running philr on a TreeSummarizedExperiment object ## Prepare example data library(mia) library(tidyr) data(GlobalPatterns, package="mia") ## Select prevalent taxa tse <- GlobalPatterns %>% subsetByPrevalentTaxa( detection = 3, prevalence = 20/100, as_relative = FALSE) ## Pick taxa that have notable abundance variation across sammples variable.taxa <- apply(assay(tse, "counts"), 1, function(x) sd(x)/mean(x) > 3.0) tse <- tse[variable.taxa,] # Collapse the tree tree <- ape::keep.tip(phy = rowTree(tse), tip = rowLinks(tse)$nodeNum) rowTree(tse) <- tree ## Add a new assay with a pseudocount assays(tse)$counts.shifted <- assay(tse, "counts") + 1 ## Run philr for TreeSummarizedExperiment object ## using the pseudocount data res.tse <- philr(tse, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE, abund_values="counts.shifted") # Running philr on a phyloseq object ## Not run: pseq <- convertToPhyloseq(tse) res.pseq <- philr(pseq, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=FALSE, pseudocount=0.5) ## End(Not run)
Inverse of PhILR Transform
philrInv( x.ilrp, tree = NULL, sbp = NULL, V = NULL, part.weights = NULL, ilr.weights = NULL )
philrInv( x.ilrp, tree = NULL, sbp = NULL, V = NULL, part.weights = NULL, ilr.weights = NULL )
x.ilrp |
transformed data to which the inverse transform will be applied |
tree |
(optional) to be used to build sbp and contrast matrix (see details) |
sbp |
(optional) the sbp (sequential binary partition) used to build a contrast matrix (see details) |
V |
(optional) the contrast matrix (see details) |
part.weights |
weightings for parts, can be a named vector with names
corresponding to |
ilr.weights |
weightings for the ILR coordiantes can be a named vector
with names corresponding to names of internal nodes of |
This is a utility function for calculating the inverse of the
philr
transform. Note that at least one of the following
parameters must be specified (tree
, sbp
, or V
).
a matrix of compositions (rows are samples, columns are parts), function removes the effects of ilr weights, part weights, and unshifts the composition.
Justin Silverman
tr <- named_rtree(5) x <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add small pseudocount colnames(x) <- tr$tip.label d <- philr(x, tr, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=TRUE) d.inverted <- philrInv(d$x.ilrp, V=d$V, part.weights = d$p, ilr.weights = d$ilr.weights) all.equal(miniclo(x), d.inverted)
tr <- named_rtree(5) x <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add small pseudocount colnames(x) <- tr$tip.label d <- philr(x, tr, part.weights='enorm.x.gm.counts', ilr.weights='blw.sqrt', return.all=TRUE) d.inverted <- philrInv(d$x.ilrp, V=d$V, part.weights = d$p, ilr.weights = d$ilr.weights) all.equal(miniclo(x), d.inverted)
This function converts a binary phylogenetic tree to sequential binary parition to be used to then build an ILR basis for compositional metagenomic data.
phylo2sbp(tr)
phylo2sbp(tr)
tr |
a |
The choice of orientation for a balance (i.e., which of the two descendant clades of an internal node is in the numerator or denominator of the log-ratio) is given by the default of the function phangorn::Children and that choise is used consistently throughout the philr package.
a n by n-1 matrix of the sequential binary partition sign matrix
Justin Silverman
Schliep K.P. 2011. phangorn: phylogenetic analysis in R. Bioinformatics, 27(4) 592-593
tr <- named_rtree(5) phylo2sbp(tr)
tr <- named_rtree(5) phylo2sbp(tr)
Shift must be applied before transformation
shiftp(x, p) shiftpInv(y, p)
shiftp(x, p) shiftpInv(y, p)
x |
closed compositional data matrix (or vector) |
p |
weights (should not be closed) |
y |
the shifted composition output by |
shifted data matrix y
(no closure is applied) rows are
samples, columns are parts
Justin Silverman & J. J. Egozcue
J. J. Egozcue, V. Pawlowsky-Glahn (2016) Changing the Reference Measure in the Simplex and its Weighting Effects. Austrian Journal of Statistics 45(4):25-44
p <- seq(.1,1,by=.2) c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) shiftp(x, p)
p <- seq(.1,1,by=.2) c <- t(rmultinom(10,100,c(.1,.6,.2,.3,.2))) + 0.65 # add a small pseudocount x <- miniclo(c) shiftp(x, p)