Title: | Tools for performing taxonomic assignment |
---|---|
Description: | Tools for performing taxonomic assignment based on phylogeny using pplacer and clst. |
Authors: | Noah Hoffman |
Maintainer: | Noah Hoffman <[email protected]> |
License: | GPL-3 |
Version: | 1.55.0 |
Built: | 2024-10-30 04:41:19 UTC |
Source: | https://github.com/bioc/clstutils |
Tools for performing taxonomic assignment based on phylogeny using pplacer and clst.
Package: | clstutils |
Type: | Package |
Author: | Noah Hoffman <[email protected]> |
License: | GPL3 |
Index:
Noah Hoffman
Maintainer: <[email protected]>
library(clstutils) packageDescription("clstutils")
library(clstutils) packageDescription("clstutils")
Given taxonomic information from a reference package and inter-node
distances from a reference tree, perform classification of one or more
placements provided by pplacer
.
classifyPlacements(taxdata, treedists, placetab, ..., verbose = FALSE, debug = FALSE)
classifyPlacements(taxdata, treedists, placetab, ..., verbose = FALSE, debug = FALSE)
taxdata |
|
treedists |
output of |
placetab |
a data.frame with columns |
... |
extra arguments passed to |
verbose |
writes progress messages to terminal if TRUE |
debug |
be very verbose if TRUE |
The output is a data.frame
describing the taxonomic assignment,
along with a description of the confidence of the classification. See
the man page for classify
for details on the output.
Noah Hoffman
placefile <- system.file('extdata', 'merged.json', package='clstutils') distfile <- system.file('extdata', 'merged.distmat.bz2', package='clstutils') refpkgz <- system.file('extdata', 'vaginal_16s.refpkg.tar.gz', package='clstutils') tmpdir <- tempdir() orig.dir <- getwd() setwd(tmpdir) system(paste("tar --no-same-owner -xzf", refpkgz)) setwd(orig.dir) refpkg <- file.path(tmpdir, "vaginal_16s.refpkg") treedists <- treeDists(distfile=distfile, placefile=placefile) taxdata <- taxonomyFromRefpkg(refpkg, seqnames=rownames(treedists$dmat), lowest_rank="species") placetab <- data.frame(at=49, edge=5.14909e-07, branch=5.14909e-07) result <- classifyPlacements(taxdata, treedists, placetab) result
placefile <- system.file('extdata', 'merged.json', package='clstutils') distfile <- system.file('extdata', 'merged.distmat.bz2', package='clstutils') refpkgz <- system.file('extdata', 'vaginal_16s.refpkg.tar.gz', package='clstutils') tmpdir <- tempdir() orig.dir <- getwd() setwd(tmpdir) system(paste("tar --no-same-owner -xzf", refpkgz)) setwd(orig.dir) refpkg <- file.path(tmpdir, "vaginal_16s.refpkg") treedists <- treeDists(distfile=distfile, placefile=placefile) taxdata <- taxonomyFromRefpkg(refpkg, seqnames=rownames(treedists$dmat), lowest_rank="species") placetab <- data.frame(at=49, edge=5.14909e-07, branch=5.14909e-07) result <- classifyPlacements(taxdata, treedists, placetab) result
Outliers are defined as elements with edge length to the centermost element > cutoff. The distance threshold (cutoff) can be either specified, or calculated as a quantile of all pairwise distances in the matrix.
findOutliers(mat, quant, cutoff)
findOutliers(mat, quant, cutoff)
mat |
square matrix of distances |
quant |
given all pairwise distances x, calculate distance threshold as quantile(x, quant). Values closer to 0 are more stringent. |
cutoff |
an absolute cutoff overriding quant |
Returns a boolean vector corresponding to margin of mat; outliers have a value of TRUE.
Noah Hoffman
library(ape) data(seqs) data(seqdat) dmat <- ape::dist.dna(seqs[seqdat$tax_name == 'Enterococcus faecium',], pairwise.deletion=TRUE, as.matrix=TRUE, model='raw') summary(dmat[lower.tri(dmat)]) outliers <- findOutliers(dmat, cutoff=0.015) table(outliers)
library(ape) data(seqs) data(seqdat) dmat <- ape::dist.dna(seqs[seqdat$tax_name == 'Enterococcus faecium',], pairwise.deletion=TRUE, as.matrix=TRUE, model='raw') summary(dmat[lower.tri(dmat)]) outliers <- findOutliers(dmat, cutoff=0.015) table(outliers)
Given a square matrix of pairwise distances, return indices of N objects with a maximal sum of pairwise distances.
maxDists(mat, idx = NA, N = 1, exclude = rep(FALSE, nrow(mat)), include.center = TRUE)
maxDists(mat, idx = NA, N = 1, exclude = rep(FALSE, nrow(mat)), include.center = TRUE)
mat |
square distance matrix |
idx |
starting indices; if missing, starts with the object with the maximum median distance to all other objects. |
N |
total number of selections; length of idx is subtracted. |
exclude |
boolean vector indicating elements to exclude from the calculation. |
include.center |
includes the "most central" element (ie, the one with the smallest median of pairwise distances to all other elements) if TRUE |
A vector of indices corresponding to the margin of mat
.
Note that it is important to evaluate if the candidate sequences contain outliers (for example, mislabeled sequences), because these will assuredly be included in a maximally diverse set of elements!
Noah Hoffman
library(ape) library(clstutils) data(seqs) data(seqdat) efaecium <- seqdat$tax_name == 'Enterococcus faecium' seqdat <- subset(seqdat, efaecium) seqs <- seqs[efaecium,] dmat <- ape::dist.dna(seqs, pairwise.deletion=TRUE, as.matrix=TRUE, model='raw') ## find a maximally diverse set without first identifying outliers picked <- maxDists(dmat, N=10) picked prettyTree(nj(dmat), groups=ifelse(1:nrow(dmat) %in% picked,'picked','not picked')) ## restrict selected elements to non-outliers outliers <- findOutliers(dmat, cutoff=0.015) picked <- maxDists(dmat, N=10, exclude=outliers) picked prettyTree(nj(dmat), groups=ifelse(1:nrow(dmat) %in% picked,'picked','not picked'), X = outliers)
library(ape) library(clstutils) data(seqs) data(seqdat) efaecium <- seqdat$tax_name == 'Enterococcus faecium' seqdat <- subset(seqdat, efaecium) seqs <- seqs[efaecium,] dmat <- ape::dist.dna(seqs, pairwise.deletion=TRUE, as.matrix=TRUE, model='raw') ## find a maximally diverse set without first identifying outliers picked <- maxDists(dmat, N=10) picked prettyTree(nj(dmat), groups=ifelse(1:nrow(dmat) %in% picked,'picked','not picked')) ## restrict selected elements to non-outliers outliers <- findOutliers(dmat, cutoff=0.015) picked <- maxDists(dmat, N=10, exclude=outliers) picked prettyTree(nj(dmat), groups=ifelse(1:nrow(dmat) %in% picked,'picked','not picked'), X = outliers)
Extends plot.phylo
to draw a
phylogenetic tree with additional annotation.
prettyTree(x, groups, fill, X, O, indices, labels, show = rep(TRUE, length(x)), largs = list(cex = 0.75, bty = "n", yjust = 0.5), parargs = list(mar = c(bottom = 5, left = 2, top = 2, right = ifelse(is.null(largs), 2, 8)), xpd = NA), pointargs = list(), glyphs, shuffleGlyphs = NA, ...)
prettyTree(x, groups, fill, X, O, indices, labels, show = rep(TRUE, length(x)), largs = list(cex = 0.75, bty = "n", yjust = 0.5), parargs = list(mar = c(bottom = 5, left = 2, top = 2, right = ifelse(is.null(largs), 2, 8)), xpd = NA), pointargs = list(), glyphs, shuffleGlyphs = NA, ...)
x |
an object of class phylo, eg |
groups |
a factor (or object coercible) to a factor assigning group identity to leaf nodes in x |
fill |
vector (logical or indices) of points to fill |
X |
vector of points to mark with an X |
O |
vector of points to mark with a circle |
indices |
label points with indices (all points if 'yes', or a subset indicated by a vector) |
labels |
character vector of tip labels in the same order as
|
show |
boolean vector of points to show |
largs |
arguments controlling appearance of the legend or NULL for no legend |
parargs |
arguments to pass par() |
pointargs |
arguments to pass points() (other than pch, col, bg) |
glyphs |
a |
shuffleGlyphs |
NA or an integer (argument to |
... |
passed to |
prettyTree
adds to a plot drawn by plot.phylo
Vectors specifying annotation should be in the order of row or column
labels of the distance matrix used to generate x
.
Plots to the active device; no return value.
See package vignette for additional examples.
Noah Hoffman
library(ape) data(seqs) data(seqdat) prettyTree(nj(dist.dna(seqs)), groups=seqdat$tax_name)
library(ape) data(seqs) data(seqdat) prettyTree(nj(dist.dna(seqs)), groups=seqdat$tax_name)
Read the manifest file from a refpackage and return a list containing the package contents.
refpkgContents(path, manifest = "CONTENTS.json")
refpkgContents(path, manifest = "CONTENTS.json")
path |
path to a refpkg directory |
manifest |
name of the manifest file |
Returns a list of lists. Run example(refpkgContents)
for details.
Noah Hoffman
The decsription and specification for a reference package can be found in the project repository in github: https://github.com/fhcrc/taxtastic
Scripts and tools for creating reference packages are provided in the
python package taxonomy
, also available from the taxtastic
project site.
archive <- 'vaginal_16s.refpkg.tar.gz' destdir <- tempdir() system(gettextf('tar -xzf %s --directory="%s"', system.file('extdata',archive,package='clstutils'), destdir)) refpkg <- file.path(destdir, sub('.tar.gz','',archive)) contents <- refpkgContents(refpkg) str(refpkg)
archive <- 'vaginal_16s.refpkg.tar.gz' destdir <- tempdir() system(gettextf('tar -xzf %s --directory="%s"', system.file('extdata',archive,package='clstutils'), destdir)) refpkg <- file.path(destdir, sub('.tar.gz','',archive)) contents <- refpkgContents(refpkg) str(refpkg)
Provides annotation for link{seqs}
, an aligned 16S
rRNA sequences representing three Enterococcus species.
data(seqdat)
data(seqdat)
A data frame with 200 observations on the following 5 variables.
seqname
a character vector
accession
a character vector containing GenBank accession numbers.
tax_id
a character vector
tax_name
a character vector
isType
a logical vector indicating if the sequence is from a type strain.
These sequences were downloaded from the Ribosomal Database Project website http://rdp.cme.msu.edu/
data(seqdat) with(seqdat,{ table(tax_name, isType) })
data(seqdat) with(seqdat,{ table(tax_name, isType) })
Aligned 16S rRNA sequences representing three Enterococcus species.
data(seqs)
data(seqs)
The format is: 'DNAbin' raw [1:200, 1:1848] - - - - ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:200] "S000001976" "S000008133" "S000013428" "S000127028" ... ..$ : NULL
These sequences were downloaded from the Ribosomal Database Project website http://rdp.cme.msu.edu/
data(seqs) seqs
data(seqs) seqs
Construct a data.frame providing the lineage of each sequence represented in the reference package.
taxonomyFromRefpkg(path, seqnames, lowest_rank = NA)
taxonomyFromRefpkg(path, seqnames, lowest_rank = NA)
path |
path to a refpkg directory |
seqnames |
optional character vector of sequence names. If
provided, determines the order of rows in |
lowest_rank |
name of the most specific (ie, rightmost) rank to
include. Default is the name of the rightmost column in
|
A list with the following elements:
taxNames |
a named character vector of taxonomic names (names are tax_ids) |
taxTab |
a |
Noah Hoffman
The decsription and specification for a reference package can be found in the project repository in github: https://github.com/fhcrc/taxtastic
Scripts and tools for creating reference packages are provided in the
python package taxonomy
, also available from the taxtastic
project site.
archive <- 'vaginal_16s.refpkg.tar.gz' destdir <- tempdir() system(gettextf('tar -xzf %s --directory="%s"', system.file('extdata',archive,package='clstutils'), destdir)) refpkg <- file.path(destdir, sub('.tar.gz','',archive)) reftax <- taxonomyFromRefpkg(refpkg) str(reftax)
archive <- 'vaginal_16s.refpkg.tar.gz' destdir <- tempdir() system(gettextf('tar -xzf %s --directory="%s"', system.file('extdata',archive,package='clstutils'), destdir)) refpkg <- file.path(destdir, sub('.tar.gz','',archive)) reftax <- taxonomyFromRefpkg(refpkg) str(reftax)
Provides objects (dists, paths) that can be used to calculate vectors of distances between an internal node and each leaf node. Also returns a square matrix of distances between leaf nodes.
treeDists(placefile, distfile)
treeDists(placefile, distfile)
placefile |
path to |
distfile |
path to output of |
A placement on an edge looks like this:
proximal | | d_p | |---- x | | d_d | | distal
d_p is the distance from the placement x to the proximal side of the edge, and d_d the distance to the distal side.
If the distance from x to a leaf y is an S-distance Q, then the path from x to y will go through the distal side of the edge and we will need to add d_d to Q to get the distance from x to y. If the distance from x to a leaf y is a P-distance Q, then the path from x to y will go through the proximal side of the edge, and we will need to subtract off d_d from Q to get the distance from x to y. In either case, we always need to add the length of the pendant edge, which is the second column.
To review, say the values of the two leftmost columns are a and b for a given placement x, and that it is on an edge i. We are interested in the distance of x to a leaf y, which is on edge j. We look at the distance matrix, entry (i,j), and say it is an S-distance Q. Then our distance is Q+a+b. If it is a P-distance Q, then the distance is Q-a+b.
The distances between leaves should always be P-distances, and there we need no trickery.
(thanks to Erick Matsen for this description)
A list with the following elements:
dists |
rectangular matrix of distances with rows corresponding to
all nodes in pplacer order, and columns corresponding to tips in the
order of the corresponding |
paths |
rectangular matrix in the same configuration as
|
dmat |
square matrix containing distances between pairs of tips. |
The output of this function is required for classifyPlacements
.
Noah Hoffman
Documentation for pplacer
and guppy
can be found here:
http://matsen.fhcrc.org/pplacer/
placefile <- system.file('extdata','merged.json', package='clstutils') distfile <- system.file('extdata','merged.distmat.bz2', package='clstutils') treedists <- treeDists(placefile, distfile) ## coordinates of a single placement placetab <- data.frame(at=49, edge=5.14909e-07, branch=5.14909e-07) ## dvects is a matrix in which each row corresponds to a vector of ## distances between a single placement along the edge of the reference ## tree used to generate 'distfile', and each column correspons to a ## reference sequence (ie, a terminal node). dvects <- with(placetab, { treedists$dists[at+1,,drop=FALSE] + treedists$paths[at+1,,drop=FALSE]*edge + branch })
placefile <- system.file('extdata','merged.json', package='clstutils') distfile <- system.file('extdata','merged.distmat.bz2', package='clstutils') treedists <- treeDists(placefile, distfile) ## coordinates of a single placement placetab <- data.frame(at=49, edge=5.14909e-07, branch=5.14909e-07) ## dvects is a matrix in which each row corresponds to a vector of ## distances between a single placement along the edge of the reference ## tree used to generate 'distfile', and each column correspons to a ## reference sequence (ie, a terminal node). dvects <- with(placetab, { treedists$dists[at+1,,drop=FALSE] + treedists$paths[at+1,,drop=FALSE]*edge + branch })