Title: | Handling and analysis of high-throughput microbiome census data |
---|---|
Description: | phyloseq provides a set of classes and tools to facilitate the import, storage, analysis, and graphical display of microbiome census data. |
Authors: | Paul J. McMurdie <[email protected]>, Susan Holmes <[email protected]>, with contributions from Gregory Jordan and Scott Chamberlain |
Maintainer: | Paul J. McMurdie <[email protected]> |
License: | AGPL-3 |
Version: | 1.51.0 |
Built: | 2024-11-30 03:35:16 UTC |
Source: | https://github.com/bioc/phyloseq |
There are already several ecology and phylogenetic packages available in R, including the adephylo, vegan, ade4, picante, ape, phangorn, phylobase, and OTUbase packages. These can already take advantage of many of the powerful statistical and graphics tools available in R. However, prior to phyloseq a user must devise their own methods for parsing the output of their favorite OTU clustering application, and, as a consequence, there is also no standard within Bioconductor (or R generally) for storing or sharing the suite of related data objects that describe a phylogenetic sequencing project. The phyloseq package seeks to address these issues by providing a related set of S4 classes that internally manage the handling tasks associated with organizing, linking, storing, and analyzing phylogenetic sequencing data. phyloseq additionally provides some convenience wrappers for input from common clustering applications, common analysis pipelines, and native implementation of methods that are not available in other R packages.
Paul J. McMurdie II [email protected]
See the documentation for the Extract
generic,
defined in the R base-package
for the expected behavior.
## S4 method for signature 'otu_table,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'sample_data,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'taxonomyTable,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'XStringSet,character,ANY,ANY' x[i]
## S4 method for signature 'otu_table,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'sample_data,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'taxonomyTable,ANY,ANY,ANY' x[i, j, ..., drop = TRUE] ## S4 method for signature 'XStringSet,character,ANY,ANY' x[i]
x |
object from which to extract element(s) or in which to replace element(s). |
i |
indices specifying elements to extract or replace. Indices are
For When indexing arrays by An index value of |
j |
See |
... |
See |
drop |
For matrices and arrays. If |
One special exception to standard behavior of these methods in phyloseq is that
the drop
argument is set internally to FALSE
.
This helps avoid bugs during complicated subsetting with multiple components,
where it is necessary to be able to use a two dimensional indexing even
if one of those dimensions has only 1 rank.
Put another way, these phyloseq-defined extractions never collapse their result
into a vector. See the documentation of Extract
for
more information about the drop
argument.
data(esophagus) nrow(otu_table(esophagus)) nrow(otu_table(esophagus)[1:5, ])
data(esophagus) nrow(otu_table(esophagus)) nrow(otu_table(esophagus)[1:5, ])
This function is used internally by many accessors and in
many functions/methods that need to access a particular type of component data.
If something is wrong, or the slot is missing, the expected behavior is that
this function will return NULL. Thus, the output can be tested by
is.null
as verification of the presence of a particular
data component. Unlike the component-specific accessors (e.g. otu_table
,
or phy_tree
),
the default behavior is not to stop with an error if the desired slot is empty.
In all cases this is controlled by the errorIfNULL
argument, which can
be set to TRUE
if an error is desired.
access(physeq, slot, errorIfNULL=FALSE)
access(physeq, slot, errorIfNULL=FALSE)
physeq |
(Required). |
slot |
(Required). A character string indicating the slot (not data class) of the component data type that is desired. |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
Returns the component object specified by the argument slot
.
Returns NULL if slot does not exist. Returns physeq
as-is
if it is a component class that already matches the slot name.
getslots.phyloseq
, merge_phyloseq
# ## data(GlobalPatterns) ## access(GlobalPatterns, "tax_table") ## access(GlobalPatterns, "phy_tree") ## access(otu_table(GlobalPatterns), "otu_table") ## # Should return NULL: ## access(otu_table(GlobalPatterns), "sample_data") ## access(otuTree(GlobalPatterns), "sample_data") ## access(otuSam(GlobalPatterns), "phy_tree")
# ## data(GlobalPatterns) ## access(GlobalPatterns, "tax_table") ## access(GlobalPatterns, "phy_tree") ## access(otu_table(GlobalPatterns), "otu_table") ## # Should return NULL: ## access(otu_table(GlobalPatterns), "sample_data") ## access(otuTree(GlobalPatterns), "sample_data") ## access(otuSam(GlobalPatterns), "phy_tree")
tax_table
from a named possibly-jagged listBuild a tax_table
from a named possibly-jagged list
build_tax_table(taxlist)
build_tax_table(taxlist)
taxlist |
(Required). A list in which each element is a vector of taxonomic assignments named by rank. Every element of every vector must be named by the rank it represents. Every element of the list (every vector) should correspond to a single OTU and be named for that OTU. |
A tax_table
(taxonomyTable-class
) that
has been built from taxlist
. The OTU names of this output will be
the element names of taxlist
, and a separate taxonomic rank
(column) will be included for each unique rank found among the element names
of each vector in the list. NA_character_
is the default value of
elements in the tax_table
for which there is no corresponding
information in taxlist
.
taxvec1 = c("Root", "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Staphylococcaceae") parse_taxonomy_default(taxvec1) parse_taxonomy_greengenes(taxvec1) taxvec2 = c("Root;k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae") parse_taxonomy_qiime(taxvec2) taxlist1 = list(OTU1=parse_taxonomy_greengenes(taxvec1), OTU2=parse_taxonomy_qiime(taxvec2)) taxlist2 = list(OTU1=parse_taxonomy_default(taxvec1), OTU2=parse_taxonomy_qiime(taxvec2)) build_tax_table(taxlist1) build_tax_table(taxlist2)
taxvec1 = c("Root", "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Staphylococcaceae") parse_taxonomy_default(taxvec1) parse_taxonomy_greengenes(taxvec1) taxvec2 = c("Root;k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae") parse_taxonomy_qiime(taxvec2) taxlist1 = list(OTU1=parse_taxonomy_greengenes(taxvec1), OTU2=parse_taxonomy_qiime(taxvec2)) taxlist2 = list(OTU1=parse_taxonomy_default(taxvec1), OTU2=parse_taxonomy_qiime(taxvec2)) build_tax_table(taxlist1) build_tax_table(taxlist2)
Published in Nature in early 2011, this work compared (among other things), the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. Authors further compared these microbial communities with the faecal communities of subjects from other studies. A total of 280 faecal samples / subjects are represented in this dataset, and 553 genera. The authors claim that the data naturally clumps into three community-level clusters, or “enterotypes”, that are not immediately explained by sequencing technology or demographic features of the subjects, but with potential relevance to understanding human gut microbiota.
abstract from research article (quoted):
Our knowledge of species and functional composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about variation across the world. By combining 22 newly sequenced faecal metagenomes of individuals from four countries with previously published data sets, here we identify three robust clusters (referred to as enterotypes hereafter) that are not nation or continent specific. We also confirmed the enterotypes in two published, larger cohorts, indicating that intestinal microbiota variation is generally stratified, not continuous. This indicates further the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis to understand microbial communities. Although individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.
(end quote)
Arumugam, M., Raes, J., et al.
Arumugam, M., et al. (2011). Enterotypes of the human gut microbiome.
Nature, 473(7346), 174-180.
http://www.nature.com/doifinder/10.1038/nature09944 See supplemental information for subject data.
OTU-clustered data was downloaded from the publicly-accessible:
http://www.bork.embl.de/Docu/Arumugam_et_al_2011/downloads.html
data(enterotype) ig <- make_network(enterotype, "samples", max.dist=0.3) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
data(enterotype) ig <- make_network(enterotype, "samples", max.dist=0.3) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
Includes just 3 samples, 1 each from 3 subjects. Although the research article mentions 4 subjects, only 3 are included in this dataset.
abstract from research article (quoted):
The esophagus, like other luminal organs of the digestive system, provides a potential environment for bacterial colonization, but little is known about the presence of a bacterial biota or its nature. By using broad-range 16S rDNA PCR, biopsies were examined from the normal esophagus of four human adults. The 900 PCR products cloned represented 833 unique sequences belonging to 41 genera, or 95 species-level operational taxonomic units (SLOTU); 59 SLOTU were homologous with culture-defined bacterial species, 34 with 16S rDNA clones, and two were not homologous with any known bacterial 16S rDNA. Members of six phyla, Firmicutes, Bacteroides, Actinobacteria, Proteobacteria, Fusobacteria, and TM7, were represented. A large majority of clones belong to 13 of the 41 genera (783/900, 87%), or 14 SLOTU (574/900, 64%) that were shared by all four persons. Streptococcus (39%), Prevotella (17%), and Veilonella (14%) were most prevalent. The present study identified 56-79% of SLOTU in this bacterial ecosystem. Most SLOTU of esophageal biota are similar or identical to residents of the upstream oral biota, but the major distinction is that a large majority (82%) of the esophageal bacteria are known and cultivable. These findings provide evidence for a complex but conserved bacterial population in the normal distal esophagus.
(end quote)
A description of the 16S rRNA sequence processing can be found on the mothur-wiki
at the link below. A cutoff of 0.10 was used for OTU clustering in that example,
and it is taken here as well to create example data, esophagus
, which was
easily imported with the import_mothur()
function.
Pei et al. [email protected]
Pei, Z., Bini, E. J., Yang, L., Zhou, M., Francois, F., & Blaser, M. J. (2004). Bacterial biota in the human distal esophagus. Proceedings of the National Academy of Sciences of the United States of America, 101(12), 4250-4255. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC384727
mothur-processed files and the sequence data can be downloaded from a zip-file, along with additional description, from the following URL: http://www.mothur.org/wiki/Esophageal_community_analysis
data(esophagus) UniFrac(esophagus, weighted=TRUE) # How to re-create the esophagus dataset using import_mothur function mothlist <- system.file("extdata", "esophagus.fn.list.gz", package="phyloseq") mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package="phyloseq") mothtree <- system.file("extdata", "esophagus.tree.gz", package="phyloseq") show_mothur_cutoffs(mothlist) cutoff <- "0.10" esophman <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
data(esophagus) UniFrac(esophagus, weighted=TRUE) # How to re-create the esophagus dataset using import_mothur function mothlist <- system.file("extdata", "esophagus.fn.list.gz", package="phyloseq") mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package="phyloseq") mothtree <- system.file("extdata", "esophagus.tree.gz", package="phyloseq") show_mothur_cutoffs(mothlist) cutoff <- "0.10" esophman <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
Published in PNAS in early 2011. This work compared the microbial communities from 25 environmental samples and three known “mock communities” – a total of 9 sample types – at a depth averaging 3.1 million reads per sample. Authors were able to reproduce diversity patterns seen in many other published studies, while also invesitigating technical issues/bias by applying the same techniques to simulated microbial communities of known composition.
abstract from research article (quoted):
The ongoing revolution in high-throughput sequencing continues to democratize the ability of small groups of investigators to map the microbial component of the biosphere. In particular, the coevolution of new sequencing platforms and new software tools allows data acquisition and analysis on an unprecedented scale. Here we report the next stage in this coevolutionary arms race, using the Illumina GAIIx platform to sequence a diverse array of 25 environmental samples and three known “mock communities” at a depth averaging 3.1 million reads per sample. We demonstrate excellent consistency in taxonomic recovery and recapture diversity patterns that were previously reported on the basis of metaanalysis of many studies from the literature (notably, the saline/nonsaline split in environmental samples and the split between host-associated and free-living communities). We also demonstrate that 2,000 Illumina single-end reads are sufficient to recapture the same relationships among samples that we observe with the full dataset. The results thus open up the possibility of conducting large-scale studies analyzing thousands of samples simultaneously to survey microbial communities at an unprecedented spatial and temporal resolution.
(end quote)
Many thanks to J. Gregory Caporaso for directly providing the OTU-clustered data files for inclusion in this package.
Caporaso, J. G., et al.
Caporaso, J. G., et al. (2011). Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. PNAS, 108, 4516-4522. PMCID: PMC3063599
The primary article can be viewed/downloaded at: http://www.pnas.org/content/108/suppl.1/4516.short
The examples on the phyloseq wiki page for plot_ordination
show
many more examples:
https://github.com/joey711/phyloseq/wiki/plot_ordination
data(GlobalPatterns) plot_richness(GlobalPatterns, x="SampleType", measures=c("Observed", "Chao1", "Shannon"))
data(GlobalPatterns) plot_richness(GlobalPatterns, x="SampleType", measures=c("Observed", "Chao1", "Shannon"))
Published in early 2011, this work compared 24 separate soil microbial communities under four treatment conditions via multiplexed/barcoded 454-pyrosequencing of PCR-amplified 16S rRNA gene fragments. The authors found differences in the composition and structure of microbial communities between soil treatments. As expected, the soil microbial communities were highly diverse, with a staggering 16,825 different OTUs (species) observed in the included dataset. Interestingly, this study used a larger number of replicates than previous studies of this type, for a total of 56 samples, and the putatively low resampling rate of species between replicated sequencing trials (“OTU overlap”) was a major concern by the authors.
This dataset contains an experiment-level (phyloseq-class
) object,
which in turn contains the taxa-contingency table and soil-treatment table
as otu_table-class
and sample_data-class
components, respectively.
This data was
imported from raw files supplied directly by the authors via personal communication
for the purposes of including as an example in the phyloseq-package
.
As this data is sensitive to choices in OTU-clustering parameters, attempts to recreate
the otu_table
from the raw sequencing data may give slightly different results
than the table provided here.
abstract from research article (quoted):
To determine the reproducibility and quantitation of the amplicon sequencing-based
detection approach for analyzing microbial community structure, a total of 24 microbial
communities from a long-term global change experimental site were examined. Genomic DNA
obtained from each community was used to amplify 16S rRNA genes with two or three
barcode tags as technical replicates in the presence of a small quantity (0.1% wt/wt)
of genomic DNA from Shewanella oneidensis MR-1 as the control. The technical
reproducibility of the amplicon sequencing-based detection approach is quite low,
with an average operational taxonomic unit (OTU) overlap of 17.2%+/-
2.3%
between two technical replicates, and 8.2%+/-
2.3% among three technical
replicates, which is most likely due to problems associated with random sampling processes.
Such variations in technical replicates could have substantial effects on estimating
beta-diversity but less on alpha-diversity. A high variation was also observed in the
control across different samples (for example, 66.7-fold for the forward primer),
suggesting that the amplicon sequencing-based detection approach could not be quantitative.
In addition, various strategies were examined to improve the comparability of amplicon
sequencing data, such as increasing biological replicates, and removing singleton sequences
and less-representative OTUs across biological replicates. Finally, as expected, various
statistical analyses with preprocessed experimental data revealed clear differences in
the composition and structure of microbial communities between warming and non-warming,
or between clipping and non-clipping. Taken together, these results suggest that amplicon
sequencing-based detection is useful in analyzing microbial community structure even
though it is not reproducible and quantitative. However, great caution should be taken
in experimental design and data interpretation when the amplicon sequencing-based detection
approach is used for quantitative analysis of the beta-diversity of microbial communities.
(end quote)
Jizhong Zhou, et al.
Zhou, J., Wu, L., Deng, Y., Zhi, X., Jiang, Y.-H., Tu, Q., Xie, J., et al.
Reproducibility and quantitation of amplicon sequencing-based detection.
The ISME Journal. (2011) 5(8):1303-1313. doi:10.1038/ismej.2011.11
The article can be accessed online at http://www.nature.com/ismej/journal/v5/n8/full/ismej201111a.html
# Load the data data(soilrep) ################################################################################ # Alpha diversity (richness) example. Accept null hypothesis: # No convincing difference in species richness between warmed/unwarmed soils. ################################################################################ # Graphically compare richness between the different treatments. man.col <- c(WC="red", WU="brown", UC="blue", UU="darkgreen") plot_richness(soilrep, x="Treatment", color="Treatment", measures=c("Observed", "Chao1", "Shannon"))
# Load the data data(soilrep) ################################################################################ # Alpha diversity (richness) example. Accept null hypothesis: # No convincing difference in species richness between warmed/unwarmed soils. ################################################################################ # Graphically compare richness between the different treatments. man.col <- c(WC="red", WU="brown", UC="blue", UU="darkgreen") plot_richness(soilrep, x="Treatment", color="Treatment", measures=c("Observed", "Chao1", "Shannon"))
dist
class.See dist
for details
about this type of a distance matrix object.
Takes a phyloseq-class
object and method option, and returns
a dist
ance object suitable for certain
ordination methods and other distance-based analyses.
Only
sample-wise distances are currently supported (the type
argument),
but eventually species-wise (OTU-wise)
distances may be supported as well.
distance(physeq, method, type = "samples", ...) ## S4 method for signature 'phyloseq,ANY' distance(physeq, method) ## S4 method for signature 'otu_table,character' distance(physeq, method, type = "samples", ...) ## S4 method for signature 'phyloseq,character' distance(physeq, method, type = "samples", ...)
distance(physeq, method, type = "samples", ...) ## S4 method for signature 'phyloseq,ANY' distance(physeq, method) ## S4 method for signature 'otu_table,character' distance(physeq, method, type = "samples", ...) ## S4 method for signature 'phyloseq,character' distance(physeq, method, type = "samples", ...)
physeq |
(Required). A |
method |
(Required). A character string.
Provide one of the currently supported options.
See Note that for the common definition of The following methods are implemented explicitly within
the
Alternatively, you can provide
a character string that defines a custom distance method, if it has the form
described in |
type |
(Optional). A character string. The type of pairwise comparisons
being calculated: sample-wise or taxa-wise. The default is
|
... |
Additional arguments passed on to the appropriate distance
function, determined by the |
Depending on the method
argument, distance()
wraps one of
UniFrac
,
DPCoA
,
JSD
,
vegdist
,
betadiver
,
designdist
, or
dist
.
An object of class “dist
” suitable for certain
ordination methods and other distance-based analyses.
plot_ordination
,
UniFrac
,
DPCoA
,
JSD
,
vegdist
,
betadiver
,
designdist
,
dist
.
data(esophagus) distance(esophagus, "uunifrac") # Unweighted UniFrac distance(esophagus, "wunifrac") # weighted UniFrac distance(esophagus, "jaccard", binary = TRUE) # vegdist jaccard distance(esophagus, "gower") # vegdist option "gower" distance(esophagus, "g") # designdist method option "g" distance(esophagus, "minkowski") # invokes a method from the base dist() function. distance(esophagus, "(A+B-2*J)/(A+B)") # designdist custom distance distanceMethodList help("distance")
data(esophagus) distance(esophagus, "uunifrac") # Unweighted UniFrac distance(esophagus, "wunifrac") # weighted UniFrac distance(esophagus, "jaccard", binary = TRUE) # vegdist jaccard distance(esophagus, "gower") # vegdist option "gower" distance(esophagus, "g") # designdist method option "g" distance(esophagus, "minkowski") # invokes a method from the base dist() function. distance(esophagus, "(A+B-2*J)/(A+B)") # designdist custom distance distanceMethodList help("distance")
distance
Distance methods should be specified by exact string match. Cannot do partial matching for all options, because too many similar options in downstream method dispatch.
distanceMethodList
distanceMethodList
A list of character vectors. Every entry specifies a supported distance method. Names in the list indicate which downstream function is being utilized for further details. Same functions are linked in the itemized list below.
unifrac
wunifrac
dpcoa
jsd
manhattan
euclidean
canberra
bray
kulczynski
jaccard
gower
altGower
morisita
horn
mountford
raup
binomial
chao
cao
w
-
c
wb
r
I
e
t
me
j
sor
m
-
co
cc
g
-
l
hk
rlb
sim
gl
z
maximum
binary
minkowski
ANY
distanceMethodList
distanceMethodList
Function uses abundance (otu_table-class
) and
phylogenetic (phylo
) components of a
phyloseq-class
experiment-level object
to perform a
Double Principle Coordinate Analysis (DPCoA), relying heavily on
the underlying (and more general) function, dpcoa
.
The distance object ultimately provided is the square root of the
cophenetic/patristic (cophenetic.phylo
) distance
between the species, which is always Euclidean.
Although this distance is Euclidean, for numerical reasons it
will sometimes look non-Euclidean, and a correction will be performed.
See correction
argument.
DPCoA(physeq, correction = cailliez, scannf = FALSE, ...)
DPCoA(physeq, correction = cailliez, scannf = FALSE, ...)
physeq |
(Required). A |
correction |
(Optional). A function. The function must be
able to take a non-Euclidean Although the distance matrix should always be Euclidean, for numerical
reasons it will sometimes appear non-Euclidean and a correction method must
be applied. Two recommended correction methods are
|
scannf |
(Optional). Logical. Default is |
... |
Additional arguments passed to |
A dpcoa
-class object (see dpcoa
).
Julia Fukuyama [email protected]. Adapted for phyloseq by Paul J. McMurdie.
Pavoine, S., Dufour, A.B. and Chessel, D. (2004) From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology, 228, 523-537.
# # # # # # Esophagus data(esophagus) eso.dpcoa <- DPCoA(esophagus) eso.dpcoa plot_ordination(esophagus, eso.dpcoa, "samples") plot_ordination(esophagus, eso.dpcoa, "species") plot_ordination(esophagus, eso.dpcoa, "biplot") # # # # # # # # GlobalPatterns data(GlobalPatterns) # subset GP to top-150 taxa (to save computation time in example) keepTaxa <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:150]) GP <- prune_taxa(keepTaxa, GlobalPatterns) # Perform DPCoA GP.dpcoa <- DPCoA(GP) plot_ordination(GP, GP.dpcoa, color="SampleType")
# # # # # # Esophagus data(esophagus) eso.dpcoa <- DPCoA(esophagus) eso.dpcoa plot_ordination(esophagus, eso.dpcoa, "samples") plot_ordination(esophagus, eso.dpcoa, "species") plot_ordination(esophagus, eso.dpcoa, "biplot") # # # # # # # # GlobalPatterns data(GlobalPatterns) # subset GP to top-150 taxa (to save computation time in example) keepTaxa <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:150]) GP <- prune_taxa(keepTaxa, GlobalPatterns) # Perform DPCoA GP.dpcoa <- DPCoA(GP) plot_ordination(GP, GP.dpcoa, color="SampleType")
Performs a number of standard alpha diversity estimates,
and returns the results as a data.frame
.
Strictly speaking, this function is not only estimating richness,
despite its name.
It can operate on the cumulative population of all
samples in the dataset, or by repeating the richness estimates for each
sample individually.
NOTE: You must use untrimmed datasets
for meaningful results, as these estimates (and even the “observed” richness)
are highly dependent on the number of singletons. You can always trim the data
later on if needed, just not before using this function.
estimate_richness(physeq, split = TRUE, measures = NULL)
estimate_richness(physeq, split = TRUE, measures = NULL)
physeq |
(Required). |
split |
(Optional). Logical. Should a separate set of richness estimates be performed for each sample? Or alternatively, pool all samples and estimate richness of the entire set. |
measures |
(Optional). Default is |
A data.frame
of the richness estimates, and their standard error.
Check out the custom plotting function, plot_richness
,
for easily showing the results of different estimates,
with method-specific error-bars.
Also check out the internal functions borrowed from the vegan
package:
## There are many more interesting examples at the phyloseq online tutorials. ## http://joey711.github.com/phyloseq/plot_richness-examples data("esophagus") # Default is all available measures estimate_richness(esophagus) # Specify just one: estimate_richness(esophagus, measures="Observed") # Specify a few: estimate_richness(esophagus, measures=c("Observed", "InvSimpson", "Shannon", "Chao1"))
## There are many more interesting examples at the phyloseq online tutorials. ## http://joey711.github.com/phyloseq/plot_richness-examples data("esophagus") # Default is all available measures estimate_richness(esophagus) # Specify just one: estimate_richness(esophagus, measures="Observed") # Specify a few: estimate_richness(esophagus, measures=c("Observed", "InvSimpson", "Shannon", "Chao1"))
Creates the environment table that is needed for the original UniFrac
algorithm. Useful for cross-checking, or if want to use UniFrac server.
Optionally the ENV-formatted table can be returned to the R
workspace, and the tree component can be exported as Nexus format
(Recommended).
export_env_file(physeq, file = "", writeTree = TRUE, return = FALSE)
export_env_file(physeq, file = "", writeTree = TRUE, return = FALSE)
physeq |
(Required). Experiment-level ( |
file |
(Optional). The file path for export. If not-provided, the
expectation is that you will want to set |
writeTree |
(Optional). Write the phylogenetic tree as well as the
the ENV table. Default is |
return |
(Optional). Should the ENV table be returned to the R workspace?
Default is |
# # Load example data # data(esophagus) # export_env_file(esophagus, "~/Desktop/esophagus.txt")
# # Load example data # data(esophagus) # export_env_file(esophagus, "~/Desktop/esophagus.txt")
.names
and .dist
files for mothurThe purpose of this function is to allow a user to easily export a distance object
as a pair of files that can be immediately imported by mothur for OTU clustering
and related analysis. A distance object can be created in R
in a number of
ways, including via cataloguing the cophentic distances of a tree object.
export_mothur_dist(x, out=NULL, makeTrivialNamesFile=NULL)
export_mothur_dist(x, out=NULL, makeTrivialNamesFile=NULL)
x |
(Required). A |
out |
(Optional). The desired output filename for the |
makeTrivialNamesFile |
(Optional). Default |
A character vector of the different cutoff values contained in the file.
For a given set of arguments to the cluster()
command from within
mothur, a number of OTU-clustering results are returned in the same
list file. The exact cutoff values used by mothur can vary depending
on the input data. This simple function returns the cutoffs that were actually
included in the mothur output. This an important extra step prior to
importing the OTUs with the import_mothur_otulist()
function.
# data(esophagus) myDistObject <- as.dist(ape::cophenetic.phylo(phy_tree(esophagus))) export_mothur_dist(myDistObject)
# data(esophagus) myDistObject <- as.dist(ape::cophenetic.phylo(phy_tree(esophagus))) export_mothur_dist(myDistObject)
This function is directly analogous to the
genefilter
function for microarray filtering,
but is used for filtering OTUs from phyloseq objects.
It applies an arbitrary set of functions —
as a function list, for instance, created by filterfun
—
as across-sample criteria, one OTU at a time.
It takes as input a phyloseq object,
and returns a logical vector
indicating whether or not each OTU passed the criteria.
Alternatively, if the "prune"
option is set to FALSE
,
it returns the already-trimmed version of the phyloseq object.
filter_taxa(physeq, flist, prune=FALSE)
filter_taxa(physeq, flist, prune=FALSE)
physeq |
(Required). A |
flist |
(Required). A function or list of functions that take a vector
of abundance values and return a logical. Some canned useful function types
are included in the |
prune |
(Optional). A logical. Default |
A logical vector equal to the number of taxa in physeq
.
This can be provided directly to prune_taxa
as first argument.
Alternatively, if prune==TRUE
, the pruned phyloseq-class
object is returned instead.
filterfun
,
genefilter_sample
,
filterfun_sample
data("enterotype") require("genefilter") flist <- filterfun(kOverA(5, 2e-05)) ent.logi <- filter_taxa(enterotype, flist) ent.trim <- filter_taxa(enterotype, flist, TRUE) identical(ent.trim, prune_taxa(ent.logi, enterotype)) identical(sum(ent.logi), ntaxa(ent.trim)) filter_taxa(enterotype, flist, TRUE)
data("enterotype") require("genefilter") flist <- filterfun(kOverA(5, 2e-05)) ent.logi <- filter_taxa(enterotype, flist) ent.trim <- filter_taxa(enterotype, flist, TRUE) identical(ent.trim, prune_taxa(ent.logi, enterotype)) identical(sum(ent.logi), ntaxa(ent.trim)) filter_taxa(enterotype, flist, TRUE)
filterfun
.See the filterfun
, from the Bioconductor repository,
for a taxa-/gene-wise filter (and further examples).
filterfun_sample(...)
filterfun_sample(...)
... |
A comma-separated list of functions. |
An enclosure (function) that itself will return a logical vector, according to the functions provided in the argument list, evaluated in order. The output of filterfun_sample is appropriate for the ‘flist’ argument to the genefilter_sample method.
# Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) f1 <- filterfun_sample(topk(2)) wh1 <- genefilter_sample(testOTU, f1, A=2) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
# Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) f1 <- filterfun_sample(topk(2)) wh1 <- genefilter_sample(testOTU, f1, A=2) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
This is a wrapper for the clusGap
function,
expecting an ordination result as the main data argument.
gapstat_ord(ord, axes = c(1:2), type = "sites", FUNcluster = function(x, k) { list(cluster = pam(x, k, cluster.only = TRUE)) }, K.max = 8, ...)
gapstat_ord(ord, axes = c(1:2), type = "sites", FUNcluster = function(x, k) { list(cluster = pam(x, k, cluster.only = TRUE)) }, K.max = 8, ...)
ord |
(Required). An ordination object. The precise class can vary.
Any ordination classes supported internally by the phyloseq package
should work, ultimately by passing to the |
axes |
(Optional). The ordination axes that you want to include. |
type |
(Optional). One of |
FUNcluster |
(Optional). This is passed to
Any function that has these input/output properties (performing a clustering) will suffice. The more appropriate the clustering method, the better chance your gap statistic results will be useful. |
K.max |
(Optional). A single positive integer value.
It indicates the maximum number of clusters that will be considered.
Value must be at least two.
This is passed to |
... |
(Optional). Additional named parameters
passed on to |
An object of S3 class "clusGap"
, basically a list with components.
See the clusGap
documentation for more details.
data("soilrep") sord = ordinate(soilrep, "PCoA", "bray") # Evaluate axes with scree plot plot_scree(sord) # Gap Statistic gs = gapstat_ord(sord, axes=1:3, verbose=FALSE) # plot_ordination(soilrep, sord, color="Treatment") plot_clusgap(gs) print(gs, method="Tibs2001SEmax")
data("soilrep") sord = ordinate(soilrep, "PCoA", "bray") # Evaluate axes with scree plot plot_scree(sord) # Gap Statistic gs = gapstat_ord(sord, axes=1:3, verbose=FALSE) # plot_ordination(soilrep, sord, color="Treatment") plot_clusgap(gs) print(gs, method="Tibs2001SEmax")
A general OTU trimming function for selecting OTUs that satisfy
some criteria within the distribution of each sample, and then
also an additional criteria for number of samples that must pass.
This is a genefilter-like function that only considers sample-wise
criteria. The number of acceptable samples is used
as the final criteria (set by the argument A
)
to determine whether or not the taxa should
be retained (TRUE
) or not (FALSE
). Just like with genefilter, a
logical having length equal to nrow()/ntaxa
is returned, indicating which
should be kept. This output can be provided
directly to OTU trimming function, prune_taxa
.
By contrast, genefilter
,
of the genefilter package in Bioconductor,
works only on the rows of a matrix. Note that, because otu_table-class
inherits directly from the matrix-class
, an unmodified
otu_table can be provided to genefilter
, but be mindful of the orientation
of the otu_table (use taxa_are_rows
),
and transpose (t
) if needed.
genefilter_sample(X, flist, A=1) ## S4 method for signature 'matrix' genefilter_sample(X, flist, A = 1) ## S4 method for signature 'otu_table' genefilter_sample(X, flist, A = 1) ## S4 method for signature 'phyloseq' genefilter_sample(X, flist, A = 1)
genefilter_sample(X, flist, A=1) ## S4 method for signature 'matrix' genefilter_sample(X, flist, A = 1) ## S4 method for signature 'otu_table' genefilter_sample(X, flist, A = 1) ## S4 method for signature 'phyloseq' genefilter_sample(X, flist, A = 1)
X |
The object that needs trimming. Can be matrix, otu_table, or higher- order phyloseq classes that contain an otu_table. |
flist |
An enclosure object, typically created with |
A |
An integer. The number of samples in which a taxa / OTUs passed the filter for it to be labeled TRUE in the output logical vector. |
A logical vector with names equal to taxa_names (or rownames, if matrix).
genefilter
, filterfun_sample
,
t
,
prune_taxa
# ## testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) ## f1 <- filterfun_sample(topk(2)) ## wh1 <- genefilter_sample(testOTU, f1, A=2) ## wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) ## prune_taxa(wh1, testOTU) ## prune_taxa(wh2, testOTU) ## ## tax_table1 <- tax_table(matrix("abc", 5, 5)) ## prune_taxa(wh1, tax_table1) ## prune_taxa(wh2, tax_table1)
# ## testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) ## f1 <- filterfun_sample(topk(2)) ## wh1 <- genefilter_sample(testOTU, f1, A=2) ## wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) ## prune_taxa(wh1, testOTU) ## prune_taxa(wh2, testOTU) ## ## tax_table1 <- tax_table(matrix("abc", 5, 5)) ## prune_taxa(wh1, tax_table1) ## prune_taxa(wh2, tax_table1)
i
.This is a simple accessor function for investigating a single species-of-interest.
get_sample(physeq, i) ## S4 method for signature 'otu_table' get_sample(physeq, i) ## S4 method for signature 'phyloseq' get_sample(physeq, i)
get_sample(physeq, i) ## S4 method for signature 'otu_table' get_sample(physeq, i) ## S4 method for signature 'phyloseq' get_sample(physeq, i)
physeq |
(Required). |
i |
(Required). A single taxa/species/OTU ID for which you want to know the abundance in each sample. |
An integer vector of the abundance values for
each sample in physeq
for species i
get_taxa
taxa_names
sample_names
data(esophagus) taxa_names(esophagus) get_sample(esophagus, "59_5_19")
data(esophagus) taxa_names(esophagus) get_sample(esophagus, "59_5_19")
i
.This is a simple accessor function for investigating a single sample-of-interest.
get_taxa(physeq, i) ## S4 method for signature 'otu_table' get_taxa(physeq, i) ## S4 method for signature 'phyloseq' get_taxa(physeq, i)
get_taxa(physeq, i) ## S4 method for signature 'otu_table' get_taxa(physeq, i) ## S4 method for signature 'phyloseq' get_taxa(physeq, i)
physeq |
(Required). |
i |
(Required). A single sample for which you want to know the abundance of each species. Can be integer for index value, or sample name. |
An integer vector of the abundance values for
each species in physeq
for sample i
get_sample
taxa_names
sample_names
data(esophagus) sample_names(esophagus) get_taxa(esophagus, "B")
data(esophagus) sample_names(esophagus) get_taxa(esophagus, "B")
This is a simple accessor function to make it more convenient to determine
the different taxa present for a particular taxonomic rank
in a given phyloseq-class
object.
get_taxa_unique(physeq, taxonomic.rank=rank_names(physeq)[1], errorIfNULL=TRUE)
get_taxa_unique(physeq, taxonomic.rank=rank_names(physeq)[1], errorIfNULL=TRUE)
physeq |
(Required). |
taxonomic.rank |
(Optional). Character. The taxonomic rank to use. Must select
from the set indicated by |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
Character vector. Unique vector of the observed taxa at a particular taxonomic rank
get_taxa
taxa_names
sample_names
data(enterotype) get_taxa_unique(enterotype) data(GlobalPatterns) get_taxa_unique(GlobalPatterns, "Family")
data(enterotype) get_taxa_unique(enterotype) data(GlobalPatterns) get_taxa_unique(GlobalPatterns, "Family")
This is a simple accessor function for streamlining access to values/vectors/factors/etc contained in the sample_data.
get_variable(physeq, varName)
get_variable(physeq, varName)
physeq |
(Required). |
varName |
(Required). Character string of the variable name in |
Data. The clas of the data depends on what the contents of sample_data.
get_taxa
taxa_names
sample_names
# Load the GlobalPatterns dataset into the workspace environment data(GlobalPatterns) # Look at the different values for SampleType get_variable(GlobalPatterns, "SampleType")
# Load the GlobalPatterns dataset into the workspace environment data(GlobalPatterns) # Look at the different values for SampleType get_variable(GlobalPatterns, "SampleType")
Like getSlots
, but returns the class name if argument
is component data object.
getslots.phyloseq(physeq)
getslots.phyloseq(physeq)
physeq |
A |
identical to getSlots. A named character vector of the slot classes
of a particular S4 class, where each element is named by the slot name it
represents. If physeq
is a component data object,
then a vector of length (1) is returned, named according to its slot name in
the phyloseq-class
.
merge_phyloseq
# data(GlobalPatterns) getslots.phyloseq(GlobalPatterns) data(esophagus) getslots.phyloseq(esophagus)
# data(GlobalPatterns) getslots.phyloseq(GlobalPatterns) data(esophagus) getslots.phyloseq(esophagus)
A user must still understand the additional arguments required for each
type of import data. Those arguments are described in detail at the
tool-specific import_*
links below. Each clustering tool / package / pipeline
has its own idiosyncratic set of file names / types, and it remains the
responsibility of the user to understand which file-path should be provided
to each argument for the particular importing submethod. This method
merely provides a central documentation and method-name, and the arguments
are passed along as-is.
import(pipelineName, ...)
import(pipelineName, ...)
pipelineName |
(Required). Character string. The name of the
analysis tool / pipeline / package
that created the OTU-cluster data or other data that you now want to import.
Current options are |
... |
(Required). Additional named arguments providing file paths, and possible other paramaters to the desired tool-specific import function. |
In most cases a phyloseq-class
will be returned, though
the included component data will vary by pipeline/tool, and also
by the types of data files provided.
The expected behavior is to return the most-comprehensive object possible,
given the provided arguments and pipeline/tool.
BIOM: http://www.biom-format.org/
mothur: http://www.mothur.org/wiki/Main_Page
PyroTagger: http://pyrotagger.jgi-psf.org/
QIIME: http://qiime.org/
RDP pipeline: http://pyro.cme.msu.edu/index.jsp
For BIOM format, see:
import_biom
For mothur, see:
import_mothur
Separate tools for mothur are also:
show_mothur_cutoffs
import_mothur_dist
export_mothur_dist
For PyroTagger, see:
import_pyrotagger_tab
For QIIME legacy format, see:
import_qiime
For RDP pipeline, see:
import_RDP_cluster
## See documentation of a specific import function
## See documentation of a specific import function
New versions of QIIME produce a more-comprehensive and formally-defined JSON file format, called biom file format:
import_biom(BIOMfilename, treefilename=NULL, refseqfilename=NULL, refseqFunction=readDNAStringSet, refseqArgs=NULL, parseFunction=parse_taxonomy_default, parallel=FALSE, version=1.0, ...)
import_biom(BIOMfilename, treefilename=NULL, refseqfilename=NULL, refseqFunction=readDNAStringSet, refseqArgs=NULL, parseFunction=parse_taxonomy_default, parallel=FALSE, version=1.0, ...)
BIOMfilename |
(Required). A character string indicating the
file location of the BIOM formatted file. This is a JSON formatted file,
specific to biological datasets, as described in
http://www.qiime.org/svn_documentation/documentation/biom_format.htmlthe biom-format home page.
In principle, this file should include you OTU abundance data (OTU table),
your taxonomic classification data (taxonomy table), as well as your
sample data, for instance what might be in your “sample map” in QIIME.
A phylogenetic tree is not yet supported by biom-format, and so is a
separate argument here. If, for some reason, your biom-format file is
missing one of these mentioned data types but you have it in a separate file,
you can first import the data that is in the biom file using this function,
|
treefilename |
(Optional). Default value is |
refseqfilename |
(Optional). Default |
refseqFunction |
(Optional).
Default is |
refseqArgs |
(Optional).
Default |
parseFunction |
(Optional). A function. It must be a function that
takes as its first argument a character vector of taxonomic rank labels
for a single OTU
and parses and names each element
(an optionally removes unwanted elements).
Further details and examples of acceptable functions are provided
in the documentation for |
parallel |
(Optional). Logical. Wrapper option for Finally, for many datasets a parallel import should not be necessary
because a serial import will be just as fast and the import is often only
performed one time; after which the data should be saved as an RData file
using the |
version |
(Optional). Numeric. The expected version number of the file.
As the BIOM format evolves, version-specific importers may be available
by adjusting the version value. Default is |
... |
Additional parameters passed on to |
“The biom file format (canonically pronounced ‘biome’) is designed to be a general-use format for representing counts of observations in one or more biological samples. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium candidate project.”
A phyloseq-class
object.
# An included example of a rich dense biom file rich_dense_biom <- system.file("extdata", "rich_dense_otu_table.biom", package="phyloseq") import_biom(rich_dense_biom, parseFunction=parse_taxonomy_greengenes) # An included example of a sparse dense biom file rich_sparse_biom <- system.file("extdata", "rich_sparse_otu_table.biom", package="phyloseq") import_biom(rich_sparse_biom, parseFunction=parse_taxonomy_greengenes) # # # Example code for importing large file with parallel backend # library("doParallel") # registerDoParallel(cores=6) # import_biom("my/file/path/file.biom", parseFunction=parse_taxonomy_greengenes, parallel=TRUE)
# An included example of a rich dense biom file rich_dense_biom <- system.file("extdata", "rich_dense_otu_table.biom", package="phyloseq") import_biom(rich_dense_biom, parseFunction=parse_taxonomy_greengenes) # An included example of a sparse dense biom file rich_sparse_biom <- system.file("extdata", "rich_sparse_otu_table.biom", package="phyloseq") import_biom(rich_sparse_biom, parseFunction=parse_taxonomy_greengenes) # # # Example code for importing large file with parallel backend # library("doParallel") # registerDoParallel(cores=6) # import_biom("my/file/path/file.biom", parseFunction=parse_taxonomy_greengenes, parallel=TRUE)
Convenience wrapper function to read the environment-file, as formatted for input to the UniFrac server (http://bmf2.colorado.edu/unifrac/). The official format of these files is that each row specifies (in order) the sequence name, source sample, and (optionally) the number of times the sequence was observed.
import_env_file(envfilename, tree=NULL, sep="\t", ...)
import_env_file(envfilename, tree=NULL, sep="\t", ...)
envfilename |
(Required). A charater string of the ENV filename (relative or absolute) |
tree |
(Optional). |
sep |
A character string indicating the delimiter used in the file.
The default is |
... |
Additional parameters passed on to |
An otu_table-class
, or phyloseq-class
if
a phylo-class
argument is provided to tree
.
http://bmf2.colorado.edu/unifrac/
# import_env_file(myEnvFile, myTree)
# import_env_file(myEnvFile, myTree)
Technically all parameters are optional,
but if you don't provide any file connections, then nothing will be returned.
While the list
and group
files are the first two arguments
for legacy-compatibility reasons, we don't recommend that you use these
file types with modern (large) datasets. They are comically inefficient, as
they store the name of every sequencing read in both files. The mothur
package provides conversions utilities to create other more-efficient formats,
which we recommend, like
the shared file for an OTU table.
Alternatively, mothur also provides a utility to create a biom-format file
that is independent of OTU clustering platform. Biom-format files
should be imported not with this function, but with import_biom
.
The resulting objects after import should be identical
in R.
import_mothur(mothur_list_file = NULL, mothur_group_file = NULL, mothur_tree_file = NULL, cutoff = NULL, mothur_shared_file = NULL, mothur_constaxonomy_file = NULL, parseFunction = parse_taxonomy_default)
import_mothur(mothur_list_file = NULL, mothur_group_file = NULL, mothur_tree_file = NULL, cutoff = NULL, mothur_shared_file = NULL, mothur_constaxonomy_file = NULL, parseFunction = parse_taxonomy_default)
mothur_list_file |
(Optional). The list file name / location produced by mothur. |
mothur_group_file |
(Optional). The name/location of the group file produced
by mothur's |
mothur_tree_file |
(Optional).
A tree file, presumably produced by mothur,
and readable by |
cutoff |
(Optional). A character string indicating the cutoff value, (or |
mothur_shared_file |
(Optional). A shared file produced by mothur. |
mothur_constaxonomy_file |
(Optional). A consensus taxonomy file produced by mothur. |
parseFunction |
(Optional). A specific function used for parsing the taxonomy string.
See |
The object class depends on the provided arguments. A phyloseq object is returned if enough data types are provided. If only one data component can be created from the data, it is returned.
FASTER (recommended for larger data sizes):
If only a mothur_constaxonomy_file
is provided,
then a taxonomyTable-class
object is returned.
If only a mothur_shared_file
is provided,
then an otu_table
object is returned.
SLOWER (but fine for small file sizes):
The list and group file formats are extremely inefficient for large datasets,
and they are not recommended. The mothur software provides tools for
converting to other file formats, such as a so-called “shared” file.
You should provide a shared file, or group/list files, but not
both at the same time.
If only a list and group file are provided,
then an otu_table
object is returned.
Similarly, if only a list and tree file are provided,
then only a tree is returned (phylo
-class).
http://www.mothur.org/wiki/Main_Page
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.
# # The following example assumes you have downloaded the esophagus example # # dataset from the mothur wiki: # # "http://www.mothur.org/wiki/Esophageal_community_analysis" # # "http://www.mothur.org/w/images/5/55/Esophagus.zip" # # The path on your machine may (probably will) vary # mothur_list_file <- "~/Downloads/mothur/Esophagus/esophagus.an.list" # mothur_group_file <- "~/Downloads/mothur/Esophagus/esophagus.good.groups" # mothur_tree_file <- "~/Downloads/mothur/Esophagus/esophagus.tree" # # # Actual examples follow: # show_mothur_cutoffs(mothur_list_file) # test1 <- import_mothur(mothur_list_file, mothur_group_file, mothur_tree_file) # test2 <- import_mothur(mothur_list_file, mothur_group_file, mothur_tree_file, cutoff="0.02") # # Returns just a tree # import_mothur(mothur_list_file, mothur_tree_file=mothur_tree_file) # # Returns just an otu_table # import_mothur(mothur_list_file, mothur_group_file=mothur_group_file) # # Returns an error # import_mothur(mothur_list_file) # # Should return an "OMG, you must provide the list file" error # import_mothur()
# # The following example assumes you have downloaded the esophagus example # # dataset from the mothur wiki: # # "http://www.mothur.org/wiki/Esophageal_community_analysis" # # "http://www.mothur.org/w/images/5/55/Esophagus.zip" # # The path on your machine may (probably will) vary # mothur_list_file <- "~/Downloads/mothur/Esophagus/esophagus.an.list" # mothur_group_file <- "~/Downloads/mothur/Esophagus/esophagus.good.groups" # mothur_tree_file <- "~/Downloads/mothur/Esophagus/esophagus.tree" # # # Actual examples follow: # show_mothur_cutoffs(mothur_list_file) # test1 <- import_mothur(mothur_list_file, mothur_group_file, mothur_tree_file) # test2 <- import_mothur(mothur_list_file, mothur_group_file, mothur_tree_file, cutoff="0.02") # # Returns just a tree # import_mothur(mothur_list_file, mothur_tree_file=mothur_tree_file) # # Returns just an otu_table # import_mothur(mothur_list_file, mothur_group_file=mothur_group_file) # # Returns an error # import_mothur(mothur_list_file) # # Should return an "OMG, you must provide the list file" error # import_mothur()
The mothur application will produce a file containing the pairwise distances between all sequences in a dataset. This distance matrix can be the basis for OTU cluster designations. R also has many built-in or off-the-shelf tools for dealing with distance matrices.
import_mothur_dist(mothur_dist_file)
import_mothur_dist(mothur_dist_file)
mothur_dist_file |
Required. The distance file name / location produced by mothur. |
A distance matrix object describing all sequences in a dataset.
# # Take a look at the dataset shown here as an example: # # "http://www.mothur.org/wiki/Esophageal_community_analysis" # # find the file ending with extension ".dist", download to your system # # The location of your file may vary # mothur_dist_file <- "~/Downloads/mothur/Esophagus/esophagus.dist" # myNewDistObject <- import_mothur_dist(mothur_dist_file)
# # Take a look at the dataset shown here as an example: # # "http://www.mothur.org/wiki/Esophageal_community_analysis" # # find the file ending with extension ".dist", download to your system # # The location of your file may vary # mothur_dist_file <- "~/Downloads/mothur/Esophagus/esophagus.dist" # myNewDistObject <- import_mothur_dist(mothur_dist_file)
PyroTagger is a web-server that takes raw, barcoded 16S rRNA amplicon sequences
and returns an excel spreadsheet (".xls"
) with both abundance and
taxonomy data. It also includes some confidence information related to the
taxonomic assignment.
import_pyrotagger_tab(pyrotagger_tab_file, strict_taxonomy=FALSE, keep_potential_chimeras=FALSE)
import_pyrotagger_tab(pyrotagger_tab_file, strict_taxonomy=FALSE, keep_potential_chimeras=FALSE)
pyrotagger_tab_file |
(Required). A character string. The name of the tab-delimited pyrotagger output table. |
strict_taxonomy |
(Optional). Logical. Default |
keep_potential_chimeras |
(Optional). Logical. Default |
PyroTagger is created and maintained by the Joint Genome Institute
at "http://pyrotagger.jgi-psf.org/"
The typical output form PyroTagger is a spreadsheet format ".xls"
, which poses
additional import challenges. However, virtually all spreadsheet applications
support the ".xls"
format, and can further export this file in a
tab-delimited format. It is recommended that you convert the xls-file without
any modification (as tempting as it might be once you have loaded it) into a
tab-delimited text file. Deselect any options to encapsulate fields in quotes,
as extra quotes around each cell's contents might cause problems during
file processing. These quotes will also inflate the file-size, so leave them out
as much as possible, while also resisting any temptation to modify the xls-file
“by hand”.
A highly-functional and free spreadsheet application can be obtained as part
of the cross-platform OpenOffice
suite. It works for the above
required conversion. Go to "http://www.openoffice.org/"
.
It is regrettable that this importer does not take the xls-file directly
as input. However, because of the moving-target nature of spreadsheet
file formats, there is limited support for direct import of these formats into
R
. Rather than add to the dependency requirements of emphphyloseq
and the relative support of these xls-support packages, it seems more efficient
to choose an arbitrary delimited text format, and focus on the data
structure in the PyroTagger output. This will be easier to support in the
long-run.
An otuTax
object containing both the otu_table and TaxonomyTable data
components, parsed from the pyrotagger output.
http://pyrotagger.jgi-psf.org/
## New_otuTaxObject <- import_pyrotagger_tab(pyrotagger_tab_file)
## New_otuTaxObject <- import_pyrotagger_tab(pyrotagger_tab_file)
QIIME produces several files that can be directly imported by
the phyloseq-package
.
Originally, QIIME produced its own custom format table
that contained both OTU-abundance
and taxonomic identity information.
This function is still included in phyloseq mainly to accommodate these
now-outdated files. Recent versions of QIIME store output in the
biom-format, an emerging file format standard for microbiome data.
If your data is in the biom-format, if it ends with a .biom
file name extension, then you should use the import_biom
function instead.
import_qiime(otufilename = NULL, mapfilename = NULL, treefilename = NULL, refseqfilename = NULL, refseqFunction = readDNAStringSet, refseqArgs = NULL, parseFunction = parse_taxonomy_qiime, verbose = TRUE, ...)
import_qiime(otufilename = NULL, mapfilename = NULL, treefilename = NULL, refseqfilename = NULL, refseqFunction = readDNAStringSet, refseqArgs = NULL, parseFunction = parse_taxonomy_qiime, verbose = TRUE, ...)
otufilename |
(Optional). A character string indicating
the file location of the OTU file.
The combined OTU abundance and taxonomic identification file,
tab-delimited, as produced by QIIME under default output settings.
Default value is |
mapfilename |
(Optional). The QIIME map file is required
for processing barcoded primers in QIIME
as well as some of the post-clustering analysis. This is a required
input file for running QIIME. Its strict formatting specification should be
followed for correct parsing by this function.
Default value is |
treefilename |
(Optional). Default value is |
refseqfilename |
(Optional). Default |
refseqFunction |
(Optional).
Default is |
refseqArgs |
(Optional).
Default |
parseFunction |
(Optional). An optional custom function for parsing the
character string that contains the taxonomic assignment of each OTU.
The default parsing function is |
verbose |
(Optional). A |
... |
Additional arguments passed to |
Other related files include
the mapping-file that typically stores sample covariates,
converted naturally to the
sample_data-class
component data type in the phyloseq-package.
QIIME may also produce a
phylogenetic tree with a tip for each OTU, which can also be imported
specified here or imported separately using read_tree
.
See "http://www.qiime.org/" for details on using QIIME. While there are many complex dependencies, QIIME can be downloaded as a pre-installed linux virtual machine that runs “off the shelf”.
The different files useful for import to phyloseq are not collocated in a typical run of the QIIME pipeline. See the main phyloseq vignette for an example of where ot find the relevant files in the output directory.
A phyloseq-class
object.
“QIIME allows analysis of high-throughput community sequencing data.” J Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D Bushman, Elizabeth K Costello, Noah Fierer, Antonio Gonzalez Pena, Julia K Goodrich, Jeffrey I Gordon, Gavin A Huttley, Scott T Kelley, Dan Knights, Jeremy E Koenig, Ruth E Ley, Catherine A Lozupone, Daniel McDonald, Brian D Muegge, Meg Pirrung, Jens Reeder, Joel R Sevinsky, Peter J Turnbaugh, William A Walters, Jeremy Widmann, Tanya Yatsunenko, Jesse Zaneveld and Rob Knight; Nature Methods, 2010; doi:10.1038/nmeth.f.303
otufile <- system.file("extdata", "GP_otu_table_rand_short.txt.gz", package="phyloseq") mapfile <- system.file("extdata", "master_map.txt", package="phyloseq") trefile <- system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq") import_qiime(otufile, mapfile, trefile)
otufile <- system.file("extdata", "GP_otu_table_rand_short.txt.gz", package="phyloseq") mapfile <- system.file("extdata", "master_map.txt", package="phyloseq") trefile <- system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq") import_qiime(otufile, mapfile, trefile)
Now a legacy-format, older versions of QIIME
produced an OTU file that typically contains both OTU-abundance
and taxonomic identity information in a tab-delimted table.
If your file ends with the extension .biom
, or if you happen to know
that it is a biom-format file, or if you used default settings in a version
of QIIME of 1.7
or greater,
then YOU SHOULD USE THE BIOM-IMPORT FUNCTION instead,
import_biom
.
import_qiime_otu_tax(file, parseFunction = parse_taxonomy_qiime, verbose = TRUE, parallel = FALSE)
import_qiime_otu_tax(file, parseFunction = parse_taxonomy_qiime, verbose = TRUE, parallel = FALSE)
file |
(Required). The path to the qiime-formatted file you want to
import into R. Can be compressed (e.g. |
parseFunction |
(Optional). An optional custom function for parsing the
character string that contains the taxonomic assignment of each OTU.
The default parsing function is |
verbose |
(Optional). A |
parallel |
(Optional). Logical. Should the parsing be performed in
parallel?. Default is |
This function uses chunking to perform both the reading and parsing in blocks of optional size, thus constrain the peak memory usage. feature should make this importer accessible to machines with modest memory, but with the caveat that the full numeric matrix must be a manageable size at the end, too. In principle, the final tables will be large, but much more efficiently represented than the character-stored numbers. If total memory for storing the numeric matrix becomes problematic, a switch to a sparse matrix representation of the abundance – which is typically well-suited to this data – might provide a solution.
A list of two matrices. $otutab
contains the OTU Table
as a numeric matrix, while $taxtab
contains a character matrix
of the taxonomy assignments.
otufile <- system.file("extdata", "GP_otu_table_rand_short.txt.gz", package="phyloseq") import_qiime_otu_tax(otufile)
otufile <- system.file("extdata", "GP_otu_table_rand_short.txt.gz", package="phyloseq") import_qiime_otu_tax(otufile)
sample_data
file from QIIME pipeline.QIIME produces several files that can be analyzed in the phyloseq-package, This includes the map-file, which is an important input to QIIME that can also indicate sample covariates. It is converted naturally to the sample_data component data type in phyloseq-package, based on the R data.frame.
import_qiime_sample_data(mapfilename)
import_qiime_sample_data(mapfilename)
mapfilename |
(Required). A character string or connection.
That is, any suitable |
See import_qiime
for more information about QIIME. It is also the
suggested function for importing QIIME-produced data files.
A sample_data
object.
mapfile <- system.file("extdata", "master_map.txt", package = "phyloseq") import_qiime_sample_data(mapfile)
mapfile <- system.file("extdata", "master_map.txt", package = "phyloseq") import_qiime_sample_data(mapfile)
The RDP cluster pipeline (specifically, the output of the complete linkage clustering step)
has no formal documentation for the ".clust"
file or its apparent sequence naming convention.
import_RDP_cluster(RDP_cluster_file)
import_RDP_cluster(RDP_cluster_file)
RDP_cluster_file |
A character string. The name of the |
http://pyro.cme.msu.edu/index.jsp
The cluster file itself contains
the names of all sequences contained in input alignment. If the upstream
barcode and aligment processing steps are also done with the RDP pipeline,
then the sequence names follow a predictable naming convention wherein each
sequence is named by its sample and sequence ID, separated by a "_"
as
delimiter:
"sampleName_sequenceIDnumber"
This import function assumes that the sequence names in the cluster file follow
this convention, and that the sample name does not contain any "_"
. It
is unlikely to work if this is not the case. It is likely to work if you used
the upstream steps in the RDP pipeline to process your raw (barcoded, untrimmed)
fasta/fastq data.
This function first loops through the ".clust"
file and collects all
of the sample names that appear. It secondly loops through each OTU ("cluster"
;
each row of the cluster file) and sums the number of sequences (reads) from
each sample. The resulting abundance table of OTU-by-sample is trivially
coerced to an otu_table
object, and returned.
An otu_table
object parsed from the ".clust"
file.
http://pyro.cme.msu.edu/index.jsp
Recently updated tools on RDP Pyro site make it easier to import Pyrosequencing output
into R. The modified tool “Cluster To R Formatter” can take a cluster file
(generated from RDP Clustering tools) to create a community data matrix file
for distance cutoff range you are interested in. The resulting output file
is a tab-delimited file containing the number of sequences for each sample
for each OTU. The OTU header naming convention is "OTU_"
followed by the OTU
number in the cluster file. It pads “0”s to make the OTU header easy to sort.
The OTU numbers are not necessarily in order.
import_RDP_otu(otufile)
import_RDP_otu(otufile)
otufile |
(Optional). A character string indicating the file location of the OTU file, produced/exported according to the instructions above. |
A otu_table-class
object.
An alternative “cluster” file importer for RDP results:
import_RDP_cluster
The main RDP-pyrosequencing website http://pyro.cme.msu.edu/index.jsp
otufile <- system.file("extdata", "rformat_dist_0.03.txt.gz", package="phyloseq") ### the gzipped file is automatically recognized, and read using R-connections ex_otu <- import_RDP_otu(otufile) class(ex_otu) ntaxa(ex_otu) nsamples(ex_otu) sample_sums(ex_otu) head(t(ex_otu))
otufile <- system.file("extdata", "rformat_dist_0.03.txt.gz", package="phyloseq") ### the gzipped file is automatically recognized, and read using R-connections ex_otu <- import_RDP_otu(otufile) class(ex_otu) ntaxa(ex_otu) nsamples(ex_otu) sample_sums(ex_otu) head(t(ex_otu))
UPARSE is an algorithm for OTU-clustering implemented within usearch.
At last check, the UPARSE algortihm was accessed via the
-cluster_otu
option flag.
For details about installing and running usearch, please refer to the
usearch website.
For details about the output format, please refer to the
uparse format definition.
import_uparse(upFile, omitChimeras = TRUE, countTable = TRUE, OTUtable = TRUE, verbose = TRUE)
import_uparse(upFile, omitChimeras = TRUE, countTable = TRUE, OTUtable = TRUE, verbose = TRUE)
upFile |
(Required). A file location character string
or |
omitChimeras |
(Optional). |
countTable |
(Optional). |
OTUtable |
(Optional). |
verbose |
(Optional). A |
Because UPARSE is an external (non-R) application, there is no direct way to continuously check that these suggested arguments and file formats will remain in their current state. If there is a problem, please verify your version of usearch, create a small reproducible example of the problem, and post it as an issue on the phyloseq issues tracker.
###
###
.uc
) to OTU tableUPARSE is an algorithm for OTU-clustering implemented within usearch.
At last check, the UPARSE algortihm was accessed via the
-cluster_otu
option flag.
For details about installing and running usearch, please refer to the
usearch website.
For details about the output format, please refer to the
uc format definition.
This importer is intended to read a particular table format output
that is generated by usearch,
its so-called “cluster format”,
a file format that is often given the .uc
extension
in usearch documentation.
import_usearch_uc(ucfile, colRead = 9, colOTU = 10, readDelimiter = "_", verbose = TRUE)
import_usearch_uc(ucfile, colRead = 9, colOTU = 10, readDelimiter = "_", verbose = TRUE)
ucfile |
(Required). A file location character string
or |
colRead |
(Optional). Numeric. The column index in the uc-table
file that holds the read IDs.
The default column index is |
colOTU |
(Optional). Numeric. The column index in the uc-table
file that holds OTU IDs.
The default column index is |
readDelimiter |
(Optional). An R |
verbose |
(Optional). A |
Because usearch is an external (non-R) application, there is no direct
way to continuously check that these suggested arguments and file formats will
remain in their current state.
If there is a problem, please verify your version of usearch,
create a small reproducible example of the problem,
and post it as an issue on the phyloseq issues tracker.
The version of usearch upon which this import function
was created is 7.0.109
.
Hopefully later versions of usearch maintain this function and format,
but the phyloseq team has no way to guarantee this,
and so any feedback about this will help maintain future functionality.
For instance, it is currently
assumed that the 9th and 10th columns of the .uc
table
hold the read-label and OTU ID, respectively;
and it is also assumed that the delimiter between sample-name and read
in the read-name entries is a single "_"
.
If this is not true, you may have to update these parameters,
or even modify the current implementation of this function.
Also note that there is now a UPARSE-specific output file format,
uparseout,
and it might make more sense to create and import that file
for use in phyloseq.
If so, you'll want to import using the
import_uparse()
function.
usearchfile <- system.file("extdata", "usearch.uc", package="phyloseq") import_usearch_uc(usearchfile)
usearchfile <- system.file("extdata", "usearch.uc", package="phyloseq") import_usearch_uc(usearchfile)
A specialized function for creating a network representation of microbiomes,
sample-wise or taxa-wise,
based on a user-defined ecological distance and (potentially arbitrary) threshold.
The graph is ultimately represented using the
igraph
-package.
make_network(physeq, type="samples", distance="jaccard", max.dist = 0.4, keep.isolates=FALSE, ...)
make_network(physeq, type="samples", distance="jaccard", max.dist = 0.4, keep.isolates=FALSE, ...)
physeq |
(Required). Default |
type |
(Optional). Default NOTE: not all distance methods are supported if |
distance |
(Optional). Default Alternatively, if you have already calculated the sample-wise distance
object, the resulting A third alternative is to provide a function that takes a sample-by-taxa matrix (typical vegan orientation) and returns a sample-wise distance matrix. |
max.dist |
(Optional). Default |
keep.isolates |
(Optional). Default |
... |
(Optional). Additional parameters passed on to |
A igraph
-class object.
# # Example plots with Enterotype Dataset data(enterotype) ig <- make_network(enterotype, max.dist=0.3) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL) # ig1 <- make_network(enterotype, max.dist=0.2) plot_network(ig1, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL) # # # Three methods of choosing/providing distance/distance-method # Provide method name available to distance() function ig <- make_network(enterotype, max.dist=0.3, distance="jaccard") # Provide distance object, already computed jaccdist <- distance(enterotype, "jaccard") ih <- make_network(enterotype, max.dist=0.3, distance=jaccdist) # Provide "custom" function. ii <- make_network(enterotype, max.dist=0.3, distance=function(x){vegan::vegdist(x, "jaccard")}) # The have equal results: all.equal(ig, ih) all.equal(ig, ii) # # Try out making a trivial "network" of the 3-sample esophagus data, # with weighted-UniFrac as distance data(esophagus) ij <- make_network(esophagus, "samples", "unifrac", weighted=TRUE)
# # Example plots with Enterotype Dataset data(enterotype) ig <- make_network(enterotype, max.dist=0.3) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL) # ig1 <- make_network(enterotype, max.dist=0.2) plot_network(ig1, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL) # # # Three methods of choosing/providing distance/distance-method # Provide method name available to distance() function ig <- make_network(enterotype, max.dist=0.3, distance="jaccard") # Provide distance object, already computed jaccdist <- distance(enterotype, "jaccard") ih <- make_network(enterotype, max.dist=0.3, distance=jaccdist) # Provide "custom" function. ii <- make_network(enterotype, max.dist=0.3, distance=function(x){vegan::vegdist(x, "jaccard")}) # The have equal results: all.equal(ig, ih) all.equal(ig, ii) # # Try out making a trivial "network" of the 3-sample esophagus data, # with weighted-UniFrac as distance data(esophagus) ij <- make_network(esophagus, "samples", "unifrac", weighted=TRUE)
Takes a comma-separated list of phyloseq objects as arguments, and returns the most-comprehensive single phyloseq object possible.
merge_phyloseq(...)
merge_phyloseq(...)
... |
a comma-separated list of phyloseq objects. |
Higher-order objects can be created if arguments are appropriate component data
types of different
classes, and this should mirror the behavior of the phyloseq
method,
which is the suggested method if the goal is simply to create a higher-order
phyloseq object from different data types (1 of each class) describing the same experiment.
By contrast, this method is intended for situations in which one wants to combine
multiple higher-order objects, or multiple core component data objects (e.g. more than one
otu_table
) that should be combined into one object.
Merges are performed by first separating higher-order objects into
a list of their component objects; then, merging any component objects of the same class
into one object according to the behavior desribed in merge_phyloseq_pair
;
and finally, building back up a merged-object according to the constructor
behavior of the phyloseq
method. If the arguments contain only a single
component type – several otu_table objects, for example – then a single merged object
of that component type is returned.
Merges are performed by first separating higher-order objects into
a list of their component objects; then, merging any component objects of the same class
into one object according to the behavior desribed in merge_phyloseq_pair
;
and finally, re-building a merged-object according to the constructor
behavior of the phyloseq
method. If the arguments contain only a single
component type – several otu_table objects, for example – then a single merged object
of the relevant component type is returned.
Merges between 2 or more tree objects are ultimately done using
consensus
from the ape package.
This has the potential to limit somewhat the final data object, because trees
don't merge with other trees in the same granular manner as data tables, and
ultimately the species/taxa in higher-order phyloseq objects will be clipped to
what is contained in the tree. If this an issue, the tree component should
be ommitted from the argument list.
# ## # Make a random complex object ## OTU1 <- otu_table(matrix(sample(0:5,250,TRUE),25,10), taxa_are_rows=TRUE) ## tax1 <- tax_table(matrix("abc", 30, 8)) ## map1 <- data.frame( matrix(sample(0:3,250,TRUE),25,10), ## matrix(sample(c("a","b","c"),150,TRUE), 25, 6) ) ## map1 <- sample_data(map1) ## exam1 <- phyloseq(OTU1, map1, tax1) ## x <- exam1 ## x <- phyloseq(exam1) ## y <- tax_table(exam1) ## merge_phyloseq(x, y) ## merge_phyloseq(y, y, y, y)
# ## # Make a random complex object ## OTU1 <- otu_table(matrix(sample(0:5,250,TRUE),25,10), taxa_are_rows=TRUE) ## tax1 <- tax_table(matrix("abc", 30, 8)) ## map1 <- data.frame( matrix(sample(0:3,250,TRUE),25,10), ## matrix(sample(c("a","b","c"),150,TRUE), 25, 6) ) ## map1 <- sample_data(map1) ## exam1 <- phyloseq(OTU1, map1, tax1) ## x <- exam1 ## x <- phyloseq(exam1) ## y <- tax_table(exam1) ## merge_phyloseq(x, y) ## merge_phyloseq(y, y, y, y)
Internal S4 methods to combine pairs of objects of classes specified in the
phyloseq package. These objects must be component data of the same type
(class). This is mainly an internal method, provided to illustrate how
merging is performed by the more general merge_phyloseq
function.
merge_phyloseq_pair(x, y) ## S4 method for signature 'otu_table,otu_table' merge_phyloseq_pair(x, y) ## S4 method for signature 'taxonomyTable,taxonomyTable' merge_phyloseq_pair(x, y) ## S4 method for signature 'sample_data,sample_data' merge_phyloseq_pair(x, y) ## S4 method for signature 'phylo,phylo' merge_phyloseq_pair(x, y) ## S4 method for signature 'XStringSet,XStringSet' merge_phyloseq_pair(x, y)
merge_phyloseq_pair(x, y) ## S4 method for signature 'otu_table,otu_table' merge_phyloseq_pair(x, y) ## S4 method for signature 'taxonomyTable,taxonomyTable' merge_phyloseq_pair(x, y) ## S4 method for signature 'sample_data,sample_data' merge_phyloseq_pair(x, y) ## S4 method for signature 'phylo,phylo' merge_phyloseq_pair(x, y) ## S4 method for signature 'XStringSet,XStringSet' merge_phyloseq_pair(x, y)
x |
A character vector of the species in object x that you want to
keep – OR alternatively – a logical vector where the kept species are TRUE, and length
is equal to the number of species in object x. If |
y |
Any |
The merge_phyloseq
function is recommended in general.
Special note: non-identical trees are merged using consensus
.
A single component data object that matches x
and y
arguments. The returned object will
contain the union of the species and/or samples of each. If there is redundant
information between a pair of arguments of the same class, the values in x
are
used by default. Abundance values are summed for otu_table
objects
for those elements that describe the same species and sample in x
and y
.
# ## # merge two simulated otu_table objects. ## x <- otu_table(matrix(sample(0:5,200,TRUE),20,10), taxa_are_rows=TRUE) ## y <- otu_table(matrix(sample(0:5,300,TRUE),30,10), taxa_are_rows=FALSE) ## xy <- merge_phyloseq_pair(x, y) ## yx <- merge_phyloseq_pair(y, x) ## # merge two simulated tax_table objects ## x <- tax_table(matrix("abc", 20, 6)) ## y <- tax_table(matrix("def", 30, 8)) ## xy <- merge_phyloseq_pair(x, y) ## # merge two simulated sample_data objects ## x <- data.frame( matrix(sample(0:3,250,TRUE),25,10), ## matrix(sample(c("a","b","c"),150,TRUE),25,6) ) ## x <- sample_data(x) ## y <- data.frame( matrix(sample(4:6,200,TRUE),20,10), ## matrix(sample(c("d","e","f"),120,TRUE),20,8) ) ## y <- sample_data(y) ## merge_phyloseq_pair(x, y) ## data.frame(merge_phyloseq_pair(x, y)) ## data.frame(merge_phyloseq_pair(y, x))
# ## # merge two simulated otu_table objects. ## x <- otu_table(matrix(sample(0:5,200,TRUE),20,10), taxa_are_rows=TRUE) ## y <- otu_table(matrix(sample(0:5,300,TRUE),30,10), taxa_are_rows=FALSE) ## xy <- merge_phyloseq_pair(x, y) ## yx <- merge_phyloseq_pair(y, x) ## # merge two simulated tax_table objects ## x <- tax_table(matrix("abc", 20, 6)) ## y <- tax_table(matrix("def", 30, 8)) ## xy <- merge_phyloseq_pair(x, y) ## # merge two simulated sample_data objects ## x <- data.frame( matrix(sample(0:3,250,TRUE),25,10), ## matrix(sample(c("a","b","c"),150,TRUE),25,6) ) ## x <- sample_data(x) ## y <- data.frame( matrix(sample(4:6,200,TRUE),20,10), ## matrix(sample(c("d","e","f"),120,TRUE),20,8) ) ## y <- sample_data(y) ## merge_phyloseq_pair(x, y) ## data.frame(merge_phyloseq_pair(x, y)) ## data.frame(merge_phyloseq_pair(y, x))
The purpose of this method is to merge/agglomerate the sample indices of a phyloseq object according to a categorical variable contained in a sample_data or a provided factor.
merge_samples(x, group, fun=mean) ## S4 method for signature 'sample_data' merge_samples(x, group, fun = mean) ## S4 method for signature 'otu_table' merge_samples(x, group) ## S4 method for signature 'phyloseq' merge_samples(x, group, fun = mean)
merge_samples(x, group, fun=mean) ## S4 method for signature 'sample_data' merge_samples(x, group, fun = mean) ## S4 method for signature 'otu_table' merge_samples(x, group) ## S4 method for signature 'phyloseq' merge_samples(x, group, fun = mean)
x |
(Required). An instance of a phyloseq class that has sample indices. This includes
|
group |
(Required). Either the a single character string matching a variable name in
the corresponding sample_data of |
fun |
(Optional). The function that will be used to merge the values that
correspond to the same group for each variable. It must take a numeric vector
as first argument and return a single value. Default is |
NOTE: (phylo
) trees and taxonomyTable-class
are not modified by this function, but returned in the output object as-is.
A phyloseq object that has had its sample indices merged according to
the factor indicated by the group
argument. The output class
matches x
.
merge_taxa
, codemerge_phyloseq
# data(GlobalPatterns) GP = GlobalPatterns mergedGP = merge_samples(GlobalPatterns, "SampleType") SD = merge_samples(sample_data(GlobalPatterns), "SampleType") print(SD) print(mergedGP) sample_names(GlobalPatterns) sample_names(mergedGP) identical(SD, sample_data(mergedGP)) # The OTU abundances of merged samples are summed # Let's investigate this ourselves looking at just the top10 most abundance OTUs... OTUnames10 = names(sort(taxa_sums(GP), TRUE)[1:10]) GP10 = prune_taxa(OTUnames10, GP) mGP10 = prune_taxa(OTUnames10, mergedGP) ocean_samples = sample_names(subset(sample_data(GP), SampleType=="Ocean")) print(ocean_samples) otu_table(GP10)[, ocean_samples] rowSums(otu_table(GP10)[, ocean_samples]) otu_table(mGP10)["Ocean", ]
# data(GlobalPatterns) GP = GlobalPatterns mergedGP = merge_samples(GlobalPatterns, "SampleType") SD = merge_samples(sample_data(GlobalPatterns), "SampleType") print(SD) print(mergedGP) sample_names(GlobalPatterns) sample_names(mergedGP) identical(SD, sample_data(mergedGP)) # The OTU abundances of merged samples are summed # Let's investigate this ourselves looking at just the top10 most abundance OTUs... OTUnames10 = names(sort(taxa_sums(GP), TRUE)[1:10]) GP10 = prune_taxa(OTUnames10, GP) mGP10 = prune_taxa(OTUnames10, mergedGP) ocean_samples = sample_names(subset(sample_data(GP), SampleType=="Ocean")) print(ocean_samples) otu_table(GP10)[, ocean_samples] rowSums(otu_table(GP10)[, ocean_samples]) otu_table(mGP10)["Ocean", ]
x
into one species/taxa/OTU.Takes as input an object that describes species/taxa
(e.g. phyloseq-class
, otu_table-class
,
phylo-class
, taxonomyTable-class
),
as well as
a vector of species that should be merged.
It is intended to be able to operate at a low-level such that
related methods, such as tip_glom
and tax_glom
can both reliably call merge_taxa
for their respective purposes.
merge_taxa(x, eqtaxa, archetype=1) ## S4 method for signature 'phyloseq' merge_taxa(x, eqtaxa, archetype = eqtaxa[which.max(taxa_sums(x)[eqtaxa])]) ## S4 method for signature 'sample_data' merge_taxa(x, eqtaxa, archetype = 1L) ## S4 method for signature 'otu_table' merge_taxa(x, eqtaxa, archetype = eqtaxa[which.max(taxa_sums(x)[eqtaxa])]) ## S4 method for signature 'phylo' merge_taxa(x, eqtaxa, archetype = 1L) ## S4 method for signature 'XStringSet' merge_taxa(x, eqtaxa, archetype = 1L) ## S4 method for signature 'taxonomyTable' merge_taxa(x, eqtaxa, archetype = 1L)
merge_taxa(x, eqtaxa, archetype=1) ## S4 method for signature 'phyloseq' merge_taxa(x, eqtaxa, archetype = eqtaxa[which.max(taxa_sums(x)[eqtaxa])]) ## S4 method for signature 'sample_data' merge_taxa(x, eqtaxa, archetype = 1L) ## S4 method for signature 'otu_table' merge_taxa(x, eqtaxa, archetype = eqtaxa[which.max(taxa_sums(x)[eqtaxa])]) ## S4 method for signature 'phylo' merge_taxa(x, eqtaxa, archetype = 1L) ## S4 method for signature 'XStringSet' merge_taxa(x, eqtaxa, archetype = 1L) ## S4 method for signature 'taxonomyTable' merge_taxa(x, eqtaxa, archetype = 1L)
x |
(Required). An object that describes species (taxa). This includes
|
eqtaxa |
(Required). The species names, or indices, that should be merged together.
If |
archetype |
(Optional). A single-length numeric or character.
The index of |
The object, x
, in its original class, but with the specified
species merged into one entry in all relevant components.
tip_glom
, tax_glom
, merge_phyloseq
,
merge_samples
# data(esophagus) tree <- phy_tree(esophagus) otu <- otu_table(esophagus) otutree0 <- phyloseq(otu, tree) # plot_tree(otutree0) otutree1 <- merge_taxa(otutree0, 1:8, 2) # plot_tree(esophagus, ladderize="left")
# data(esophagus) tree <- phy_tree(esophagus) otu <- otu_table(esophagus) otutree0 <- phyloseq(otu, tree) # plot_tree(otutree0) otutree1 <- merge_taxa(otutree0, 1:8, 2) # plot_tree(esophagus, ladderize="left")
Originally, this function was for accessing microbiome datasets from the microbio.me/qiime public repository from within R. As you can see by clicking on the above link, the QIIME-DB sever is down indefinitely. However, this function will remain supported here in case the FTP server goes back up, and also for phyloseq users that have downloaded one or more data packages prior to the server going down.
microbio_me_qiime(zipftp, ext = ".zip", parsef = parse_taxonomy_greengenes, ...)
microbio_me_qiime(zipftp, ext = ".zip", parsef = parse_taxonomy_greengenes, ...)
zipftp |
(Required). A character string that is the full URL
path to a zipped file that follows the file naming conventions used by
microbio.me/qiime.
Alternatively, you can simply provide the study number
as a single |
ext |
(Optional). A |
parsef |
(Optional). The type of taxonomic parsing to use for the
OTU taxonomic classification, in the |
... |
(Optional, for advanced users). Additional arguments passed to
|
A phyloseq-class
object if possible, a component if only a
component could be imported, or NULL
if nothing could be imported
after unzipping the file. Keep in mind there is a specific naming-convention
that is expected based on the current state of the
microbio.me/qiime
servers. Several helpful messages are cat
ted to standard out
to help let you know the ongoing status of the current
download and import process.
See download.file
and url
for details about URL formats –
including local file addresses – that might work here.
# This should return TRUE on your system if you have internet turned on # and a standard R installation. Indicates whether this is likely to # work on your system for a URL or local file, respectively. capabilities("http/ftp"); capabilities("fifo") # A working example with a local example file included in phyloseq zipfile = "study_816_split_library_seqs_and_mapping.zip" zipfile = system.file("extdata", zipfile, package="phyloseq") tarfile = "study_816_split_library_seqs_and_mapping.tar.gz" tarfile = system.file("extdata", tarfile, package="phyloseq") tarps = microbio_me_qiime(tarfile) zipps = microbio_me_qiime(zipfile) identical(tarps, zipps) tarps; zipps plot_heatmap(tarps) # An example that used to work, before the QIIME-DB server was turned off by its host. # # Smokers dataset # smokezip = "ftp://thebeast.colorado.edu/pub/QIIME_DB_Public_Studies/study_524_split_library_seqs_and_mapping.zip" # smokers1 = microbio_me_qiime(smokezip) # # Alternatively, just use the study number # smokers2 = microbio_me_qiime(524) # identical(smokers1, smokers2)
# This should return TRUE on your system if you have internet turned on # and a standard R installation. Indicates whether this is likely to # work on your system for a URL or local file, respectively. capabilities("http/ftp"); capabilities("fifo") # A working example with a local example file included in phyloseq zipfile = "study_816_split_library_seqs_and_mapping.zip" zipfile = system.file("extdata", zipfile, package="phyloseq") tarfile = "study_816_split_library_seqs_and_mapping.tar.gz" tarfile = system.file("extdata", tarfile, package="phyloseq") tarps = microbio_me_qiime(tarfile) zipps = microbio_me_qiime(zipfile) identical(tarps, zipps) tarps; zipps plot_heatmap(tarps) # An example that used to work, before the QIIME-DB server was turned off by its host. # # Smokers dataset # smokezip = "ftp://thebeast.colorado.edu/pub/QIIME_DB_Public_Studies/study_524_split_library_seqs_and_mapping.zip" # smokers1 = microbio_me_qiime(smokezip) # # Alternatively, just use the study number # smokers2 = microbio_me_qiime(524) # identical(smokers1, smokers2)
Please note that it is up to you to perform any necessary
normalizing / standardizing transformations prior to these tests.
See for instance transform_sample_counts
.
mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'phyloseq,ANY' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,integer' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,numeric' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,logical' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,character' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,factor' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...)
mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'phyloseq,ANY' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,integer' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,numeric' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,logical' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,character' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...) ## S4 method for signature 'otu_table,factor' mt(physeq, classlabel, minPmaxT = "minP", method = "fdr", ...)
physeq |
(Required). |
classlabel |
(Required). A single character index of the sample-variable
in the NOTE: the default test applied to each taxa is a two-sample two-sided
|
minPmaxT |
(Optional). Character string. |
method |
(Optional). Additional multiple-hypthesis correction methods.
A character vector from the set |
... |
(Optional). Additional arguments, forwarded to
|
A dataframe with components specified in the documentation for
mt.maxT
or mt.minP
, respectively.
## # Simple example, testing genera that sig correlate with Enterotypes data(enterotype) # Filter samples that don't have Enterotype x <- subset_samples(enterotype, !is.na(Enterotype)) # (the taxa are at the genera level in this dataset) res = mt(x, "Enterotype", method="fdr", test="f", B=300) head(res, 10) ## # Not surprisingly, Prevotella and Bacteroides top the list. ## # Different test, multiple-adjusted t-test, whether samples are ent-2 or not. ## mt(x, get_variable(x, "Enterotype")==2)
## # Simple example, testing genera that sig correlate with Enterotypes data(enterotype) # Filter samples that don't have Enterotype x <- subset_samples(enterotype, !is.na(Enterotype)) # (the taxa are at the genera level in this dataset) res = mt(x, "Enterotype", method="fdr", test="f", B=300) head(res, 10) ## # Not surprisingly, Prevotella and Bacteroides top the list. ## # Different test, multiple-adjusted t-test, whether samples are ent-2 or not. ## mt(x, get_variable(x, "Enterotype")==2)
Unlike, nodeplotdefault
and nodeplotboot
,
this function does not return a function, but instead is provided
directly to the nodelabf
argument of plot_tree
to
ensure that node labels are not added to the graphic.
Please note that you do not need to create or obtain the arguments to
this function. Instead, you can provide this function directly to
plot_tree
and it will know what to do with it. Namely,
use it to avoid plotting any node labels.
nodeplotblank(p, nodelabdf)
nodeplotblank(p, nodelabdf)
p |
(Required). The |
nodelabdf |
(Required). The |
The same input object, p
, provided as input. Unmodified.
data("esophagus") plot_tree(esophagus) plot_tree(esophagus, nodelabf=nodeplotblank)
data("esophagus") plot_tree(esophagus) plot_tree(esophagus, nodelabf=nodeplotblank)
Is not a labeling function itself, but returns one.
The returned function is specialized for labeling bootstrap values.
Note that the function that
is returned has two completely different arguments from the four listed here:
the plot object already built by earlier steps in
plot_tree
, and the data.frame
that contains the relevant plotting data for the nodes
(especially x, y, label
),
respectively.
See nodeplotdefault
for a simpler example.
The main purpose of this and nodeplotdefault
is to
provide a useful default function generator for arbitrary and
bootstrap node labels, respectively, and also to act as
examples of functions that can successfully interact with
plot_tree
to add node labels to the graphic.
nodeplotboot(highthresh=95L, lowcthresh=50L, size=2L, hjust=-0.2)
nodeplotboot(highthresh=95L, lowcthresh=50L, size=2L, hjust=-0.2)
highthresh |
(Optional). A single integer between 0 and 100. Any bootstrap values above this threshold will be annotated as a black filled circle on the node, rather than the bootstrap percentage value itself. |
lowcthresh |
(Optional). A single integer between 0 and 100,
less than |
size |
(Optional). Numeric. Should be positive. The
size parameter used to control the text size of taxa labels.
Default is |
hjust |
(Optional). The horizontal justification of the
node labels. Default is |
A function that can add a bootstrap-values layer to the tree graphic. The values are represented in two ways; either as black filled circles indicating very high-confidence nodes, or the bootstrap value itself printed in small text next to the node on the tree.
nodeplotboot() nodeplotboot(3, -0.4)
nodeplotboot() nodeplotboot(3, -0.4)
Is not a labeling function itself, but returns one.
The returned function is capable of adding
whatever label is on a node. Note that the function that
is returned has two completely different arguments to those listed here:
the plot object already built by earlier steps in
plot_tree
, and the data.frame
that contains the relevant plotting data for the nodes
(especially x, y, label
),
respectively.
See nodeplotboot
for a more sophisticated example.
The main purpose of this and nodeplotboot
is to
provide a useful default function generator for arbitrary and
bootstrap node labels, respectively, and also to act as
examples of functions that will successfully interact with
plot_tree
to add node labels to the graphic.
nodeplotdefault(size=2L, hjust=-0.2)
nodeplotdefault(size=2L, hjust=-0.2)
size |
(Optional). Numeric. Should be positive. The
size parameter used to control the text size of taxa labels.
Default is |
hjust |
(Optional). The horizontal justification of the
node labels. Default is |
A function that can add a node-label layer to a graphic.
nodeplotdefault() nodeplotdefault(3, -0.4)
nodeplotdefault() nodeplotdefault(3, -0.4)
Get the number of samples.
nsamples(physeq) ## S4 method for signature 'ANY' nsamples(physeq) ## S4 method for signature 'phyloseq' nsamples(physeq) ## S4 method for signature 'otu_table' nsamples(physeq) ## S4 method for signature 'sample_data' nsamples(physeq)
nsamples(physeq) ## S4 method for signature 'ANY' nsamples(physeq) ## S4 method for signature 'phyloseq' nsamples(physeq) ## S4 method for signature 'otu_table' nsamples(physeq) ## S4 method for signature 'sample_data' nsamples(physeq)
physeq |
A |
An integer indicating the total number of samples.
taxa_names
, sample_names
,
ntaxa
# data("esophagus") tree <- phy_tree(esophagus) OTU1 <- otu_table(esophagus) nsamples(OTU1) physeq1 <- phyloseq(OTU1, tree) nsamples(physeq1)
# data("esophagus") tree <- phy_tree(esophagus) OTU1 <- otu_table(esophagus) nsamples(OTU1) physeq1 <- phyloseq(OTU1, tree) nsamples(physeq1)
Get the number of taxa/species.
ntaxa(physeq) ## S4 method for signature 'ANY' ntaxa(physeq) ## S4 method for signature 'phyloseq' ntaxa(physeq) ## S4 method for signature 'otu_table' ntaxa(physeq) ## S4 method for signature 'taxonomyTable' ntaxa(physeq) ## S4 method for signature 'phylo' ntaxa(physeq) ## S4 method for signature 'XStringSet' ntaxa(physeq)
ntaxa(physeq) ## S4 method for signature 'ANY' ntaxa(physeq) ## S4 method for signature 'phyloseq' ntaxa(physeq) ## S4 method for signature 'otu_table' ntaxa(physeq) ## S4 method for signature 'taxonomyTable' ntaxa(physeq) ## S4 method for signature 'phylo' ntaxa(physeq) ## S4 method for signature 'XStringSet' ntaxa(physeq)
physeq |
|
An integer indicating the number of taxa / species.
taxa_names
data("esophagus") ntaxa(esophagus) phy_tree(esophagus) ntaxa(phy_tree(esophagus))
data("esophagus") ntaxa(esophagus) phy_tree(esophagus) ntaxa(phy_tree(esophagus))
This function wraps several commonly-used ordination methods. The type of
ordination depends upon the argument to method
. Try
ordinate("help")
or ordinate("list")
for the currently
supported method options.
ordinate(physeq, method = "DCA", distance = "bray", formula = NULL, ...)
ordinate(physeq, method = "DCA", distance = "bray", formula = NULL, ...)
physeq |
(Required). Phylogenetic sequencing data
( |
method |
(Optional). A character string. Default is Currently supported method options are:
|
distance |
(Optional). A character string. Default is Any supported |
formula |
(Optional). A model |
... |
(Optional). Additional arguments to supporting functions. For
example, the additional argument |
An ordination object. The specific class of the returned object depends upon the
ordination method, as well as the function/package that is called internally
to perform it.
As a general rule, any of the ordination classes
returned by this function will be recognized by downstream tools in the
phyloseq
package, for example the ordination plotting
function, plot_ordination
.
Related component ordination functions described within phyloseq:
Described/provided by other packages:
cca
/rda
, decorana
, metaMDS
,
pcoa
, capscale
NMDS and MDS/PCoA both operate on distance matrices, typically based on some
pairwise comparison of the microbiomes in an experiment/project. There are
a number of common methods to use to calculate these pairwise distances, and
the most convenient function (from a phyloseq
point of view) for calculating
these distance matrices is the
function. It can be
thought of as a distance / dissimilarity-index companion function for
ordinate
, and indeed the distance options provided to ordinate
are often simply passed on to distance
.
A good quick summary of ordination is provided in the introductory vignette for vegan:
The following R
task views are also useful for understanding the
available tools in R
:
Analysis of Ecological and Environmental Data
# See http://joey711.github.io/phyloseq/plot_ordination-examples # for many more examples. # plot_ordination(GP, ordinate(GP, "DCA"), "samples", color="SampleType")
# See http://joey711.github.io/phyloseq/plot_ordination-examples # for many more examples. # plot_ordination(GP, ordinate(GP, "DCA"), "samples", color="SampleType")
This is the suggested method for both constructing and accessing
Operational Taxonomic Unit (OTU) abundance (otu_table-class
) objects.
When the first
argument is a matrix, otu_table() will attempt to create and return an
otu_table-class object,
which further depends on whether or not taxa_are_rows
is provided as an
additional argument.
Alternatively, if the first argument is an experiment-level (phyloseq-class
)
object, then the corresponding otu_table
is returned.
otu_table(object, taxa_are_rows, errorIfNULL=TRUE) ## S4 method for signature 'phyloseq' otu_table(object, errorIfNULL = TRUE) ## S4 method for signature 'otu_table' otu_table(object, errorIfNULL = TRUE) ## S4 method for signature 'matrix' otu_table(object, taxa_are_rows) ## S4 method for signature 'data.frame' otu_table(object, taxa_are_rows) ## S4 method for signature 'ANY' otu_table(object, errorIfNULL = TRUE)
otu_table(object, taxa_are_rows, errorIfNULL=TRUE) ## S4 method for signature 'phyloseq' otu_table(object, errorIfNULL = TRUE) ## S4 method for signature 'otu_table' otu_table(object, errorIfNULL = TRUE) ## S4 method for signature 'matrix' otu_table(object, taxa_are_rows) ## S4 method for signature 'data.frame' otu_table(object, taxa_are_rows) ## S4 method for signature 'ANY' otu_table(object, errorIfNULL = TRUE)
object |
(Required). An integer matrix, |
taxa_are_rows |
(Conditionally optional). Logical; of length 1. Ignored
unless |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
An otu_table-class
object.
phy_tree
, sample_data
, tax_table
phyloseq
, merge_phyloseq
# # data(GlobalPatterns) # otu_table(GlobalPatterns)
# # data(GlobalPatterns) # otu_table(GlobalPatterns)
Because orientation of these tables can vary by method, the orientation is
defined explicitly in the taxa_are_rows
slot (a logical).
The otu_table
class inherits the matrix
class to store
abundance values.
Various standard subset and assignment nomenclature has been extended to apply
to the otu_table
class, including square-bracket, t
, etc.
A single logical specifying the orientation of the abundance table.
This slot is inherited from the matrix
class.
x
Assign a new OTU Table to x
otu_table(x) <- value ## S4 replacement method for signature 'phyloseq,otu_table' otu_table(x) <- value ## S4 replacement method for signature 'otu_table,otu_table' otu_table(x) <- value ## S4 replacement method for signature 'phyloseq,phyloseq' otu_table(x) <- value
otu_table(x) <- value ## S4 replacement method for signature 'phyloseq,otu_table' otu_table(x) <- value ## S4 replacement method for signature 'otu_table,otu_table' otu_table(x) <- value ## S4 replacement method for signature 'phyloseq,phyloseq' otu_table(x) <- value
x |
(Required). |
value |
(Required).
|
# data(GlobalPatterns) # # An example of pruning to just the first 100 taxa in GlobalPatterns. # ex2a <- prune_taxa(taxa_names(GlobalPatterns)[1:100], GlobalPatterns) # # The following 3 lines produces an ex2b that is equal to ex2a # ex2b <- GlobalPatterns # OTU <- otu_table(GlobalPatterns)[1:100, ] # otu_table(ex2b) <- OTU # identical(ex2a, ex2b) # print(ex2b) # # Relace otu_table by implying the component in context. # ex2c <- GlobalPatterns # otu_table(ex2c) <- ex2b # identical(ex2a, ex2c)
# data(GlobalPatterns) # # An example of pruning to just the first 100 taxa in GlobalPatterns. # ex2a <- prune_taxa(taxa_names(GlobalPatterns)[1:100], GlobalPatterns) # # The following 3 lines produces an ex2b that is equal to ex2a # ex2b <- GlobalPatterns # OTU <- otu_table(GlobalPatterns)[1:100, ] # otu_table(ex2b) <- OTU # identical(ex2a, ex2b) # print(ex2b) # # Relace otu_table by implying the component in context. # ex2c <- GlobalPatterns # otu_table(ex2c) <- ex2b # identical(ex2a, ex2c)
These are provided as both example and default functions for
parsing a character vector of taxonomic rank information for a single taxa.
As default functions, these are intended for cases where the data adheres to
the naming convention used by greengenes
(http://greengenes.lbl.gov/cgi-bin/nph-index.cgi)
or where the convention is unknown, respectively.
To work, these functions – and any similar custom function you may want to
create and use – must take as input a single character vector of taxonomic
ranks for a single OTU, and return a named character vector that has
been modified appropriately (according to known naming conventions,
desired length limits, etc.
The length (number of elements) of the output named vector does not
need to be equal to the input, which is useful for the cases where the
source data files have extra meaningless elements that should probably be
removed, like the ubiquitous
“Root” element often found in greengenes/QIIME taxonomy labels.
In the case of parse_taxonomy_default
, no naming convention is assumed and
so dummy rank names are added to the vector.
More usefully if your taxonomy data is based on greengenes, the
parse_taxonomy_greengenes
function clips the first 3 characters that
identify the rank, and uses these to name the corresponding element according
to the appropriate taxonomic rank name used by greengenes
(e.g. "p__"
at the beginning of an element means that element is
the name of the phylum to which this OTU belongs).
Most importantly, the expectations for these functions described above
make them compatible to use during data import,
specifcally the import_biom
function, but
it is a flexible structure that will be implemented soon for all phyloseq
import functions that deal with taxonomy (e.g. import_qiime
).
parse_taxonomy_default(char.vec) parse_taxonomy_greengenes(char.vec) parse_taxonomy_qiime(char.vec)
parse_taxonomy_default(char.vec) parse_taxonomy_greengenes(char.vec) parse_taxonomy_qiime(char.vec)
char.vec |
(Required). A single character vector of taxonomic ranks for a single OTU, unprocessed (ugly). |
A character vector in which each element is a different
taxonomic rank of the same OTU, and each element name is the name of
the rank level. For example, an element might be "Firmicutes"
and named "phylum"
.
These parsed, named versions of the taxonomic vector should
reflect embedded information, naming conventions,
desired length limits, etc; or in the case of parse_taxonomy_default
,
not modified at all and given dummy rank names to each element.
taxvec1 = c("Root", "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Staphylococcaceae") parse_taxonomy_default(taxvec1) parse_taxonomy_greengenes(taxvec1) taxvec2 = c("Root;k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae") parse_taxonomy_qiime(taxvec2)
taxvec1 = c("Root", "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Staphylococcaceae") parse_taxonomy_default(taxvec1) parse_taxonomy_greengenes(taxvec1) taxvec2 = c("Root;k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae") parse_taxonomy_qiime(taxvec2)
phylo
-class) from object.This is the suggested method
for accessing
the phylogenetic tree, (phylo
-class) from a phyloseq-class
.
Like other accessors (see See Also, below), the default behavior of this method
is to stop with an
error if physeq
is a phyloseq-class
but does not
contain a phylogenetic tree (the component data you are trying to access in this case).
phy_tree(physeq, errorIfNULL=TRUE) ## S4 method for signature 'ANY' phy_tree(physeq, errorIfNULL = TRUE) ## S4 method for signature 'phylo' phy_tree(physeq)
phy_tree(physeq, errorIfNULL=TRUE) ## S4 method for signature 'ANY' phy_tree(physeq, errorIfNULL = TRUE) ## S4 method for signature 'phylo' phy_tree(physeq)
physeq |
(Required). An instance of phyloseq-class that contains a phylogenetic tree. If physeq is a phylogenetic tree (a component data class), then it is returned as-is. |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
Note that the tip labels should be named to match the
taxa_names
of the other objects to which it is going to be paired.
The phyloseq
constructor automatically checks for
exact agreement in the set of species described by the phlyogenetic tree
and the other components (taxonomyTable, otu_table),
and trims as-needed. Thus, the tip.labels in a phylo object
must be named to match the results of
taxa_names
of the other objects to which it will ultimately be paired.
The phylo
-class object contained within physeq
;
or NULL if physeq
does not have a tree.
This method stops with an error in the latter NULL case be default, which
can be over-ridden by changing the value of errorIfNULL
to FALSE
.
otu_table
, sample_data
, tax_table
refseq
,
phyloseq
, merge_phyloseq
data(GlobalPatterns) phy_tree(GlobalPatterns)
data(GlobalPatterns) phy_tree(GlobalPatterns)
x
Assign a (new) phylogenetic tree to x
phy_tree(x) <- value ## S4 replacement method for signature 'phyloseq,phylo' phy_tree(x) <- value ## S4 replacement method for signature 'phyloseq,phyloseq' phy_tree(x) <- value
phy_tree(x) <- value ## S4 replacement method for signature 'phyloseq,phylo' phy_tree(x) <- value ## S4 replacement method for signature 'phyloseq,phyloseq' phy_tree(x) <- value
x |
(Required). |
value |
(Required). |
# data("esophagus") # An example of pruning to just the first 20 taxa in esophagus ex2a <- prune_taxa(taxa_names(esophagus)[1:20], esophagus) # The following 3 lines produces an ex2b that is equal to ex2a ex2b <- ex2a phy_tree(ex2b) <- phy_tree(esophagus) identical(ex2a, ex2b)
# data("esophagus") # An example of pruning to just the first 20 taxa in esophagus ex2a <- prune_taxa(taxa_names(esophagus)[1:20], esophagus) # The following 3 lines produces an ex2b that is equal to ex2a ex2b <- ex2a phy_tree(ex2b) <- phy_tree(esophagus) identical(ex2a, ex2b)
See the ape
package for details about this type of
representation of a phylogenetic tree.
It is used throughout the ape package.
phyloseq()
is a constructor method, This is the main method
suggested for constructing an experiment-level (phyloseq-class
)
object from its component data
(component data classes: otu_table-class
, sample_data-class
,
taxonomyTable-class
, phylo-class
).
phyloseq(...)
phyloseq(...)
... |
One or more component objects among the set of classes
defined by the phyloseq package, as well as |
The class of the returned object depends on the argument class(es). For an experiment-level object, two or more component data objects must be provided. Otherwise, if a single component-class is provided, it is simply returned as-is. The order of arguments does not matter.
data(esophagus) x1 = phyloseq(otu_table(esophagus), phy_tree(esophagus)) identical(x1, esophagus) # # data(GlobalPatterns) # # GP <- GlobalPatterns # # phyloseq(sample_data(GP), otu_table(GP)) # # phyloseq(otu_table(GP), phy_tree(GP)) # # phyloseq(tax_table(GP), otu_table(GP)) # # phyloseq(phy_tree(GP), otu_table(GP), sample_data(GP)) # # phyloseq(otu_table(GP), tax_table(GP), sample_data(GP)) # # phyloseq(otu_table(GP), phy_tree(GP), tax_table(GP), sample_data(GP))
data(esophagus) x1 = phyloseq(otu_table(esophagus), phy_tree(esophagus)) identical(x1, esophagus) # # data(GlobalPatterns) # # GP <- GlobalPatterns # # phyloseq(sample_data(GP), otu_table(GP)) # # phyloseq(otu_table(GP), phy_tree(GP)) # # phyloseq(tax_table(GP), otu_table(GP)) # # phyloseq(phy_tree(GP), otu_table(GP), sample_data(GP)) # # phyloseq(otu_table(GP), tax_table(GP), sample_data(GP)) # # phyloseq(otu_table(GP), phy_tree(GP), tax_table(GP), sample_data(GP))
No testing is performed by this function. The phyloseq data is converted
to the relevant DESeqDataSet
object, which can then be
tested in the negative binomial generalized linear model framework
of the DESeq
function in DESeq2 package.
See the
phyloseq-extensions
tutorials for more details.
phyloseq_to_deseq2(physeq, design, ...)
phyloseq_to_deseq2(physeq, design, ...)
physeq |
(Required). |
design |
(Required). A
|
... |
(Optional). Additional named arguments passed to |
A DESeqDataSet
object.
vignette("phyloseq-mixture-models")
The phyloseq-extensions tutorials.
# Check out the vignette phyloseq-mixture-models for more details. # vignette("phyloseq-mixture-models") data(soilrep) phyloseq_to_deseq2(soilrep, ~warmed)
# Check out the vignette phyloseq-mixture-models for more details. # vignette("phyloseq-mixture-models") data(soilrep) phyloseq_to_deseq2(soilrep, ~warmed)
No testing is performed by this function. The phyloseq data is converted
to the relevant MRexperiment-class
object, which can then be
tested in the zero-inflated mixture model framework
(e.g. fitZig
)
in the metagenomeSeq package.
See the
phyloseq-extensions
tutorials for more details.
phyloseq_to_metagenomeSeq(physeq, ...)
phyloseq_to_metagenomeSeq(physeq, ...)
physeq |
(Required). |
... |
(Optional). Additional named arguments passed
to |
A MRexperiment-class
object.
fitTimeSeries
fitLogNormal
fitZig
MRtable
MRfulltable
# Check out the vignette metagenomeSeq for more details. # vignette("metagenomeSeq") data(soilrep) phyloseq_to_metagenomeSeq(soilrep)
# Check out the vignette metagenomeSeq for more details. # vignette("metagenomeSeq") data(soilrep) phyloseq_to_metagenomeSeq(soilrep)
Contains all currently-supported component data classes:
otu_table-class
,
sample_data-class
,
taxonomyTable-class
("tax_table"
slot),
phylo
-class ("phy_tree"
slot),
and the XStringSet-class
("refseq"
slot).
There are several advantages
to storing your phylogenetic sequencing experiment as an instance of the
phyloseq class, not the least of which is that it is easy to return to the
data later and feel confident that the different data types “belong” to
one another. Furthermore, the phyloseq
constructor ensures that
the different data components have compatible indices (e.g. OTUs and samples),
and performs the necessary trimming automatically when you create your
“experiment-level” object. Downstream analyses are aware of which data
classes they require – and where to find them – often making your
phyloseq-class
object the only data argument required for analysis and plotting
functions (although there are many options and parameter arguments available
to you).
In the case of missing component data, the slots are set to NULL
. As
soon as a phyloseq-class
object is to be updated with new component
data (previously missing/NULL
or not), the indices of all components
are re-checked for compatibility and trimmed if necessary. This is to ensure
by design that components describe the same taxa/samples, and also that these
trimming/validity checks do not need to be repeated in downstream analyses.
slots:
a single object of class otu_table.
a single object of class sample_data.
a single object of class taxonomyTable.
a single object of the phylo
-class, from the ape package.
a biological sequence set object of a class that
inherits from the XStringSet-class
, from the Biostrings package.
The constructor, phyloseq
,
the merger merge_phyloseq
, and also the component
constructor/accessors otu_table
, sample_data
,
tax_table
, phy_tree
, and refseq
.
These will be migrated to "defunct"
status in the next release,
and removed completely in the release after that.
These functions are provided for compatibility with older version of
the phyloseq package. They may eventually be completely
removed.
deprecated_phyloseq_function(x, value, ...)
deprecated_phyloseq_function(x, value, ...)
x |
For assignment operators, the object that will undergo a replacement (object inside parenthesis). |
value |
For assignment operators, the value to replace with (the right side of the assignment). |
... |
For functions other than assignment operators, parameters to be passed to the modern version of the function (see table). |
plot_taxa_bar |
now a synonym for plot_bar
|
taxaplot |
now a synonym for plot_bar
|
taxtab |
now a synonym for tax_table
|
taxTab |
now a synonym for tax_table
|
sampleData |
now a synonym for sample_data
|
samData |
now a synonym for sample_data
|
sam_data |
now a synonym for sample_data
|
speciesSums |
now a synonym for taxa_sums
|
sampleSums |
now a synonym for sample_sums
|
nspecies |
now a synonym for ntaxa
|
species.names |
now a synonym for taxa_names
|
sampleNames |
now a synonym for sample_names
|
sample.names |
now a synonym for sample_names
|
getSamples |
now a synonym for get_sample
|
getSpecies |
now a synonym for get_taxa
|
rank.names |
now a synonym for rank_names
|
getTaxa |
now a synonym for get_taxa_unique
|
sample.variables |
now a synonym for sample_variables
|
getVariable |
now a synonym for get_variable
|
merge_species |
now a synonym for merge_taxa
|
otuTable |
now a synonym for otu_table
|
speciesarerows |
now a synonym for taxa_are_rows
|
speciesAreRows |
now a synonym for taxa_are_rows
|
plot_richness_estimates |
now a synonym for plot_richness
|
import_qiime_sampleData |
now a synonym for import_qiime_sample_data
|
filterfunSample |
now a synonym for filterfun_sample
|
genefilterSample |
now a synonym for genefilter_sample
|
prune_species |
now a synonym for prune_taxa
|
subset_species |
now a synonym for subset_taxa
|
tipglom |
now a synonym for tip_glom
|
taxglom |
now a synonym for tax_glom
|
tre |
now a synonym for phy_tree
|
show_mothur_list_cutoffs |
now a synonym for show_mothur_cutoffs
|
sam_data<- |
now a synonym for sample_data<-
|
sampleData<- |
now a synonym for sample_data<-
|
tre<- |
now a synonym for phy_tree<-
|
speciesAreRows<- |
now a synonym for taxa_are_rows<-
|
otuTable<- |
now a synonym for otu_table<-
|
taxTab<- |
now a synonym for tax_table<-
|
There are many useful examples of phyloseq barplot graphics in the
phyloseq online tutorials.
This function wraps ggplot2
plotting, and returns a ggplot2
graphic object
that can be saved or further modified with additional layers, options, etc.
The main purpose of this function is to quickly and easily create informative
summary graphics of the differences in taxa abundance between samples in
an experiment.
plot_bar(physeq, x="Sample", y="Abundance", fill=NULL, title=NULL, facet_grid=NULL)
plot_bar(physeq, x="Sample", y="Abundance", fill=NULL, title=NULL, facet_grid=NULL)
physeq |
(Required). An |
x |
(Optional). Optional, but recommended, especially if your data
is comprised of many samples. A character string.
The variable in the melted-data that should be mapped to the x-axis.
See |
y |
(Optional). A character string.
The variable in the melted-data that should be mapped to the y-axis.
Typically this will be |
fill |
(Optional). A character string. Indicates which sample variable
should be used to map to the fill color of the bars.
The default is |
title |
(Optional). Default |
facet_grid |
(Optional). A formula object.
It should describe the faceting you want in exactly the same way as for
|
A ggplot
2 graphic object – rendered in the graphical device
as the default print
/show
method.
data("GlobalPatterns") gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae") plot_bar(gp.ch) plot_bar(gp.ch, fill="Genus") plot_bar(gp.ch, x="SampleType", fill="Genus") plot_bar(gp.ch, "SampleType", fill="Genus", facet_grid=~Family) # See additional examples in the plot_bar online tutorial. Link above.
data("GlobalPatterns") gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae") plot_bar(gp.ch) plot_bar(gp.ch, fill="Genus") plot_bar(gp.ch, x="SampleType", fill="Genus") plot_bar(gp.ch, "SampleType", fill="Genus", facet_grid=~Family) # See additional examples in the plot_bar online tutorial. Link above.
Create a ggplot summary of gap statistic results
plot_clusgap(clusgap, title = "Gap Statistic results")
plot_clusgap(clusgap, title = "Gap Statistic results")
clusgap |
(Required).
An object of S3 class |
title |
(Optional). Character string.
The main title for the graphic.
Default is |
A ggplot
plot object.
The rendered graphic should be a plot of the gap statistic score
versus values for k
, the number of clusters.
# Load and process data data("soilrep") soilr = rarefy_even_depth(soilrep, rngseed=888) print(soilr) sample_variables(soilr) # Ordination sord = ordinate(soilr, "DCA") # Gap Statistic gs = gapstat_ord(sord, axes=1:4, verbose=FALSE) # Evaluate results with plots, etc. plot_scree(sord) plot_ordination(soilr, sord, color="Treatment") plot_clusgap(gs) print(gs, method="Tibs2001SEmax") # Non-ordination example, use cluster::clusGap function directly library("cluster") pam1 = function(x, k){list(cluster = pam(x, k, cluster.only=TRUE))} gs.pam.RU = clusGap(ruspini, FUN = pam1, K.max = 8, B = 60) gs.pam.RU plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data") mtext("k = 4 is best .. and k = 5 pretty close") plot_clusgap(gs.pam.RU)
# Load and process data data("soilrep") soilr = rarefy_even_depth(soilrep, rngseed=888) print(soilr) sample_variables(soilr) # Ordination sord = ordinate(soilr, "DCA") # Gap Statistic gs = gapstat_ord(sord, axes=1:4, verbose=FALSE) # Evaluate results with plots, etc. plot_scree(sord) plot_ordination(soilr, sord, color="Treatment") plot_clusgap(gs) print(gs, method="Tibs2001SEmax") # Non-ordination example, use cluster::clusGap function directly library("cluster") pam1 = function(x, k){list(cluster = pam(x, k, cluster.only=TRUE))} gs.pam.RU = clusGap(ruspini, FUN = pam1, K.max = 8, B = 60) gs.pam.RU plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data") mtext("k = 4 is best .. and k = 5 pretty close") plot_clusgap(gs.pam.RU)
There are many useful examples of phyloseq heatmap graphics in the
phyloseq online tutorials.
In a 2010 article in BMC Genomics, Rajaram and Oono show describe an
approach to creating a heatmap using ordination methods to organize the
rows and columns instead of (hierarchical) cluster analysis. In many cases
the ordination-based ordering does a much better job than h-clustering.
An immediately useful example of their approach is provided in the NeatMap
package for R. The NeatMap package can be used directly on the abundance
table (otu_table-class
) of phylogenetic-sequencing data, but
the NMDS or PCA ordination options that it supports are not based on ecological
distances. To fill this void, phyloseq provides the plot_heatmap()
function as an ecology-oriented variant of the NeatMap approach to organizing
a heatmap and build it using ggplot2 graphics tools.
The distance
and method
arguments are the same as for the
plot_ordination
function, and support large number of
distances and ordination methods, respectively, with a strong leaning toward
ecology.
This function also provides the options to re-label the OTU and sample
axis-ticks with a taxonomic name and/or sample variable, respectively,
in the hope that this might hasten your interpretation of the patterns
(See the sample.label
and taxa.label
documentation, below).
Note that this function makes no attempt to overlay hierarchical
clustering trees on the axes, as hierarchical clustering is not used to
organize the plot. Also note that each re-ordered axis repeats at the edge,
and so apparent clusters at the far right/left or top/bottom of the
heat-map may actually be the same. For now, the placement of this edge
can be considered arbitrary, so beware of this artifact of this graphical
representation. If you benefit from this phyloseq-specific implementation
of the NeatMap approach, please cite both our packages/articles.
plot_heatmap(physeq, method = "NMDS", distance = "bray", sample.label = NULL, taxa.label = NULL, low = "#000033", high = "#66CCFF", na.value = "black", trans = log_trans(4), max.label = 250, title = NULL, sample.order = NULL, taxa.order = NULL, first.sample = NULL, first.taxa = NULL, ...)
plot_heatmap(physeq, method = "NMDS", distance = "bray", sample.label = NULL, taxa.label = NULL, low = "#000033", high = "#66CCFF", na.value = "black", trans = log_trans(4), max.label = 250, title = NULL, sample.order = NULL, taxa.order = NULL, first.sample = NULL, first.taxa = NULL, ...)
physeq |
(Required). The data, in the form of an instance of the
|
method |
(Optional). The ordination method to use for organizing the heatmap. A great deal of the usefulness of a heatmap graphic depends upon the way in which the rows and columns are ordered. |
distance |
(Optional). A character string.
The ecological distance method to use in the ordination.
See |
sample.label |
(Optional). A character string. The sample variable by which you want to re-label the sample (horizontal) axis. |
taxa.label |
(Optional). A character string.
The name of the taxonomic rank by which you want to
re-label the taxa/species/OTU (vertical) axis.
You can see available options in your data using
|
low |
(Optional). A character string. An R color.
See |
high |
(Optional). A character string. An R color.
See |
na.value |
(Optional). A character string. An R color.
See |
trans |
(Optional). |
max.label |
(Optional). Integer. Default is 250. The maximum number of labeles to fit on a given axis (either x or y). If number of taxa or samples exceeds this value, the corresponding axis will be stripped of any labels. This supercedes any arguments provided to
|
title |
(Optional). Default |
sample.order |
(Optional). Default |
taxa.order |
(Optional). Default |
first.sample |
(Optional). Default |
first.taxa |
(Optional). Default |
... |
(Optional). Additional parameters passed to |
This approach borrows heavily from the heatmap1
function in the
NeatMap
package. Highly recommended, and we are grateful for their
package and ideas, which we have adapted for our specific purposes here,
but did not use an explicit dependency. At the time of the first version
of this implementation, the NeatMap package depends on the rgl-package,
which is not needed in phyloseq, at present. Although likely a transient
issue, the rgl-package has some known installation issues that have further
influenced to avoid making NeatMap a formal dependency (Although we love
both NeatMap and rgl!).
A heatmap plot, in the form of a ggplot2
plot object,
which can be further saved and modified.
Because this function relies so heavily in principle, and in code, on some of the functionality in NeatMap, please site their article if you use this function in your work.
Rajaram, S., & Oono, Y. (2010). NeatMap–non-clustering heat map alternatives in R. BMC Bioinformatics, 11, 45.
Please see further examples in the phyloseq online tutorials.
data("GlobalPatterns") gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota") # FYI, the base-R function uses a non-ecological ordering scheme, # but does add potentially useful hclust dendrogram to the sides... gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota") # Remove the nearly-empty samples (e.g. 10 reads or less) gpac = prune_samples(sample_sums(gpac) > 50, gpac) # Arbitrary order if method set to NULL plot_heatmap(gpac, method=NULL, sample.label="SampleType", taxa.label="Family") # Use ordination plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family") # Use ordination for OTUs, but not sample-order plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family", sample.order="SampleType") # Specifying both orders omits any attempt to use ordination. The following should be the same. p0 = plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family", taxa.order="Phylum", sample.order="SampleType") p1 = plot_heatmap(gpac, method=NULL, sample.label="SampleType", taxa.label="Family", taxa.order="Phylum", sample.order="SampleType") #expect_equivalent(p0, p1) # Example: Order matters. Random ordering of OTU indices is difficult to interpret, even with structured sample order rando = sample(taxa_names(gpac), size=ntaxa(gpac), replace=FALSE) plot_heatmap(gpac, method=NULL, sample.label="SampleType", taxa.label="Family", taxa.order=rando, sample.order="SampleType") # # Select the edges of each axis. # First, arbitrary edge, ordering plot_heatmap(gpac, method=NULL) # Second, biological-ordering (instead of default ordination-ordering), but arbitrary edge plot_heatmap(gpac, taxa.order="Family", sample.order="SampleType") # Third, biological ordering, selected edges plot_heatmap(gpac, taxa.order="Family", sample.order="SampleType", first.taxa="546313", first.sample="NP2") # Fourth, add meaningful labels plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family", taxa.order="Family", sample.order="SampleType", first.taxa="546313", first.sample="NP2")
data("GlobalPatterns") gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota") # FYI, the base-R function uses a non-ecological ordering scheme, # but does add potentially useful hclust dendrogram to the sides... gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota") # Remove the nearly-empty samples (e.g. 10 reads or less) gpac = prune_samples(sample_sums(gpac) > 50, gpac) # Arbitrary order if method set to NULL plot_heatmap(gpac, method=NULL, sample.label="SampleType", taxa.label="Family") # Use ordination plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family") # Use ordination for OTUs, but not sample-order plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family", sample.order="SampleType") # Specifying both orders omits any attempt to use ordination. The following should be the same. p0 = plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family", taxa.order="Phylum", sample.order="SampleType") p1 = plot_heatmap(gpac, method=NULL, sample.label="SampleType", taxa.label="Family", taxa.order="Phylum", sample.order="SampleType") #expect_equivalent(p0, p1) # Example: Order matters. Random ordering of OTU indices is difficult to interpret, even with structured sample order rando = sample(taxa_names(gpac), size=ntaxa(gpac), replace=FALSE) plot_heatmap(gpac, method=NULL, sample.label="SampleType", taxa.label="Family", taxa.order=rando, sample.order="SampleType") # # Select the edges of each axis. # First, arbitrary edge, ordering plot_heatmap(gpac, method=NULL) # Second, biological-ordering (instead of default ordination-ordering), but arbitrary edge plot_heatmap(gpac, taxa.order="Family", sample.order="SampleType") # Third, biological ordering, selected edges plot_heatmap(gpac, taxa.order="Family", sample.order="SampleType", first.taxa="546313", first.sample="NP2") # Fourth, add meaningful labels plot_heatmap(gpac, sample.label="SampleType", taxa.label="Family", taxa.order="Family", sample.order="SampleType", first.taxa="546313", first.sample="NP2")
There are many useful examples of phyloseq network graphics in the
phyloseq online tutorials.
A custom plotting function for displaying networks
using advanced ggplot
2 formatting.
Note that this function is a performance and interface revision to
plot_network
, which requires an igraph
object as its first argument.
This new function is more in-line with other
plot_*
functions in the phyloseq-package
, in that its
first/main argument is a phyloseq-class
instance.
Edges in the network are created if the distance between
nodes is below a (potentially arbitrary) threshold,
and special care should be given to considering the choice of this threshold.
However, network line thickness and opacity is scaled according to the
similarity of vertices (either samples or taxa),
helping to temper, somewhat, the effect of the threshold.
Also note that the choice of network layout algorithm can have a large effect
on the impression and interpretability of the network graphic,
and you may want to familiarize yourself with some of these options
(see the laymeth
argument).
plot_net(physeq, distance = "bray", type = "samples", maxdist = 0.7, laymeth = "fruchterman.reingold", color = NULL, shape = NULL, rescale = FALSE, point_size = 5, point_alpha = 1, point_label = NULL, hjust = 1.35, title = NULL)
plot_net(physeq, distance = "bray", type = "samples", maxdist = 0.7, laymeth = "fruchterman.reingold", color = NULL, shape = NULL, rescale = FALSE, point_size = 5, point_alpha = 1, point_label = NULL, hjust = 1.35, title = NULL)
physeq |
(Required).
The |
distance |
(Optional). Default is |
type |
(Optional). Default |
maxdist |
(Optional). Default |
laymeth |
(Optional). Default |
color |
(Optional). Default |
shape |
(Optional). Default |
rescale |
(Optional). Logical. Default |
point_size |
(Optional). Default |
point_alpha |
(Optional). Default |
point_label |
(Optional). Default |
hjust |
(Optional). Default |
title |
(Optional). Default |
A ggplot
2 network plot.
Will render to default graphic device automatically as print side effect.
Can also be saved, further manipulated, or rendered to
a vector or raster file using ggsave
.
Original network plotting functions:
data(enterotype) plot_net(enterotype, color="SeqTech", maxdist = 0.3) plot_net(enterotype, color="SeqTech", maxdist = 0.3, laymeth = "auto") plot_net(enterotype, color="SeqTech", maxdist = 0.3, laymeth = "svd") plot_net(enterotype, color="SeqTech", maxdist = 0.3, laymeth = "circle") plot_net(enterotype, color="SeqTech", shape="Enterotype", maxdist = 0.3, laymeth = "circle")
data(enterotype) plot_net(enterotype, color="SeqTech", maxdist = 0.3) plot_net(enterotype, color="SeqTech", maxdist = 0.3, laymeth = "auto") plot_net(enterotype, color="SeqTech", maxdist = 0.3, laymeth = "svd") plot_net(enterotype, color="SeqTech", maxdist = 0.3, laymeth = "circle") plot_net(enterotype, color="SeqTech", shape="Enterotype", maxdist = 0.3, laymeth = "circle")
There are many useful examples of phyloseq network graphics in the
phyloseq online tutorials.
A custom plotting function for displaying networks
using advanced ggplot
2 formatting.
The network itself should be represented using
the igraph
package.
For the phyloseq-package
it is suggested that the network object
(argument g
)
be created using the
make_network
function,
and based upon sample-wise or taxa-wise microbiome ecological distances
calculated from a phylogenetic sequencing experiment
(phyloseq-class
).
In this case, edges in the network are created if the distance between
nodes is below a potentially arbitrary threshold,
and special care should be given to considering the choice of this threshold.
plot_network(g, physeq=NULL, type="samples", color=NULL, shape=NULL, point_size=4, alpha=1, label="value", hjust = 1.35, line_weight=0.5, line_color=color, line_alpha=0.4, layout.method=layout.fruchterman.reingold, title=NULL)
plot_network(g, physeq=NULL, type="samples", color=NULL, shape=NULL, point_size=4, alpha=1, label="value", hjust = 1.35, line_weight=0.5, line_color=color, line_alpha=0.4, layout.method=layout.fruchterman.reingold, title=NULL)
g |
(Required). An |
physeq |
(Optional). Default |
type |
(Optional). Default |
color |
(Optional). Default |
shape |
(Optional). Default |
point_size |
(Optional). Default |
alpha |
(Optional). Default |
label |
(Optional). Default |
hjust |
(Optional). Default |
line_weight |
(Optional). Default |
line_color |
(Optional). Default |
line_alpha |
(Optional). Default |
layout.method |
(Optional). Default |
title |
(Optional). Default |
A ggplot
2 plot representing the network,
with optional mapping of variable(s) to point color or shape.
This code was adapted from a repo original hosted on GitHub by Scott Chamberlain: https://github.com/SChamberlain/gggraph
The code most directly used/modified was first posted here: http://www.r-bloggers.com/basic-ggplot2-network-graphs/
data(enterotype) ig <- make_network(enterotype, max.dist=0.3) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL) # Change distance parameter ig <- make_network(enterotype, max.dist=0.2) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
data(enterotype) ig <- make_network(enterotype, max.dist=0.3) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL) # Change distance parameter ig <- make_network(enterotype, max.dist=0.2) plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
There are many useful examples of phyloseq ordination graphics in the
phyloseq online tutorials.
Convenience wrapper for plotting ordination results as a
ggplot2
-graphic, including
additional annotation in the form of shading, shape, and/or labels of
sample variables.
plot_ordination(physeq, ordination, type = "samples", axes = 1:2, color = NULL, shape = NULL, label = NULL, title = NULL, justDF = FALSE)
plot_ordination(physeq, ordination, type = "samples", axes = 1:2, color = NULL, shape = NULL, label = NULL, title = NULL, justDF = FALSE)
physeq |
(Required). |
ordination |
(Required). An ordination object. Many different classes
of ordination are defined by |
type |
(Optional). The plot type. Default is |
axes |
(Optional). A 2-element vector indicating the axes of the
ordination that should be used for plotting.
Can be |
color |
(Optional). Default Note that the color scheme is chosen automatically
by |
shape |
(Optional). Default The shape scale is chosen automatically by |
label |
(Optional). Default |
title |
(Optional). Default |
justDF |
(Optional). Default |
A ggplot
plot object, graphically summarizing
the ordination result for the specified axes.
Many more examples are included in the phyloseq online tutorials.
Also see the general wrapping function:
# See other examples at # http://joey711.github.io/phyloseq/plot_ordination-examples data(GlobalPatterns) GP = prune_taxa(names(sort(taxa_sums(GlobalPatterns), TRUE)[1:50]), GlobalPatterns) gp_bray_pcoa = ordinate(GP, "CCA", "bray") plot_ordination(GP, gp_bray_pcoa, "samples", color="SampleType")
# See other examples at # http://joey711.github.io/phyloseq/plot_ordination-examples data(GlobalPatterns) GP = prune_taxa(names(sort(taxa_sums(GlobalPatterns), TRUE)[1:50]), GlobalPatterns) gp_bray_pcoa = ordinate(GP, "CCA", "bray") plot_ordination(GP, gp_bray_pcoa, "samples", color="SampleType")
There are many useful examples of phyloseq graphics functions in the
phyloseq online tutorials.
The specific plot type is chosen according to available non-empty slots.
This is mainly for syntactic convenience and quick-plotting. See links below
for some examples of available graphics tools available in the
phyloseq-package
.
plot_phyloseq(physeq, ...) ## S4 method for signature 'phyloseq' plot_phyloseq(physeq, ...)
plot_phyloseq(physeq, ...) ## S4 method for signature 'phyloseq' plot_phyloseq(physeq, ...)
physeq |
(Required). |
... |
(Optional). Additional parameters to be passed on to the respective specific plotting function. See below for different plotting functions that might be called by this generic plotting wrapper. |
A plot is created. The nature and class of the plot depends on
the physeq
argument, specifically, which component data classes
are present.
plot_ordination
plot_heatmap
plot_tree
plot_network
plot_bar
plot_richness
data(esophagus) plot_phyloseq(esophagus)
data(esophagus) plot_phyloseq(esophagus)
There are many useful examples of alpha-diversity graphics in the
phyloseq online tutorials.
This function estimates a number of alpha-diversity metrics using the
estimate_richness
function,
and returns a ggplot
plotting object.
The plot generated by this function will include every sample
in physeq
, but they can be further grouped on the horizontal axis
through the argument to x
,
and shaded according to the argument to color
(see below).
You must use untrimmed, non-normalized count data for meaningful results,
as many of these estimates are highly dependent on the number of singletons.
You can always trim the data later on if needed,
just not before using this function.
plot_richness(physeq, x = "samples", color = NULL, shape = NULL, title = NULL, scales = "free_y", nrow = 1, shsi = NULL, measures = NULL, sortby = NULL)
plot_richness(physeq, x = "samples", color = NULL, shape = NULL, title = NULL, scales = "free_y", nrow = 1, shsi = NULL, measures = NULL, sortby = NULL)
physeq |
(Required). |
x |
(Optional). A variable to map to the horizontal axis. The vertical
axis will be mapped to the alpha diversity index/estimate
and have units of total taxa, and/or index value (dimensionless).
This parameter ( The default value is |
color |
(Optional). Default |
shape |
(Optional). Default |
title |
(Optional). Default |
scales |
(Optional). Default |
nrow |
(Optional). Default is |
shsi |
(Deprecated). No longer supported. Instead see 'measures' below. |
measures |
(Optional). Default is |
sortby |
(Optional). A character string subset of |
NOTE: Because this plotting function incorporates the output from
estimate_richness
, the variable names of that output should
not be used as x
or color
(even if it works, the resulting
plot might be kindof strange, and not the intended behavior of this function).
The following are the names you will want to avoid using in x
or color
:
c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher")
.
A ggplot
plot object summarizing
the richness estimates, and their standard error.
There are many more interesting examples at the phyloseq online tutorials.
## There are many more interesting examples at the phyloseq online tutorials. ## http://joey711.github.io/phyloseq/plot_richness-examples data("soilrep") plot_richness(soilrep, measures=c("InvSimpson", "Fisher")) plot_richness(soilrep, "Treatment", "warmed", measures=c("Chao1", "ACE", "InvSimpson"), nrow=3) data("GlobalPatterns") plot_richness(GlobalPatterns, x="SampleType", measures=c("InvSimpson")) plot_richness(GlobalPatterns, x="SampleType", measures=c("Chao1", "ACE", "InvSimpson"), nrow=3) plot_richness(GlobalPatterns, x="SampleType", measures=c("Chao1", "ACE", "InvSimpson"), nrow=3, sortby = "Chao1")
## There are many more interesting examples at the phyloseq online tutorials. ## http://joey711.github.io/phyloseq/plot_richness-examples data("soilrep") plot_richness(soilrep, measures=c("InvSimpson", "Fisher")) plot_richness(soilrep, "Treatment", "warmed", measures=c("Chao1", "ACE", "InvSimpson"), nrow=3) data("GlobalPatterns") plot_richness(GlobalPatterns, x="SampleType", measures=c("InvSimpson")) plot_richness(GlobalPatterns, x="SampleType", measures=c("Chao1", "ACE", "InvSimpson"), nrow=3) plot_richness(GlobalPatterns, x="SampleType", measures=c("Chao1", "ACE", "InvSimpson"), nrow=3, sortby = "Chao1")
Convenience wrapper for plotting ordination eigenvalues (if available)
using a ggplot2
-graphic.
plot_scree(ordination, title = NULL)
plot_scree(ordination, title = NULL)
ordination |
(Required). An ordination object. Many different classes
of ordination are defined by |
title |
(Optional). Default |
A ggplot
plot object, graphically summarizing
the ordination result for the specified axes.
# First load and trim a dataset data("GlobalPatterns") GP = prune_taxa(names(sort(taxa_sums(GlobalPatterns), TRUE)[1:50]), GlobalPatterns) # Test plots (preforms ordination in-line, then makes scree plot) plot_scree(ordinate(GP, "DPCoA", "bray")) plot_scree(ordinate(GP, "PCoA", "bray")) # Empty return with message plot_scree(ordinate(GP, "NMDS", "bray")) # Constrained ordinations plot_scree(ordinate(GP, "CCA", formula=~SampleType)) plot_scree(ordinate(GP, "RDA", formula=~SampleType)) plot_scree(ordinate(GP, "CAP", formula=~SampleType)) # Deprecated example of constrained ordination (emits a warning) #plot_scree(ordinate(GP ~ SampleType, "RDA")) plot_scree(ordinate(GP, "DCA")) plot_ordination(GP, ordinate(GP, "DCA"), type="scree")
# First load and trim a dataset data("GlobalPatterns") GP = prune_taxa(names(sort(taxa_sums(GlobalPatterns), TRUE)[1:50]), GlobalPatterns) # Test plots (preforms ordination in-line, then makes scree plot) plot_scree(ordinate(GP, "DPCoA", "bray")) plot_scree(ordinate(GP, "PCoA", "bray")) # Empty return with message plot_scree(ordinate(GP, "NMDS", "bray")) # Constrained ordinations plot_scree(ordinate(GP, "CCA", formula=~SampleType)) plot_scree(ordinate(GP, "RDA", formula=~SampleType)) plot_scree(ordinate(GP, "CAP", formula=~SampleType)) # Deprecated example of constrained ordination (emits a warning) #plot_scree(ordinate(GP ~ SampleType, "RDA")) plot_scree(ordinate(GP, "DCA")) plot_ordination(GP, ordinate(GP, "DCA"), type="scree")
There are many useful examples of phyloseq tree graphics in the
phyloseq online tutorials.
This function is intended to facilitate easy graphical investigation of
the phylogenetic tree, as well as sample data. Note that for phylogenetic
sequencing of samples with large richness, some of the options in this
function will be prohibitively slow to render, or too dense to be
interpretable. A rough “rule of thumb” is to use subsets of data
with not many more than 200 OTUs per plot, sometimes less depending on the
complexity of the additional annotations being mapped to the tree. It is
usually possible to create an unreadable, uninterpretable tree with modern
datasets. However, the goal should be toward parameter settings and data
subsets that convey (honestly, accurately) some biologically relevant
feature of the data. One of the goals of the phyloseq-package
is to make the determination of these features/settings as easy as possible.
plot_tree(physeq, method = "sampledodge", nodelabf = NULL, color = NULL, shape = NULL, size = NULL, min.abundance = Inf, label.tips = NULL, text.size = NULL, sizebase = 5, base.spacing = 0.02, ladderize = FALSE, plot.margin = 0.2, title = NULL, treetheme = NULL, justify = "jagged")
plot_tree(physeq, method = "sampledodge", nodelabf = NULL, color = NULL, shape = NULL, size = NULL, min.abundance = Inf, label.tips = NULL, text.size = NULL, sizebase = 5, base.spacing = 0.02, ladderize = FALSE, plot.margin = 0.2, title = NULL, treetheme = NULL, justify = "jagged")
physeq |
(Required). The data about which you want to
plot and annotate a phylogenetic tree, in the form of a
single instance of the |
method |
(Optional). Character string. Default |
nodelabf |
(Optional). A function. Default |
color |
(Optional). Character string. Default |
shape |
(Optional). Character string. Default |
size |
(Optional). Character string. Default |
min.abundance |
(Optional). Numeric.
The minimum number of individuals required to label a point
with the precise number.
Default is |
label.tips |
(Optional). Character string. Default is |
text.size |
(Optional). Numeric. Should be positive. The
size parameter used to control the text size of taxa labels.
Default is |
sizebase |
(Optional). Numeric. Should be positive. The base of the logarithm used to scale point sizes to graphically represent abundance of species in a given sample. Default is 5. |
base.spacing |
(Optional). Numeric. Default is |
ladderize |
(Optional). Boolean or character string (either
|
plot.margin |
(Optional). Numeric. Default is |
title |
(Optional). Default |
treetheme |
(Optional).
A custom |
justify |
(Optional). A character string indicating the
type of justification to use on dodged points and tip labels.
A value of |
This function received an early development contribution from the work of
Gregory Jordan via the ggphylo package.
plot_tree
has since been re-written.
For details see tree_layout
.
A ggplot
2 plot.
There are many useful examples of phyloseq tree graphics in the phyloseq online tutorials.
# # Using plot_tree() with the esophagus dataset. # # Please note that many more interesting examples are shown # # in the online tutorials" # # http://joey711.github.io/phyloseq/plot_tree-examples data(esophagus) # plot_tree(esophagus) # plot_tree(esophagus, color="Sample") # plot_tree(esophagus, size="Abundance") # plot_tree(esophagus, size="Abundance", color="samples") plot_tree(esophagus, size="Abundance", color="Sample", base.spacing=0.03) plot_tree(esophagus, size="abundance", color="samples", base.spacing=0.03)
# # Using plot_tree() with the esophagus dataset. # # Please note that many more interesting examples are shown # # in the online tutorials" # # http://joey711.github.io/phyloseq/plot_tree-examples data(esophagus) # plot_tree(esophagus) # plot_tree(esophagus, color="Sample") # plot_tree(esophagus, size="Abundance") # plot_tree(esophagus, size="Abundance", color="samples") plot_tree(esophagus, size="Abundance", color="Sample", base.spacing=0.03) plot_tree(esophagus, size="abundance", color="samples", base.spacing=0.03)
An S4 Generic method for pruning/filtering unwanted samples by defining those you want to keep.
prune_samples(samples, x) ## S4 method for signature 'character,otu_table' prune_samples(samples, x) ## S4 method for signature 'character,sample_data' prune_samples(samples, x) ## S4 method for signature 'character,phyloseq' prune_samples(samples, x) ## S4 method for signature 'logical,ANY' prune_samples(samples, x)
prune_samples(samples, x) ## S4 method for signature 'character,otu_table' prune_samples(samples, x) ## S4 method for signature 'character,sample_data' prune_samples(samples, x) ## S4 method for signature 'character,phyloseq' prune_samples(samples, x) ## S4 method for signature 'logical,ANY' prune_samples(samples, x)
samples |
(Required). A character vector of the samples in object x that you want to
keep – OR alternatively – a logical vector where the kept samples are TRUE, and length
is equal to the number of samples in object x. If |
x |
A phyloseq object. |
The class of the object returned by prune_samples
matches
the class of the phyloseq object, x
.
data(GlobalPatterns) # Subset to just the Chlamydiae phylum. GP.chl <- subset_taxa(GlobalPatterns, Phylum=="Chlamydiae") # Remove the samples that have less than 20 total reads from Chlamydiae GP.chl <- prune_samples(sample_sums(GP.chl)>=20, GP.chl) # (p <- plot_tree(GP.chl, color="SampleType", shape="Family", label.tips="Genus", size="abundance"))
data(GlobalPatterns) # Subset to just the Chlamydiae phylum. GP.chl <- subset_taxa(GlobalPatterns, Phylum=="Chlamydiae") # Remove the samples that have less than 20 total reads from Chlamydiae GP.chl <- prune_samples(sample_sums(GP.chl)>=20, GP.chl) # (p <- plot_tree(GP.chl, color="SampleType", shape="Family", label.tips="Genus", size="abundance"))
An S4 Generic method for removing (pruning) unwanted OTUs/taxa from phylogenetic
objects, including phylo-class trees, as well as native phyloseq package
objects. This is particularly useful for pruning a phyloseq object that has
more than one component that describes OTUs.
Credit: the phylo
-class version is adapted from
prune.sample.
prune_taxa(taxa, x) ## S4 method for signature ''NULL',ANY' prune_taxa(taxa, x) ## S4 method for signature 'logical,ANY' prune_taxa(taxa, x) ## S4 method for signature 'character,phylo' prune_taxa(taxa, x) ## S4 method for signature 'character,otu_table' prune_taxa(taxa, x) ## S4 method for signature 'character,sample_data' prune_taxa(taxa, x) ## S4 method for signature 'character,phyloseq' prune_taxa(taxa, x) ## S4 method for signature 'character,taxonomyTable' prune_taxa(taxa, x) ## S4 method for signature 'character,XStringSet' prune_taxa(taxa, x)
prune_taxa(taxa, x) ## S4 method for signature ''NULL',ANY' prune_taxa(taxa, x) ## S4 method for signature 'logical,ANY' prune_taxa(taxa, x) ## S4 method for signature 'character,phylo' prune_taxa(taxa, x) ## S4 method for signature 'character,otu_table' prune_taxa(taxa, x) ## S4 method for signature 'character,sample_data' prune_taxa(taxa, x) ## S4 method for signature 'character,phyloseq' prune_taxa(taxa, x) ## S4 method for signature 'character,taxonomyTable' prune_taxa(taxa, x) ## S4 method for signature 'character,XStringSet' prune_taxa(taxa, x)
taxa |
(Required). A character vector of the taxa in object x that you want to
keep – OR alternatively – a logical vector where the kept taxa are TRUE, and length
is equal to the number of taxa in object x. If |
x |
(Required). A phylogenetic object, including |
The class of the object returned by prune_taxa
matches
the class of the argument, x
.
data("esophagus") esophagus plot(sort(taxa_sums(esophagus), TRUE), type="h", ylim=c(0, 50)) x1 = prune_taxa(taxa_sums(esophagus) > 10, esophagus) x2 = prune_taxa(names(sort(taxa_sums(esophagus), TRUE))[1:9], esophagus) identical(x1, x2)
data("esophagus") esophagus plot(sort(taxa_sums(esophagus), TRUE), type="h", ylim=c(0, 50)) x1 = prune_taxa(taxa_sums(esophagus) > 10, esophagus) x2 = prune_taxa(names(sort(taxa_sums(esophagus), TRUE))[1:9], esophagus) identical(x1, x2)
The psmelt function is a specialized melt function for melting phyloseq objects
(instances of the phyloseq class), usually for producing graphics
with ggplot2
. psmelt
relies heavily on the
melt
and merge
functions.
The naming conventions used in downstream phyloseq graphics functions
have reserved the following variable names that should not be used
as the names of sample_variables
or taxonomic rank_names
.
These reserved names are c("Sample", "Abundance", "OTU")
.
Also, you should not have identical names for
sample variables and taxonomic ranks.
That is, the intersection of the output of the following two functions
sample_variables
, rank_names
should be an empty vector
(e.g. intersect(sample_variables(physeq), rank_names(physeq))
).
All of these potential name collisions are checked-for
and renamed automtically with a warning.
However, if you (re)name your variables accordingly ahead of time,
it will reduce confusion and eliminate the warnings.
psmelt(physeq)
psmelt(physeq)
physeq |
(Required). An |
Note that
“melted” phyloseq data is stored much less efficiently,
and so RAM storage issues could arise with a smaller dataset
(smaller number of samples/OTUs/variables) than one might otherwise expect.
For common sizes of graphics-ready datasets, however,
this should not be a problem.
Because the number of OTU entries has a large effect on the RAM requirement,
methods to reduce the number of separate OTU entries –
for instance by agglomerating OTUs based on phylogenetic distance
using tip_glom
–
can help alleviate RAM usage problems.
This function is made user-accessible for flexibility,
but is also used extensively by plot functions in phyloseq.
A data.frame
-class table.
data("GlobalPatterns") gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae") mdf = psmelt(gp.ch) nrow(mdf) ncol(mdf) colnames(mdf) head(rownames(mdf)) # Create a ggplot similar to library("ggplot2") p = ggplot(mdf, aes(x=SampleType, y=Abundance, fill=Genus)) p = p + geom_bar(color="black", stat="identity", position="stack") print(p)
data("GlobalPatterns") gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae") mdf = psmelt(gp.ch) nrow(mdf) ncol(mdf) colnames(mdf) head(rownames(mdf)) # Create a ggplot similar to library("ggplot2") p = ggplot(mdf, aes(x=SampleType, y=Abundance, fill=Genus)) p = p + geom_bar(color="black", stat="identity", position="stack") print(p)
This is a simple accessor function to make it more convenient to determine
the taxonomic ranks that are available in a given phyloseq-class
object.
rank_names(physeq, errorIfNULL=TRUE)
rank_names(physeq, errorIfNULL=TRUE)
physeq |
(Required).
|
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
Character vector. The names of the available taxonomic ranks.
get_taxa
taxa_names
sample_names
data(enterotype) rank_names(enterotype)
data(enterotype) rank_names(enterotype)
Please note that the authors of phyloseq do not advocate using this
as a normalization procedure, despite its recent popularity.
Our justifications for using alternative approaches to address
disparities in library sizes have been made available as
an article in PLoS Computational Biology.
See phyloseq_to_deseq2
for a recommended alternative to rarefying
directly supported in the phyloseq package, as well as
the supplemental materials for the PLoS-CB article
and the phyloseq extensions repository on GitHub.
Nevertheless, for comparison and demonstration, the rarefying procedure is implemented
here in good faith and with options we hope are useful.
This function uses the standard R sample
function to
resample from the abundance values
in the otu_table
component of the first argument,
physeq
.
Often one of the major goals of this procedure is to achieve parity in
total number of counts between samples, as an alternative to other formal
normalization procedures, which is why a single value for the
sample.size
is expected.
This kind of resampling can be performed with and without replacement,
with replacement being the more computationally-efficient, default setting.
See the replace
parameter documentation for more details.
We recommended that you explicitly select a random number generator seed
before invoking this function, or, alternatively, that you
explicitly provide a single positive integer argument as rngseed
.
rarefy_even_depth(physeq, sample.size = min(sample_sums(physeq)), rngseed = FALSE, replace = TRUE, trimOTUs = TRUE, verbose = TRUE)
rarefy_even_depth(physeq, sample.size = min(sample_sums(physeq)), rngseed = FALSE, replace = TRUE, trimOTUs = TRUE, verbose = TRUE)
physeq |
(Required). A |
sample.size |
(Optional). A single integer value equal to the number
of reads being simulated, also known as the depth,
and also equal to each value returned by |
rngseed |
(Optional). A single integer value passed to
|
replace |
(Optional). Logical. Whether to sample with replacement
( |
trimOTUs |
(Optional). |
verbose |
(Optional). Logical. Default is |
This approach is sometimes mistakenly called “rarefaction”, which
in physics refers to a form of wave decompression;
but in this context, ecology, the term refers to a
repeated sampling procedure to assess species richness,
first proposed in 1968 by Howard Sanders.
In contrast, the procedure implemented here is used as an ad hoc means to
normalize microbiome counts that have
resulted from libraries of widely-differing sizes.
Here we have intentionally adopted an alternative
name, rarefy
, that has also been used recently
to describe this process
and, to our knowledge, not previously used in ecology.
Make sure to use set.seed
for exactly-reproducible results
of the random subsampling.
An object of class phyloseq
.
Only the otu_table
component is modified.
# Test with esophagus dataset data("esophagus") esorepT = rarefy_even_depth(esophagus, replace=TRUE) esorepF = rarefy_even_depth(esophagus, replace=FALSE) sample_sums(esophagus) sample_sums(esorepT) sample_sums(esorepF) ## NRun Manually: Too slow! # data("GlobalPatterns") # GPrepT = rarefy_even_depth(GlobalPatterns, 1E5, replace=TRUE) ## Actually just this one is slow # system.time(GPrepF <- rarefy_even_depth(GlobalPatterns, 1E5, replace=FALSE))
# Test with esophagus dataset data("esophagus") esorepT = rarefy_even_depth(esophagus, replace=TRUE) esorepF = rarefy_even_depth(esophagus, replace=FALSE) sample_sums(esophagus) sample_sums(esorepT) sample_sums(esorepF) ## NRun Manually: Too slow! # data("GlobalPatterns") # GPrepT = rarefy_even_depth(GlobalPatterns, 1E5, replace=TRUE) ## Actually just this one is slow # system.time(GPrepF <- rarefy_even_depth(GlobalPatterns, 1E5, replace=FALSE))
This function is a convenience wrapper around the
read.tree
(Newick-format) and
read.nexus
(Nexus-format) importers provided by
the ape-package
. This function attempts to return a valid
tree if possible using either format importer. If it fails, it silently
returns NULL
by default, rather than throwing a show-stopping error.
read_tree(treefile, errorIfNULL=FALSE, ...)
read_tree(treefile, errorIfNULL=FALSE, ...)
treefile |
(Required). A character string implying a file |
errorIfNULL |
(Optional). Logical. Should an error be thrown if no tree
can be extracted from the connection?
Default is |
... |
(Optional). Additional parameter(s) passed to the relevant tree-importing function. |
If successful, returns a phylo
-class object as defined
in the ape-package
. Returns NULL if neither tree-reading function worked.
read_tree(system.file("extdata", "esophagus.tree.gz", package="phyloseq")) read_tree(system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq"))
read_tree(system.file("extdata", "esophagus.tree.gz", package="phyloseq")) read_tree(system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq"))
In principal, this is a standard newick format, that can be imported
into R using read_tree
,
which in-turn utilizes read.tree
.
However, read.tree
has failed to import
recent (October 2012 and later) releases of the GreenGenes tree,
and this problem has been traced to the additional annotations
added to some internal nodes
that specify taxonomic classification between single-quotes.
To solve this problem and create a clear container
for fixing future problems with the format of GreenGenes-released trees,
this function is available in phyloseq and exported for users.
It is also referenced in the documentation of the import functions
for QIIME legacy and BIOM format importers –
import_qiime
and import_biom
, respectively.
However, since the precise format of the tree is not restricted to GreenGenes trees
by QIIME or for the biom-format, this function is not called
automatically by those aforementioned import functions.
If your tree is formatted like, or is one of, the official GreenGenes
release trees, then you should use this function and provide its output
to your relevant import function.
read_tree_greengenes(treefile)
read_tree_greengenes(treefile)
treefile |
(Required). A character string implying
a file |
A tree, represented as a phylo
object.
# Read the May 2013, 73% similarity official tree, # included as extra data in phyloseq. treefile = system.file("extdata", "gg13-5-73.tree.gz", package="phyloseq") x = read_tree_greengenes(treefile) x class(x) y = read_tree(treefile) y class(y) ## Not run, causes an error: # library("ape") # read.tree(treefile)
# Read the May 2013, 73% similarity official tree, # included as extra data in phyloseq. treefile = system.file("extdata", "gg13-5-73.tree.gz", package="phyloseq") x = read_tree_greengenes(treefile) x class(x) y = read_tree(treefile) y class(y) ## Not run, causes an error: # library("ape") # read.tree(treefile)
XStringSet
-class) from object.This is the suggested method
for accessing
the phylogenetic tree, (XStringSet
-class)
from a phyloseq data object (phyloseq-class
).
Like other accessors (see See Also, below), the default behavior of this method
is to stop with an
error if physeq
is a phyloseq-class
but does not
contain reference sequences (the component data type you are trying to access in this case).
refseq(physeq, errorIfNULL=TRUE) ## S4 method for signature 'ANY' refseq(physeq, errorIfNULL = TRUE) ## S4 method for signature 'XStringSet' refseq(physeq)
refseq(physeq, errorIfNULL=TRUE) ## S4 method for signature 'ANY' refseq(physeq, errorIfNULL = TRUE) ## S4 method for signature 'XStringSet' refseq(physeq)
physeq |
(Required). An instance of phyloseq-class that contains a phylogenetic tree. If physeq is a phylogenetic tree (a component data class), then it is returned as-is. |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
The phylo
-class object contained within physeq
;
or NULL if physeq
does not have a tree.
This method stops with an error in the latter NULL case be default, which
can be over-ridden by changing the value of errorIfNULL
to FALSE
.
otu_table
, sample_data
, tax_table
phy_tree
,
phyloseq
, merge_phyloseq
data(GlobalPatterns) refseq(GlobalPatterns, FALSE)
data(GlobalPatterns) refseq(GlobalPatterns, FALSE)
This is for removing overly-abundant outlier taxa, not for trimming low-abundance taxa.
rm_outlierf(f, na.rm=TRUE)
rm_outlierf(f, na.rm=TRUE)
f |
Single numeric value between 0 and 1. The maximum fractional abundance value that a taxa will be allowed to have in a sample without being marked for trimming. |
na.rm |
Logical. Should we remove NA values. Default |
A function (enclosure), suitable for filterfun_sample
.
topk
, topf
,
topp
, rm_outlierf
t1 <- 1:10; names(t1)<-paste("t", 1:10, sep="") rm_outlierf(0.15)(t1) ## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) taxa_sums(testOTU) f1 <- filterfun_sample(rm_outlierf(0.1)) (wh1 <- genefilter_sample(testOTU, f1, A=1)) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
t1 <- 1:10; names(t1)<-paste("t", 1:10, sep="") rm_outlierf(0.15)(t1) ## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) taxa_sums(testOTU) f1 <- filterfun_sample(rm_outlierf(0.1)) (wh1 <- genefilter_sample(testOTU, f1, A=1)) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
This is the suggested method for both constructing and accessing a table
of sample-level variables (sample_data-class
),
which in the phyloseq-package
is represented as a special
extension of the data.frame-class
.
When the
argument is a data.frame
, sample_data
will create
a sample_data-class object.
In this case, the rows should be named to match the
sample_names
of the other objects to which it will ultimately be paired.
Alternatively, if the first argument is an experiment-level (phyloseq-class
)
object, then the corresponding sample_data
is returned.
Like other accessors (see See Also, below), the default behavior of this method
is to stop with an
error if object
is a phyloseq-class
but does not
contain a sample_data
.
sample_data(object, errorIfNULL=TRUE) ## S4 method for signature 'ANY' sample_data(object, errorIfNULL = TRUE) ## S4 method for signature 'data.frame' sample_data(object)
sample_data(object, errorIfNULL=TRUE) ## S4 method for signature 'ANY' sample_data(object, errorIfNULL = TRUE) ## S4 method for signature 'data.frame' sample_data(object)
object |
(Required). A |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
A sample_data-class
object
representing the sample variates of an experiment.
phy_tree
, tax_table
, otu_table
phyloseq
, merge_phyloseq
# data(soilrep) head(sample_data(soilrep))
# data(soilrep) head(sample_data(soilrep))
Row indices represent samples, while column indices represent experimental categories, variables (and so forth) that describe the samples.
data-frame data, inherited from the data.frame class.
Also inherited from the data.frame class; it should contain the sample names.
Inherited from the data.frame class.
x
This replaces the current sample_data
component of x
with
value
, if value
is a sample_data-class
. However,
if value
is a data.frame
, then value
is first coerced to
a sample_data-class
, and then assigned. Alternatively, if
value
is phyloseq-class
, then the
sample_data
component will first be accessed from value
and then assigned. This makes possible some concise assignment/replacement
statements when adjusting, modifying, or building subsets of
experiment-level data. See some examples below.
Internally, this re-builds the phyloseq-class
object using
the standard phyloseq
constructor. Thus, index mismatches
between sample-describing components will not be allowed, and subsetting
will occurr automatically such that only the intersection of sample IDs
are included in any components. This has the added benefit of re-checking
(internally) for any other issues.
sample_data(x) <- value
sample_data(x) <- value
x |
(Required). |
value |
(Required). Either a |
No return. This is an assignment statement.
data(soilrep) soilrep head(sample_data(soilrep)) sample_data(soilrep)$Time <- as.integer(substr(sample_data(soilrep)$Sample, 1, 1)) head(sample_data(soilrep))
data(soilrep) soilrep head(sample_data(soilrep)) sample_data(soilrep)$Time <- as.integer(substr(sample_data(soilrep)$Sample, 1, 1)) head(sample_data(soilrep))
Get sample names.
sample_names(physeq) ## S4 method for signature 'ANY' sample_names(physeq) ## S4 method for signature 'phyloseq' sample_names(physeq) ## S4 method for signature 'sample_data' sample_names(physeq) ## S4 method for signature 'otu_table' sample_names(physeq)
sample_names(physeq) ## S4 method for signature 'ANY' sample_names(physeq) ## S4 method for signature 'phyloseq' sample_names(physeq) ## S4 method for signature 'sample_data' sample_names(physeq) ## S4 method for signature 'otu_table' sample_names(physeq)
physeq |
(Required). A |
A character vector. The names of the samples in physeq
.
data(esophagus) sample_names(esophagus)
data(esophagus) sample_names(esophagus)
Replace OTU identifier names
sample_names(x) <- value ## S4 replacement method for signature 'ANY,ANY' sample_names(x) <- value ## S4 replacement method for signature 'ANY,character' sample_names(x) <- value ## S4 replacement method for signature 'otu_table,character' sample_names(x) <- value ## S4 replacement method for signature 'sample_data,character' sample_names(x) <- value ## S4 replacement method for signature 'phyloseq,character' sample_names(x) <- value
sample_names(x) <- value ## S4 replacement method for signature 'ANY,ANY' sample_names(x) <- value ## S4 replacement method for signature 'ANY,character' sample_names(x) <- value ## S4 replacement method for signature 'otu_table,character' sample_names(x) <- value ## S4 replacement method for signature 'sample_data,character' sample_names(x) <- value ## S4 replacement method for signature 'phyloseq,character' sample_names(x) <- value
x |
(Required). An object defined by the |
value |
(Required). A character vector
to replace the current |
data("esophagus") sample_names(esophagus) # plot_tree(esophagus, color="sample_names", ladderize="left") sample_names(esophagus) <- paste("Sa-", sample_names(esophagus), sep="") sample_names(esophagus) # plot_tree(esophagus, color="sample_names", ladderize="left") ## non-characters are first coerced to characters. sample_names(esophagus) <- 1:nsamples(esophagus) sample_names(esophagus) # plot_tree(esophagus, color="sample_names", ladderize="left") ## Cannot assign non-unique or differently-lengthed name vectors. Error. # sample_names(esophagus) <- sample(c(TRUE, FALSE), nsamples(esophagus), TRUE) # sample_names(esophagus) <- sample(sample_names(esophagus), nsamples(esophagus)-1, FALSE)
data("esophagus") sample_names(esophagus) # plot_tree(esophagus, color="sample_names", ladderize="left") sample_names(esophagus) <- paste("Sa-", sample_names(esophagus), sep="") sample_names(esophagus) # plot_tree(esophagus, color="sample_names", ladderize="left") ## non-characters are first coerced to characters. sample_names(esophagus) <- 1:nsamples(esophagus) sample_names(esophagus) # plot_tree(esophagus, color="sample_names", ladderize="left") ## Cannot assign non-unique or differently-lengthed name vectors. Error. # sample_names(esophagus) <- sample(c(TRUE, FALSE), nsamples(esophagus), TRUE) # sample_names(esophagus) <- sample(sample_names(esophagus), nsamples(esophagus)-1, FALSE)
A convenience function equivalent to rowSums or colSums, but where the orientation of the otu_table is automatically handled.
sample_sums(x)
sample_sums(x)
x |
A named numeric-class
length equal to the number of samples
in the x
, name indicating the sample ID, and value equal to the sum of
all individuals observed for each sample in x
.
data(enterotype) sample_sums(enterotype) data(esophagus) sample_sums(esophagus)
data(enterotype) sample_sums(enterotype) data(esophagus) sample_sums(esophagus)
This is a simple accessor function to make it more convenient to determine
the sample variable names of a particular phyloseq-class
object.
sample_variables(physeq, errorIfNULL=TRUE)
sample_variables(physeq, errorIfNULL=TRUE)
physeq |
(Required). |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
Character vector. The names of the variables in the sample_data data.frame. Essentially the column names. Useful for selecting model and graphics parameters that interact with sample_data.
get_taxa
taxa_names
sample_names
data(enterotype) sample_variables(enterotype)
data(enterotype) sample_variables(enterotype)
This is a helper function to report back to the user the different cutoff values available in a given mothur file – for instance, a list or shared file.
show_mothur_cutoffs(mothur_list_file)
show_mothur_cutoffs(mothur_list_file)
mothur_list_file |
The file name and/or location as produced by mothur. |
A character vector of the different cutoff values contained in the file.
For a given set of arguments to the cluster()
command from within
mothur, a number of OTU-clustering results are returned in the same
file. The exact cutoff values used by mothur can vary depending
on the input data/parameters. This simple function returns the cutoffs that were actually
included in the mothur output. This an important extra step prior to
importing data with the import_mothur
function.
See the general documentation of show
method for
expected behavior.
## S4 method for signature 'otu_table' show(object) ## S4 method for signature 'sample_data' show(object) ## S4 method for signature 'taxonomyTable' show(object) ## S4 method for signature 'phyloseq' show(object)
## S4 method for signature 'otu_table' show(object) ## S4 method for signature 'sample_data' show(object) ## S4 method for signature 'taxonomyTable' show(object) ## S4 method for signature 'phyloseq' show(object)
object |
Any R object |
# data(GlobalPatterns) # show(GlobalPatterns) # GlobalPatterns
# data(GlobalPatterns) # show(GlobalPatterns) # GlobalPatterns
Easily retrieve a plot-derived data.frame
with a subset of points
according to a threshold and method. The meaning of the threshold depends
upon the method. See argument description below.
There are many useful examples of phyloseq ordination graphics in the
phyloseq online tutorials.
subset_ord_plot(p, threshold=0.05, method="farthest")
subset_ord_plot(p, threshold=0.05, method="farthest")
p |
(Required). A |
threshold |
(Optional). A numeric scalar. Default is |
method |
(Optional). A character string. One of
|
A data.frame
suitable for creating a
ggplot
plot object, graphically summarizing
the ordination result according to previously-specified parameters.
phyloseq online tutorial for this function.
## See the online tutorials. ## http://joey711.github.io/phyloseq/subset_ord_plot-examples
## See the online tutorials. ## http://joey711.github.io/phyloseq/subset_ord_plot-examples
This is a convenience wrapper around the subset
function.
It is intended to allow subsetting complex experimental objects with one
function call.
Subsetting is based on an expression for which the context first includes
the variables contained in sample_data
.
The samples
retained in the dataset is equivalent to
x[subset & !is.na(subset)]
, where x
is the vector of sample IDs
and subset
is the logical that results from your subsetting expression.
This is important to keep in mind, as users are often unaware that this
subsetting step also removes/omits samples that have a missing value, NA
,
somewhere in the expression.
subset_samples(physeq, ...)
subset_samples(physeq, ...)
physeq |
A |
... |
The subsetting expression that should be applied to the
|
A subsetted object with the same class as physeq
.
# data(GlobalPatterns) # subset_samples(GlobalPatterns, SampleType=="Ocean")
# data(GlobalPatterns) # subset_samples(GlobalPatterns, SampleType=="Ocean")
This is a convenience wrapper around the subset
function.
It is intended to speed subsetting complex experimental objects with one
function call. In the case of subset_taxa
, the subsetting will be
based on an expression related to the columns and values within the
tax_table
(taxonomyTable
component) slot of physeq
.
The OTUs
retained in the dataset is equivalent to
x[subset & !is.na(subset)]
, where x
is the vector of OTU IDs
and subset
is the logical that results from your subsetting expression.
This is important to keep in mind, as users are often unaware that this
subsetting step also removes/omits OTUs that have a missing value result, NA
,
somewhere in the expression.
subset_taxa(physeq, ...)
subset_taxa(physeq, ...)
physeq |
A |
... |
The subsetting expression that should be applied to the
|
A subsetted object with the same class as physeq
.
## ex3 <- subset_taxa(GlobalPatterns, Phylum=="Bacteroidetes")
## ex3 <- subset_taxa(GlobalPatterns, Phylum=="Bacteroidetes")
otu_table-class
or phyloseq-class
Extends the base transpose method, t
.
t(x) ## S4 method for signature 'otu_table' t(x) ## S4 method for signature 'phyloseq' t(x)
t(x) ## S4 method for signature 'otu_table' t(x) ## S4 method for signature 'phyloseq' t(x)
x |
An |
The class of the object returned by t
matches
the class of the argument, x
. The otu_table
is
transposed, and taxa_are_rows
value is toggled.
data(GlobalPatterns) otu_table(GlobalPatterns) t( otu_table(GlobalPatterns) )
data(GlobalPatterns) otu_table(GlobalPatterns) t( otu_table(GlobalPatterns) )
This method merges species that have the same taxonomy at a certain
taxonomic rank.
Its approach is analogous to tip_glom
, but uses categorical data
instead of a tree. In principal, other categorical data known for all taxa
could also be used in place of taxonomy,
but for the moment, this must be stored in the taxonomyTable
of the data. Also, columns/ranks to the right of the rank chosen to use
for agglomeration will be replaced with NA
,
because they should be meaningless following agglomeration.
tax_glom(physeq, taxrank=rank_names(physeq)[1], NArm=TRUE, bad_empty=c(NA, "", " ", "\t"))
tax_glom(physeq, taxrank=rank_names(physeq)[1], NArm=TRUE, bad_empty=c(NA, "", " ", "\t"))
physeq |
(Required). |
taxrank |
A character string specifying the taxonomic level
that you want to agglomerate over.
Should be among the results of |
NArm |
(Optional). Logical, length equal to one. Default is |
bad_empty |
(Optional). Character vector. Default: |
A taxonomically-agglomerated, optionally-pruned, object with class matching
the class of physeq
.
# data(GlobalPatterns) # ## print the available taxonomic ranks # colnames(tax_table(GlobalPatterns)) # ## agglomerate at the Family taxonomic rank # (x1 <- tax_glom(GlobalPatterns, taxrank="Family") ) # ## How many taxa before/after agglomeration? # ntaxa(GlobalPatterns); ntaxa(x1) # ## Look at enterotype dataset... # data(enterotype) # ## print the available taxonomic ranks. Shows only 1 rank available, not useful for tax_glom # colnames(tax_table(enterotype))
# data(GlobalPatterns) # ## print the available taxonomic ranks # colnames(tax_table(GlobalPatterns)) # ## agglomerate at the Family taxonomic rank # (x1 <- tax_glom(GlobalPatterns, taxrank="Family") ) # ## How many taxa before/after agglomeration? # ntaxa(GlobalPatterns); ntaxa(x1) # ## Look at enterotype dataset... # data(enterotype) # ## print the available taxonomic ranks. Shows only 1 rank available, not useful for tax_glom # colnames(tax_table(enterotype))
This is the suggested method for both constructing and accessing a table of
taxonomic names, organized with ranks as columns (taxonomyTable-class
).
When the argument is a character matrix, tax_table() will create and return a
taxonomyTable-class
object.
In this case, the rows should be named to match the
species.names
of the other objects to which it will ultimately be paired.
Alternatively, if the first argument is an experiment-level (phyloseq-class
)
object, then the corresponding taxonomyTable
is returned.
Like other accessors (see See Also, below), the default behavior of this method
is to stop with an
error if object
is a phyloseq-class
but does not
contain a taxonomyTable
.
tax_table(object, errorIfNULL=TRUE) ## S4 method for signature 'ANY' tax_table(object, errorIfNULL = TRUE) ## S4 method for signature 'matrix' tax_table(object) ## S4 method for signature 'data.frame' tax_table(object)
tax_table(object, errorIfNULL=TRUE) ## S4 method for signature 'ANY' tax_table(object, errorIfNULL = TRUE) ## S4 method for signature 'matrix' tax_table(object) ## S4 method for signature 'data.frame' tax_table(object)
object |
An object among the set of classes defined by the phyloseq package that contain taxonomyTable. |
errorIfNULL |
(Optional). Logical. Should the accessor stop with
an error if the slot is empty ( |
A taxonomyTable-class
object.
It is either grabbed from the relevant slot
if object
is complex, or built anew if object
is a
character matrix representing the taxonomic classification of
species in the experiment.
phy_tree
, sample_data
, otu_table
phyloseq
, merge_phyloseq
# # tax1 <- tax_table(matrix("abc", 30, 8)) # data(GlobalPatterns) # tax_table(GlobalPatterns)
# # tax1 <- tax_table(matrix("abc", 30, 8)) # data(GlobalPatterns) # tax_table(GlobalPatterns)
x
Assign a (new) Taxonomy Table to x
tax_table(x) <- value ## S4 replacement method for signature 'phyloseq,taxonomyTable' tax_table(x) <- value ## S4 replacement method for signature 'phyloseq,ANY' tax_table(x) <- value ## S4 replacement method for signature 'taxonomyTable,taxonomyTable' tax_table(x) <- value ## S4 replacement method for signature 'taxonomyTable,ANY' tax_table(x) <- value
tax_table(x) <- value ## S4 replacement method for signature 'phyloseq,taxonomyTable' tax_table(x) <- value ## S4 replacement method for signature 'phyloseq,ANY' tax_table(x) <- value ## S4 replacement method for signature 'taxonomyTable,taxonomyTable' tax_table(x) <- value ## S4 replacement method for signature 'taxonomyTable,ANY' tax_table(x) <- value
x |
(Required). |
value |
(Required). |
# data(GlobalPatterns) # # An example of pruning to just the first 100 taxa in GlobalPatterns. # ex2a <- prune_taxa(taxa_names(GlobalPatterns)[1:100], GlobalPatterns) # # The following 3 lines produces an ex2b that is equal to ex2a # ex2b <- GlobalPatterns # TT <- tax_table(GlobalPatterns)[1:100, ] # tax_table(ex2b) <- TT # identical(ex2a, ex2b) # print(ex2b) # # 2 examples adding a tax_table component from phyloseq or matrix classes # ex2c <- phyloseq(otu_table(ex2b), sample_data(ex2b), phy_tree(ex2b)) # tax_table(ex2c) <- ex2b # identical(ex2a, ex2c) # ex2c <- phyloseq(otu_table(ex2b), sample_data(ex2b), phy_tree(ex2b)) # tax_table(ex2c) <- as(tax_table(ex2b), "matrix") # identical(ex2a, ex2c)
# data(GlobalPatterns) # # An example of pruning to just the first 100 taxa in GlobalPatterns. # ex2a <- prune_taxa(taxa_names(GlobalPatterns)[1:100], GlobalPatterns) # # The following 3 lines produces an ex2b that is equal to ex2a # ex2b <- GlobalPatterns # TT <- tax_table(GlobalPatterns)[1:100, ] # tax_table(ex2b) <- TT # identical(ex2a, ex2b) # print(ex2b) # # 2 examples adding a tax_table component from phyloseq or matrix classes # ex2c <- phyloseq(otu_table(ex2b), sample_data(ex2b), phy_tree(ex2b)) # tax_table(ex2c) <- ex2b # identical(ex2a, ex2c) # ex2c <- phyloseq(otu_table(ex2b), sample_data(ex2b), phy_tree(ex2b)) # tax_table(ex2c) <- as(tax_table(ex2b), "matrix") # identical(ex2a, ex2c)
Access taxa_are_rows slot from otu_table objects.
taxa_are_rows(physeq) ## S4 method for signature 'ANY' taxa_are_rows(physeq) ## S4 method for signature 'otu_table' taxa_are_rows(physeq) ## S4 method for signature 'phyloseq' taxa_are_rows(physeq)
taxa_are_rows(physeq) ## S4 method for signature 'ANY' taxa_are_rows(physeq) ## S4 method for signature 'otu_table' taxa_are_rows(physeq) ## S4 method for signature 'phyloseq' taxa_are_rows(physeq)
physeq |
(Required). |
A logical indicating the orientation of the otu_table.
The taxa_are_rows slot is a logical indicating the orientation of the
abundance table contained in object x
.
taxa_are_rows(x) <- value ## S4 replacement method for signature 'otu_table,logical' taxa_are_rows(x) <- value ## S4 replacement method for signature 'phyloseq,logical' taxa_are_rows(x) <- value
taxa_are_rows(x) <- value ## S4 replacement method for signature 'otu_table,logical' taxa_are_rows(x) <- value ## S4 replacement method for signature 'phyloseq,logical' taxa_are_rows(x) <- value
x |
|
value |
A logical of length equal to 1. If |
data(esophagus) taxa_are_rows(esophagus) taxa_are_rows(otu_table(esophagus))
data(esophagus) taxa_are_rows(esophagus) taxa_are_rows(otu_table(esophagus))
Get species / taxa names.
taxa_names(physeq) ## S4 method for signature 'ANY' taxa_names(physeq) ## S4 method for signature 'phyloseq' taxa_names(physeq) ## S4 method for signature 'otu_table' taxa_names(physeq) ## S4 method for signature 'taxonomyTable' taxa_names(physeq) ## S4 method for signature 'sample_data' taxa_names(physeq) ## S4 method for signature 'phylo' taxa_names(physeq) ## S4 method for signature 'XStringSet' taxa_names(physeq)
taxa_names(physeq) ## S4 method for signature 'ANY' taxa_names(physeq) ## S4 method for signature 'phyloseq' taxa_names(physeq) ## S4 method for signature 'otu_table' taxa_names(physeq) ## S4 method for signature 'taxonomyTable' taxa_names(physeq) ## S4 method for signature 'sample_data' taxa_names(physeq) ## S4 method for signature 'phylo' taxa_names(physeq) ## S4 method for signature 'XStringSet' taxa_names(physeq)
physeq |
|
A character vector of the names of the species in physeq
.
ntaxa
# data("esophagus") tree <- phy_tree(esophagus) OTU1 <- otu_table(esophagus) taxa_names(tree) taxa_names(OTU1) physeq1 <- phyloseq(OTU1, tree) taxa_names(physeq1)
# data("esophagus") tree <- phy_tree(esophagus) OTU1 <- otu_table(esophagus) taxa_names(tree) taxa_names(OTU1) physeq1 <- phyloseq(OTU1, tree) taxa_names(physeq1)
Replace OTU identifier names
taxa_names(x) <- value ## S4 replacement method for signature 'ANY,ANY' taxa_names(x) <- value ## S4 replacement method for signature 'ANY,character' taxa_names(x) <- value ## S4 replacement method for signature 'otu_table,character' taxa_names(x) <- value ## S4 replacement method for signature 'taxonomyTable,character' taxa_names(x) <- value ## S4 replacement method for signature 'phylo,character' taxa_names(x) <- value ## S4 replacement method for signature 'XStringSet,character' taxa_names(x) <- value ## S4 replacement method for signature 'phyloseq,character' taxa_names(x) <- value
taxa_names(x) <- value ## S4 replacement method for signature 'ANY,ANY' taxa_names(x) <- value ## S4 replacement method for signature 'ANY,character' taxa_names(x) <- value ## S4 replacement method for signature 'otu_table,character' taxa_names(x) <- value ## S4 replacement method for signature 'taxonomyTable,character' taxa_names(x) <- value ## S4 replacement method for signature 'phylo,character' taxa_names(x) <- value ## S4 replacement method for signature 'XStringSet,character' taxa_names(x) <- value ## S4 replacement method for signature 'phyloseq,character' taxa_names(x) <- value
x |
(Required). An object defined by the |
value |
(Required). A character vector
to replace the current |
data("esophagus") taxa_names(esophagus) # plot_tree(esophagus, label.tips="taxa_names", ladderize="left") taxa_names(esophagus) <- paste("OTU-", taxa_names(esophagus), sep="") taxa_names(esophagus) # plot_tree(esophagus, label.tips="taxa_names", ladderize="left") ## non-characters are first coerced to characters. taxa_names(esophagus) <- 1:ntaxa(esophagus) taxa_names(esophagus) # plot_tree(esophagus, label.tips="taxa_names", ladderize="left") ## Cannot assign non-unique or differently-lengthed name vectors. Error. # taxa_names(esophagus) <- sample(c(TRUE, FALSE), ntaxa(esophagus), TRUE) # taxa_names(esophagus) <- sample(taxa_names(esophagus), ntaxa(esophagus)-5, FALSE)
data("esophagus") taxa_names(esophagus) # plot_tree(esophagus, label.tips="taxa_names", ladderize="left") taxa_names(esophagus) <- paste("OTU-", taxa_names(esophagus), sep="") taxa_names(esophagus) # plot_tree(esophagus, label.tips="taxa_names", ladderize="left") ## non-characters are first coerced to characters. taxa_names(esophagus) <- 1:ntaxa(esophagus) taxa_names(esophagus) # plot_tree(esophagus, label.tips="taxa_names", ladderize="left") ## Cannot assign non-unique or differently-lengthed name vectors. Error. # taxa_names(esophagus) <- sample(c(TRUE, FALSE), ntaxa(esophagus), TRUE) # taxa_names(esophagus) <- sample(taxa_names(esophagus), ntaxa(esophagus)-5, FALSE)
A convenience function equivalent to rowSums or colSums, but where the orientation of the otu_table is automatically handled.
taxa_sums(x)
taxa_sums(x)
x |
A numeric-class
with length equal to the number of species
in the table, name indicated the taxa ID, and value equal to the sum of
all individuals observed for each taxa in x
.
data(enterotype) taxa_sums(enterotype) data(esophagus) taxa_sums(esophagus)
data(enterotype) taxa_sums(enterotype) data(esophagus) taxa_sums(esophagus)
Row indices represent taxa, columns represent taxonomic classifiers.
This slot is inherited from the matrix
class.
The lowest thresh
values in x
all get the value 'thresh'.
threshrank(x, thresh, keep0s=FALSE, ...)
threshrank(x, thresh, keep0s=FALSE, ...)
x |
(Required). Numeric vector to transform. |
thresh |
A single numeric value giving the threshold. |
keep0s |
A logical determining whether 0's in |
... |
Further arguments passes to the |
A ranked, (optionally) thresholded numeric vector with length equal to
x
. Default arguments to rank
are used, unless provided as
additional arguments.
transform_sample_counts
, rank
, threshrankfun
# (a_vector <- sample(0:10, 100, TRUE)) threshrank(a_vector, 5, keep0s=TRUE) data(GlobalPatterns) GP <- GlobalPatterns ## These three approaches result in identical otu_table (x1 <- transform_sample_counts( otu_table(GP), threshrankfun(500)) ) (x2 <- otu_table(apply(otu_table(GP), 2, threshrankfun(500)), taxa_are_rows(GP)) ) identical(x1, x2) (x3 <- otu_table(apply(otu_table(GP), 2, threshrank, thresh=500), taxa_are_rows(GP)) ) identical(x1, x3)
# (a_vector <- sample(0:10, 100, TRUE)) threshrank(a_vector, 5, keep0s=TRUE) data(GlobalPatterns) GP <- GlobalPatterns ## These three approaches result in identical otu_table (x1 <- transform_sample_counts( otu_table(GP), threshrankfun(500)) ) (x2 <- otu_table(apply(otu_table(GP), 2, threshrankfun(500)), taxa_are_rows(GP)) ) identical(x1, x2) (x3 <- otu_table(apply(otu_table(GP), 2, threshrank, thresh=500), taxa_are_rows(GP)) ) identical(x1, x3)
threshrank
function.Takes the same arguments as threshrank
, except for x
,
because the output is a single-argument function rather than a rank-transformed numeric.
This is useful for higher-order functions that require a single-argument function as input,
like transform_sample_counts
.
threshrankfun(thresh, keep0s=FALSE, ...)
threshrankfun(thresh, keep0s=FALSE, ...)
thresh |
A single numeric value giving the threshold. |
keep0s |
A logical determining whether 0's in |
... |
Further arguments passes to the |
A single-argument function with the options to threshrank
set.
transform_sample_counts
, threshrankfun
,
threshrank
data(esophagus) x1 = transform_sample_counts(esophagus, threshrankfun(50)) otu_table(x1) x2 = transform_sample_counts(esophagus, rank) otu_table(x2) identical(x1, x2)
data(esophagus) x1 = transform_sample_counts(esophagus, threshrankfun(50)) otu_table(x1) x2 = transform_sample_counts(esophagus, rank) otu_table(x2) identical(x1, x2)
All tips of the tree separated by a cophenetic distance smaller than
h
will be agglomerated into one taxa using merge_taxa
.
tip_glom(physeq, h = 0.2, hcfun = agnes, ...)
tip_glom(physeq, h = 0.2, hcfun = agnes, ...)
physeq |
(Required). A |
h |
(Optional). Numeric scalar of the height where the tree should be cut.
This refers to the tree resulting from hierarchical clustering
of |
hcfun |
(Optional). A function.
The (agglomerative, hierarchical) clustering function to use.
Good examples are
|
... |
(Optional). Additional named arguments to pass
to |
Can be used to create a non-trivial OTU Table, if a phylogenetic tree is available.
For now, a simple, “greedy”, single-linkage clustering is used. In future releases
it should be possible to specify different clustering approaches available in R
,
in particular, complete-linkage clustering appears to be used more commonly for OTU
clustering applications.
An instance of the phyloseq-class
.
Or alternatively, a phylo
object if the
physeq
argument was just a tree.
In the expected-use case, the number of OTUs will be fewer
(see ntaxa
),
after merging OTUs that are related enough to be called
the same OTU.
data("esophagus") # for speed esophagus = prune_taxa(taxa_names(esophagus)[1:25], esophagus) plot_tree(esophagus, label.tips="taxa_names", size="abundance", title="Before tip_glom()") plot_tree(tip_glom(esophagus, h=0.2), label.tips="taxa_names", size="abundance", title="After tip_glom()")
data("esophagus") # for speed esophagus = prune_taxa(taxa_names(esophagus)[1:25], esophagus) plot_tree(esophagus, label.tips="taxa_names", size="abundance", title="Before tip_glom()") plot_tree(tip_glom(esophagus, h=0.2), label.tips="taxa_names", size="abundance", title="After tip_glom()")
As opposed to topp
, which gives the
most abundant p fraction of observed taxa (richness, instead of cumulative
abundance. Said another way, topf ensures a certain
fraction of the total sequences are retained, while topp ensures
that a certain fraction of taxa/species/OTUs are retained.
topf(f, na.rm=TRUE)
topf(f, na.rm=TRUE)
f |
Single numeric value between 0 and 1. |
na.rm |
Logical. Should we remove NA values. Default |
A function (enclosure), suitable for filterfun_sample
,
that will return TRUE
for each element in the taxa comprising the most abundant f fraction of individuals.
topk
, topf
,
topp
, rm_outlierf
t1 <- 1:10; names(t1)<-paste("t", 1:10, sep="") topf(0.6)(t1) ## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) f1 <- filterfun_sample(topf(0.4)) (wh1 <- genefilter_sample(testOTU, f1, A=1)) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
t1 <- 1:10; names(t1)<-paste("t", 1:10, sep="") topf(0.6)(t1) ## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) f1 <- filterfun_sample(topf(0.4)) (wh1 <- genefilter_sample(testOTU, f1, A=1)) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
k
taxaMake filter fun. the most abundant k
taxa
topk(k, na.rm=TRUE)
topk(k, na.rm=TRUE)
k |
An integer, indicating how many of the most abundant taxa should be kept. |
na.rm |
A logical. Should |
Returns a function (enclosure) that will return TRUE for each element in the most abundant k values.
topk
, topf
,
topp
, rm_outlierf
## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) f1 <- filterfun_sample(topk(2)) wh1 <- genefilter_sample(testOTU, f1, A=2) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) f1 <- filterfun_sample(topk(2)) wh1 <- genefilter_sample(testOTU, f1, A=2) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
p
fraction of taxaMake filter fun. that returns the most abundant p
fraction of taxa
topp(p, na.rm=TRUE)
topp(p, na.rm=TRUE)
p |
A numeric of length 1, indicating what fraction of the most abundant taxa should be kept. |
na.rm |
A logical. Should |
A function (enclosure), suitable for filterfun_sample
,
that will return TRUE
for each element in the most abundant p fraction of taxa.
topk
, topf
,
topp
, rm_outlierf
## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) sample_sums(testOTU) f1 <- filterfun_sample(topp(0.2)) (wh1 <- genefilter_sample(testOTU, f1, A=1)) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
## Use simulated abundance matrix set.seed(711) testOTU <- otu_table(matrix(sample(1:50, 25, replace=TRUE), 5, 5), taxa_are_rows=FALSE) sample_sums(testOTU) f1 <- filterfun_sample(topp(0.2)) (wh1 <- genefilter_sample(testOTU, f1, A=1)) wh2 <- c(TRUE, TRUE, TRUE, FALSE, FALSE) prune_taxa(wh1, testOTU) prune_taxa(wh2, testOTU)
otu_table
, sample-by-sample.This function transforms the sample counts of a taxa abundance matrix according to a user-provided function. The counts of each sample will be transformed individually. No sample-sample interaction/comparison is possible by this method.
transform_sample_counts(physeq, fun, ...) transformSampleCounts(physeq, fun, ...)
transform_sample_counts(physeq, fun, ...) transformSampleCounts(physeq, fun, ...)
physeq |
(Required). |
fun |
(Required). A single-argument function that will be applied
to the abundance counts of each sample.
Can be an anonymous |
... |
(Optional). Additional, optionally-named, arguments passed to
|
A transformed otu_table
– or phyloseq
object with its
transformed otu_table
.
In general, trimming is not expected by this
method, so it is suggested that the user provide only functions that return
a full-length vector. Filtering/trimming can follow, for which the
genefilter_sample
and prune_taxa
functions
are suggested.
# data(esophagus) x1 = transform_sample_counts(esophagus, threshrankfun(50)) head(otu_table(x1), 10) x2 = transform_sample_counts(esophagus, rank) head(otu_table(x2), 10) identical(x1, x2) x3 = otu_table(esophagus) + 5 x3 = transform_sample_counts(x3, log) head(otu_table(x3), 10) x4 = transform_sample_counts(esophagus, function(x) round(x^2.2, 0)) head(otu_table(x4), 10)
# data(esophagus) x1 = transform_sample_counts(esophagus, threshrankfun(50)) head(otu_table(x1), 10) x2 = transform_sample_counts(esophagus, rank) head(otu_table(x2), 10) identical(x1, x2) x3 = otu_table(esophagus) + 5 x3 = transform_sample_counts(x3, log) head(otu_table(x3), 10) x4 = transform_sample_counts(esophagus, function(x) round(x^2.2, 0)) head(otu_table(x4), 10)
This function takes a phylo
or phyloseq-class
object
and returns a list of two data.table
s suitable for plotting
a phylogenetic tree with ggplot
2.
tree_layout(phy, ladderize = FALSE)
tree_layout(phy, ladderize = FALSE)
phy |
(Required). The |
ladderize |
(Optional). Boolean or character string (either
|
A list of two data.table
s, containing respectively
a data.table
of edge segment coordinates, named edgeDT
,
and a data.table
of vertical connecting segments, named vertDT
.
See example
below for a simple demonstration.
An early example of this functionality was borrowed directly, with permission,
from the package called ggphylo
,
released on GitHub at:
https://github.com/gjuggler/ggphylo
by its author Gregory Jordan [email protected].
That original phyloseq internal function, tree.layout
, has been
completely replaced by this smaller and much faster user-accessible
function that utilizes performance enhancements from standard
data.table
magic as well as ape-package
internal C code.
library("ggplot2") data("esophagus") phy = phy_tree(esophagus) phy <- ape::root(phy, "65_2_5", resolve.root=TRUE) treeSegs0 = tree_layout(phy) treeSegs1 = tree_layout(esophagus) edgeMap = aes(x=xleft, xend=xright, y=y, yend=y) vertMap = aes(x=x, xend=x, y=vmin, yend=vmax) p0 = ggplot(treeSegs0$edgeDT, edgeMap) + geom_segment() + geom_segment(vertMap, data=treeSegs0$vertDT) p1 = ggplot(treeSegs1$edgeDT, edgeMap) + geom_segment() + geom_segment(vertMap, data=treeSegs1$vertDT) print(p0) print(p1) plot_tree(esophagus, "treeonly") plot_tree(esophagus, "treeonly", ladderize="left")
library("ggplot2") data("esophagus") phy = phy_tree(esophagus) phy <- ape::root(phy, "65_2_5", resolve.root=TRUE) treeSegs0 = tree_layout(phy) treeSegs1 = tree_layout(esophagus) edgeMap = aes(x=xleft, xend=xright, y=y, yend=y) vertMap = aes(x=x, xend=x, y=vmin, yend=vmax) p0 = ggplot(treeSegs0$edgeDT, edgeMap) + geom_segment() + geom_segment(vertMap, data=treeSegs0$vertDT) p1 = ggplot(treeSegs1$edgeDT, edgeMap) + geom_segment() + geom_segment(vertMap, data=treeSegs1$vertDT) print(p0) print(p1) plot_tree(esophagus, "treeonly") plot_tree(esophagus, "treeonly", ladderize="left")
This function calculates the (Fast) UniFrac distance for all sample-pairs
in a phyloseq-class
object.
UniFrac(physeq, weighted=FALSE, normalized=TRUE, parallel=FALSE, fast=TRUE) ## S4 method for signature 'phyloseq' UniFrac(physeq, weighted = FALSE, normalized = TRUE, parallel = FALSE, fast = TRUE)
UniFrac(physeq, weighted=FALSE, normalized=TRUE, parallel=FALSE, fast=TRUE) ## S4 method for signature 'phyloseq' UniFrac(physeq, weighted = FALSE, normalized = TRUE, parallel = FALSE, fast = TRUE)
physeq |
(Required). |
weighted |
(Optional). Logical. Should use weighted-UniFrac calculation?
Weighted-UniFrac takes into account the relative abundance of species/taxa
shared between samples, whereas unweighted-UniFrac only considers
presence/absence. Default is |
normalized |
(Optional). Logical. Should the output be normalized such that values
range from 0 to 1 independent of branch length values? Default is |
parallel |
(Optional). Logical. Should execute calculation in parallel,
using multiple CPU cores simultaneously? This can dramatically hasten the
computation time for this function. However, it also requires that the user
has registered a parallel “backend” prior to calling this function.
Default is |
fast |
(Optional). Logical. DEPRECATED.
Do you want to use the “Fast UniFrac”
algorithm? Implemented natively in the |
UniFrac()
accesses the abundance
(otu_table-class
) and a phylogenetic tree (phylo-class
)
data within an experiment-level (phyloseq-class
) object.
If the tree and contingency table are separate objects, suggested solution
is to combine them into an experiment-level class
using the phyloseq
function. For example, the following code
phyloseq(myotu_table, myTree)
returns a phyloseq
-class object that has been pruned and comprises
the minimum arguments necessary for UniFrac()
.
Parallelization is possible for UniFrac calculated with the phyloseq-package
,
and is encouraged in the instances of large trees, many samples, or both.
Parallelization has been implemented via the foreach-package
.
This means that parallel calls need to be preceded by 2 or more commands
that register the parallel “backend”. This is acheived via your choice of
helper packages. One of the simplest seems to be the doParallel package.
For more information, see the following links on registering the “backend”:
foreach package manual:
http://cran.r-project.org/web/packages/foreach/index.html
Notes on parallel computing in R
. Skip to the section describing
the foreach Framework. It gives off-the-shelf examples for registering
a parallel backend using the doMC, doSNOW, or doMPI packages:
http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf
Furthermore, as of R
version 2.14.0
and higher, a parallel package
is included as part of the core installation, parallel-package
,
and this can be used as the parallel backend with the foreach-package
using the adaptor package “doParallel”.
http://cran.r-project.org/web/packages/doParallel/index.html
See the vignette for some simple examples for using doParallel. http://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
UniFrac-specific examples for doParallel are provided in the example code below.
a sample-by-sample distance matrix, suitable for NMDS, etc.
http://bmf.colorado.edu/unifrac/
The main implementation (Fast UniFrac) is adapted from the algorithm's description in:
Hamady, Lozupone, and Knight, “Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data.” The ISME Journal (2010) 4, 17–27.
See also additional descriptions of UniFrac in the following articles:
Lozupone, Hamady and Knight, “UniFrac - An Online Tool for Comparing Microbial Community Diversity in a Phylogenetic Context.”, BMC Bioinformatics 2006, 7:371
Lozupone, Hamady, Kelley and Knight, “Quantitative and qualitative (beta) diversity measures lead to different insights into factors that structure microbial communities.” Appl Environ Microbiol. 2007
Lozupone C, Knight R. “UniFrac: a new phylogenetic method for comparing microbial communities.” Appl Environ Microbiol. 2005 71 (12):8228-35.
unifrac
in the picante package.
################################################################################ # Perform UniFrac on esophagus data ################################################################################ data("esophagus") (y <- UniFrac(esophagus, TRUE)) UniFrac(esophagus, TRUE, FALSE) UniFrac(esophagus, FALSE) # ################################################################################ # # Now try a parallel implementation using doParallel, which leverages the # # new 'parallel' core package in R 2.14.0+ # # Note that simply loading the 'doParallel' package is not enough, you must # # call a function that registers the backend. In general, this is pretty easy # # with the 'doParallel package' (or one of the alternative 'do*' packages) # # # # Also note that the esophagus example has only 3 samples, and a relatively small # # tree. This is fast to calculate even sequentially and does not warrant # # parallelized computation, but provides a good quick example for using UniFrac() # # in a parallel fashion. The number of cores you should specify during the # # backend registration, using registerDoParallel(), depends on your system and # # needs. 3 is chosen here for convenience. If your system has only 2 cores, this # # will probably fault or run slower than necessary. # ################################################################################ # library(doParallel) # data(esophagus) # # For SNOW-like functionality (works on Windows): # cl <- makeCluster(3) # registerDoParallel(cl) # UniFrac(esophagus, TRUE) # # Force to sequential backed: # registerDoSEQ() # # For multicore-like functionality (will probably not work on windows), # # register the backend like this: # registerDoParallel(cores=3) # UniFrac(esophagus, TRUE) ################################################################################
################################################################################ # Perform UniFrac on esophagus data ################################################################################ data("esophagus") (y <- UniFrac(esophagus, TRUE)) UniFrac(esophagus, TRUE, FALSE) UniFrac(esophagus, FALSE) # ################################################################################ # # Now try a parallel implementation using doParallel, which leverages the # # new 'parallel' core package in R 2.14.0+ # # Note that simply loading the 'doParallel' package is not enough, you must # # call a function that registers the backend. In general, this is pretty easy # # with the 'doParallel package' (or one of the alternative 'do*' packages) # # # # Also note that the esophagus example has only 3 samples, and a relatively small # # tree. This is fast to calculate even sequentially and does not warrant # # parallelized computation, but provides a good quick example for using UniFrac() # # in a parallel fashion. The number of cores you should specify during the # # backend registration, using registerDoParallel(), depends on your system and # # needs. 3 is chosen here for convenience. If your system has only 2 cores, this # # will probably fault or run slower than necessary. # ################################################################################ # library(doParallel) # data(esophagus) # # For SNOW-like functionality (works on Windows): # cl <- makeCluster(3) # registerDoParallel(cl) # UniFrac(esophagus, TRUE) # # Force to sequential backed: # registerDoSEQ() # # For multicore-like functionality (will probably not work on windows), # # register the backend like this: # registerDoParallel(cores=3) # UniFrac(esophagus, TRUE) ################################################################################