mia
implements tools
for microbiome analysis based on the SummarizedExperiment
(Morgan et al. 2020),
SingleCellExperiment
(Amezquita et
al. 2020) and TreeSummarizedExperiment
(Huang 2021) infrastructure. Data wrangling and
analysis are the main scope of this package.
To install mia
, install BiocManager
first,
if it is not installed. Afterwards use the install
function
from BiocManager
.
TreeSummarizedExperiment
objectA few example datasets are available via mia
. For this
vignette the GlobalPatterns
dataset is loaded first.
## class: TreeSummarizedExperiment
## dim: 19216 26
## metadata(0):
## assays(1): counts
## rownames(19216): 549322 522457 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
One of the main topics for analysing microbiome data is the application of taxonomic data to describe features measured. The interest lies in the connection between individual bacterial species and their relation to each other.
mia
does not rely on a specific object type to hold
taxonomic data, but uses specific columns in the rowData
of
a TreeSummarizedExperiment
object.
taxonomyRanks
can be used to construct a
character
vector of available taxonomic levels. This can be
used, for example, for subsetting.
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## DataFrame with 19216 rows and 7 columns
## Kingdom Phylum Class Order Family
## <character> <character> <character> <character> <character>
## 549322 Archaea Crenarchaeota Thermoprotei NA NA
## 522457 Archaea Crenarchaeota Thermoprotei NA NA
## 951 Archaea Crenarchaeota Thermoprotei Sulfolobales Sulfolobaceae
## 244423 Archaea Crenarchaeota Sd-NA NA NA
## 586076 Archaea Crenarchaeota Sd-NA NA NA
## ... ... ... ... ... ...
## 278222 Bacteria SR1 NA NA NA
## 463590 Bacteria SR1 NA NA NA
## 535321 Bacteria SR1 NA NA NA
## 200359 Bacteria SR1 NA NA NA
## 271582 Bacteria SR1 NA NA NA
## Genus Species
## <character> <character>
## 549322 NA NA
## 522457 NA NA
## 951 Sulfolobus Sulfolobusacidocalda..
## 244423 NA NA
## 586076 NA NA
## ... ... ...
## 278222 NA NA
## 463590 NA NA
## 535321 NA NA
## 200359 NA NA
## 271582 NA NA
The columns are recognized case insensitive. Additional functions are available to check for validity of taxonomic information or generate labels based on the taxonomic information.
##
## FALSE TRUE
## 1413 17803
## [1] "Class:Thermoprotei" "Class:Thermoprotei_1"
## [3] "Species:Sulfolobusacidocaldarius" "Class:Sd-NA"
## [5] "Class:Sd-NA_1" "Class:Sd-NA_2"
For more details see the man page ?taxonomyRanks
.
Agglomeration of data based on these taxonomic descriptors can be
performed using functions implemented in mia
. In addition
to the aggValue
functions provide by
TreeSummarizedExperiment
agglomerateByRank
is
available. agglomerateByRank
does not require tree data to
be present.
agglomerateByRank
constructs a factor
to
guide merging from the available taxonomic information. For more
information on merging have a look at the man page via
?mergeFeatures
.
# agglomerate at the Family taxonomic rank
x1 <- agglomerateByRank(tse, rank = "Family")
## How many taxa before/after agglomeration?
nrow(tse)
## [1] 19216
## [1] 341
Tree data can also be shrunk alongside agglomeration, but this is turned of by default.
# with agglomeration of the tree
x2 <- agglomerateByRank(tse, rank = "Family",
agglomerateTree = TRUE)
nrow(x2) # same number of rows, but
## [1] 341
##
## Phylogenetic tree with 19216 tips and 19215 internal nodes.
##
## Tip labels:
## 549322, 522457, 951, 244423, 586076, 246140, ...
## Node labels:
## , 0.858.4, 1.000.154, 0.764.3, 0.995.2, 1.000.2, ...
##
## Rooted; includes branch length(s).
##
## Phylogenetic tree with 341 tips and 340 internal nodes.
##
## Tip labels:
## 951, 215972, 138353, 546313, 173903, 202347, ...
## Node labels:
## , 0.858.4, 0.764.3, 0.985.6, 1.000.112, 0.978.18, ...
##
## Rooted; includes branch length(s).
For agglomerateByRank
to work, taxonomic data must be
present. Even though only one rank is available for the
enterotype
dataset, agglomeration can be performed
effectively de-duplicating entries for the genus level.
## [1] "Genus"
## class: TreeSummarizedExperiment
## dim: 551 280
## metadata(0):
## assays(1): counts
## rownames(551): Prosthecochloris Chloroflexus ... Syntrophococcus
## Mogibacterium
## rowData names(1): Genus
## colnames(280): AM.AD.1 AM.AD.2 ... TS98_V2 TS99.2_V2
## colData names(9): Enterotype Sample_ID ... Age ClinicalStatus
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
To keep data tidy, the agglomerated data can be stored as an alternative experiment in the object of origin. With this synchronized sample subsetting becomes very easy.
Keep in mind, that if you set empty.rm = TRUE
, rows with
NA
or similar value (defined via the
empty.fields
argument) will be removed. Depending on these
settings different number of rows will be returned.
x1 <- agglomerateByRank(tse, rank = "Species", empty.rm = TRUE)
altExp(tse,"species") <- agglomerateByRank(tse, rank = "Species", empty.rm = FALSE)
dim(x1)
## [1] 944 26
## [1] 2307 26
For convenience the function agglomerateByRanks
is
available, which agglomerates data on all ranks
selected.
By default all available ranks will be used. The output is compatible to
be stored as alternative experiments.
## class: TreeSummarizedExperiment
## dim: 19216 26
## metadata(0):
## assays(1): counts
## rownames(19216): 549322 522457 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(9): family species ... Genus Species
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
## [1] "family" "species" "Kingdom" "Phylum" "Class" "Order" "Family"
## [8] "Genus" "Species"
Constructing a taxonomic tree from taxonomic data stored in
rowData
is quite straightforward and uses mostly functions
implemented in TreeSummarizedExperiment
.
taxa <- rowData(altExp(tse,"Species"))[,taxonomyRanks(tse)]
taxa_res <- resolveLoop(as.data.frame(taxa))
taxa_tree <- toTree(data = taxa_res)
taxa_tree$tip.label <- getTaxonomyLabels(altExp(tse,"Species"))
rowNodeLab <- getTaxonomyLabels(altExp(tse,"Species"), make.unique = FALSE)
altExp(tse,"Species") <- changeTree(altExp(tse,"Species"),
rowTree = taxa_tree,
rowNodeLab = rowNodeLab)
Transformation of count data stored in assays
is also a
main task when work with microbiome data. transformAssay
can be used for this and offers a few choices of available
transformations. A modified object is returned and the transformed
counts are stored in a new assay
.
## [1] "counts"
## [1] "counts"
For more details have a look at the man page
?transformAssay
.
Sub-sampling to equal number of counts per sample. Also known as rarefying.
data(GlobalPatterns, package = "mia")
tse.subsampled <- rarefyAssay(GlobalPatterns,
sample = 60000,
name = "subsampled",
replace = TRUE,
seed = 1938)
tse.subsampled
## class: TreeSummarizedExperiment
## dim: 12444 25
## metadata(1): rarefyAssay_sample
## assays(2): counts subsampled
## rownames(12444): 549322 255340 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(25): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (12444 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
Alternatively, one can save both original TreeSE and subsampled TreeSE within a MultiAssayExperiment object.
library(MultiAssayExperiment)
mae <- MultiAssayExperiment(c("originalTreeSE" = GlobalPatterns,
"subsampledTreeSE" = tse.subsampled))
mae
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] originalTreeSE: TreeSummarizedExperiment with 19216 rows and 26 columns
## [2] subsampledTreeSE: TreeSummarizedExperiment with 12444 rows and 25 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
## class: TreeSummarizedExperiment
## dim: 12444 25
## metadata(1): rarefyAssay_sample
## assays(2): counts subsampled
## rownames(12444): 549322 255340 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(25): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (12444 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
In the field of microbiome ecology several indices to describe samples and community of samples are available. In this vignette we just want to give a very brief introduction.
Functions for calculating alpha and beta diversity indices are
available. Using addAlpha
multiple diversity indices are
calculated by default and results are stored automatically in
colData
. Selected indices can be calculated individually by
setting index = "shannon"
for example.
## [1] "shannon"
Beta diversity indices are used to describe inter-sample connections.
Technically they are calculated as dist
object and reduced
dimensions can be extracted using cmdscale
. This is wrapped
up in the runMDS
function of the scater
package, but can be easily used to calculated beta diversity indices
using the established functions from the vegan
package or
any other package using comparable inputs.
library(scater)
altExp(tse,"Genus") <- runMDS(altExp(tse,"Genus"),
FUN = vegan::vegdist,
method = "bray",
name = "BrayCurtis",
ncomponents = 5,
assay.type = "counts",
keep_dist = TRUE)
JSD and UniFrac are implemented in mia
as well.
getJSD
can be used as a drop-in replacement in the example
above (omit the method
argument as well) to calculate the
JSD. For calculating the UniFrac distance via getUniFrac
either a TreeSummarizedExperiment
must be used or a tree
supplied via the tree
argument. For more details see
?getJSD
, ?getUnifrac
or
?getDPCoA
.
runMDS
performs the decomposition. Alternatively
addNMDS
can also be used.
estimateDominance
and estimateEvenness
implement other sample-wise indices. The function behave equivalently to
estimateDiversity
. For more information see the
corresponding man pages.
To make migration and adoption as easy as possible several utility functions are available.
Functions to load data from biom
files,
QIIME2
output, DADA2
objects (Callahan et al. 2016) or phyloseq
objects are available.
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 58 taxa and 3 samples ]
## phy_tree() Phylogenetic Tree: [ 58 tips and 57 internal nodes ]
## class: TreeSummarizedExperiment
## dim: 58 3
## metadata(0):
## assays(1): counts
## rownames(58): 59_8_22 59_5_13 ... 65_9_9 59_2_6
## rowData names(0):
## colnames(3): B C D
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (58 rows)
## rowTree: 1 phylo tree(s) (58 leaves)
## colLinks: NULL
## colTree: NULL
For more details have a look at the man page, for examples
?convert
.
Row-wise or column-wise assay data subsetting.
## CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
## 522457 0 0 0 0 0 0 0 0 0 0
## LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2 NP3 NP5 TRRsed1 TRRsed2
## 522457 0 0 0 2 6 0 0 0 0 0
## TRRsed3 TS28 TS29 Even1 Even2 Even3
## 522457 0 0 0 0 0 0
## CC1
## 549322 0
## 522457 0
## 951 0
## 244423 0
## 586076 0
## 246140 0
getTop
returns a vector of the most top
abundant feature IDs.
data(esophagus, package = "mia")
top_taxa <- getTop(esophagus,
method = "mean",
top = 5,
assay.type = "counts")
top_taxa
## [1] "59_2_6" "59_7_6" "59_8_22" "59_5_19" "65_6_2"
To generate tidy data as used and required in most of the tidyverse,
meltAssay
can be used. A data.frame
in the
long format will be returned.
## # A tibble: 499,616 × 18
## FeatureID SampleID counts Kingdom Phylum Class Order Family Genus Species
## <fct> <fct> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 549322 CL3 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 2 549322 CC1 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 3 549322 SV1 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 4 549322 M31Fcsw 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 5 549322 M11Fcsw 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 6 549322 M31Plmr 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 7 549322 M11Plmr 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 8 549322 F21Plmr 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 9 549322 M31Tong 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## 10 549322 M11Tong 0 Archaea Crenarcha… Ther… <NA> <NA> <NA> <NA>
## # ℹ 499,606 more rows
## # ℹ 8 more variables: X.SampleID <fct>, Primer <fct>, Final_Barcode <fct>,
## # Barcode_truncated_plus_T <fct>, Barcode_full_length <fct>,
## # SampleType <fct>, Description <fct>, shannon <dbl>
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] phyloseq_1.51.0 scater_1.35.0
## [3] ggplot2_3.5.1 scuttle_1.17.0
## [5] mia_1.15.6 TreeSummarizedExperiment_2.15.0
## [7] Biostrings_2.75.2 XVector_0.47.0
## [9] SingleCellExperiment_1.29.1 MultiAssayExperiment_1.33.1
## [11] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [13] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [15] IRanges_2.41.2 S4Vectors_0.45.2
## [17] BiocGenerics_0.53.3 generics_0.1.3
## [19] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 rstudioapi_0.17.1
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] ggbeeswarm_0.7.2 nloptr_2.1.1
## [7] rmarkdown_2.29 fs_1.6.5
## [9] zlibbioc_1.52.0 vctrs_0.6.5
## [11] multtest_2.63.0 minqa_1.2.8
## [13] DelayedMatrixStats_1.29.0 base64enc_0.1-3
## [15] htmltools_0.5.8.1 S4Arrays_1.7.1
## [17] BiocNeighbors_2.1.2 Rhdf5lib_1.29.0
## [19] rhdf5_2.51.1 SparseArray_1.7.2
## [21] Formula_1.2-5 sass_0.4.9
## [23] bslib_0.8.0 htmlwidgets_1.6.4
## [25] plyr_1.8.9 DECIPHER_3.3.1
## [27] sandwich_3.1-1 zoo_1.8-12
## [29] cachem_1.1.0 buildtools_1.0.0
## [31] igraph_2.1.2 iterators_1.0.14
## [33] lifecycle_1.0.4 pkgconfig_2.0.3
## [35] rsvd_1.0.5 Matrix_1.7-1
## [37] R6_2.5.1 fastmap_1.2.0
## [39] GenomeInfoDbData_1.2.13 digest_0.6.37
## [41] colorspace_2.1-1 irlba_2.3.5.1
## [43] Hmisc_5.2-1 vegan_2.6-8
## [45] beachmat_2.23.4 fansi_1.0.6
## [47] mgcv_1.9-1 httr_1.4.7
## [49] abind_1.4-8 compiler_4.4.2
## [51] withr_3.0.2 htmlTable_2.4.3
## [53] backports_1.5.0 BiocParallel_1.41.0
## [55] viridis_0.6.5 DBI_1.2.3
## [57] MASS_7.3-61 DelayedArray_0.33.3
## [59] biomformat_1.35.0 bluster_1.17.0
## [61] permute_0.9-7 tools_4.4.2
## [63] vipor_0.4.7 foreign_0.8-87
## [65] beeswarm_0.4.0 ape_5.8-1
## [67] nnet_7.3-19 glue_1.8.0
## [69] rhdf5filters_1.19.0 nlme_3.1-166
## [71] grid_4.4.2 checkmate_2.3.2
## [73] ade4_1.7-22 cluster_2.1.8
## [75] reshape2_1.4.4 lpSolve_5.6.23
## [77] gtable_0.3.6 mediation_4.5.0
## [79] tidyr_1.3.1 data.table_1.16.4
## [81] BiocSingular_1.23.0 ScaledMatrix_1.15.0
## [83] utf8_1.2.4 foreach_1.5.2
## [85] ggrepel_0.9.6 pillar_1.9.0
## [87] stringr_1.5.1 yulab.utils_0.1.8
## [89] splines_4.4.2 dplyr_1.1.4
## [91] treeio_1.31.0 lattice_0.22-6
## [93] survival_3.7-0 tidyselect_1.2.1
## [95] DirichletMultinomial_1.49.0 maketools_1.3.1
## [97] knitr_1.49 gridExtra_2.3
## [99] xfun_0.49 rbiom_1.0.3
## [101] stringi_1.8.4 UCSC.utils_1.3.0
## [103] lazyeval_0.2.2 yaml_2.3.10
## [105] boot_1.3-31 evaluate_1.0.1
## [107] codetools_0.2-20 tibble_3.2.1
## [109] BiocManager_1.30.25 cli_3.6.3
## [111] RcppParallel_5.1.9 rpart_4.1.23
## [113] munsell_0.5.1 jquerylib_0.1.4
## [115] Rcpp_1.0.13-1 parallel_4.4.2
## [117] sparseMatrixStats_1.19.0 lme4_1.1-35.5
## [119] slam_0.1-55 decontam_1.27.0
## [121] viridisLite_0.4.2 mvtnorm_1.3-2
## [123] tidytree_0.4.6 scales_1.3.0
## [125] purrr_1.0.2 crayon_1.5.3
## [127] rlang_1.1.4